analyzing "count" data

dweinber · Jan 20, 2022 04:24 PM

We did a study in which we collected data on publications in my field over a five-year period. For each publication we completed a 24-element checklist. The spreadsheet has one column with integer values ranging from 0 to 24. I would like to compare how publications scored on the checklist by publication year (coded in a different column). What is an appropriate statistic for such a comparison? I'm getting mixed advice on this.

dale_lehman · Jan 20, 2022 06:27 PM

I would need some more information. Are the 24 items separate measurements or are they somehow combined into a single score? In other words, are you comparing 24 different measures over time or a single measure over time? As for methods, in either case you could try ANOVA using time (year) as a discrete variable and/or regression, using time as continuous (the former should help you decide if the latter is justified).
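To make the two options concrete, here is a minimal sketch in plain Python (not JMP) on made-up numbers. The data and the helper names `mean_by_year` and `linear_trend` are hypothetical. The per-year means are what an ANOVA with year as a discrete variable compares; the fitted line is what a regression with year as continuous assumes. Printing them side by side is one rough way to judge whether the straight line is a fair summary.

```python
from collections import defaultdict

def mean_by_year(years, scores):
    """Mean score per year: what a one-way ANOVA compares (year as discrete)."""
    groups = defaultdict(list)
    for year, score in zip(years, scores):
        groups[year].append(score)
    return {year: sum(vals) / len(vals) for year, vals in sorted(groups.items())}

def linear_trend(years, scores):
    """Least-squares slope and intercept of score on year (year as continuous)."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: two publications per year over five years.
years = [2017, 2017, 2018, 2018, 2019, 2019, 2020, 2020, 2021, 2021]
scores = [18, 20, 19, 21, 21, 22, 22, 23, 23, 24]

means = mean_by_year(years, scores)
slope, intercept = linear_trend(years, scores)

# Compare each year's observed mean with the value the fitted line predicts.
for year, observed in means.items():
    print(year, observed, round(intercept + slope * year, 2))
```

If the observed means wander far from the fitted values, treating year as continuous is probably not justified.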

SDF1 · Jan 21, 2022 03:18 PM

Hi @dweinber ,

Can you share a screenshot of how your data table is structured? This might help to provide some more specific details on how to analyze it as you might need to split or stack your data, depending on how you've structured it.

But I would agree with doing an ANOVA where you use the publication score as your Y value and publication year as the X. It might be easier to treat year as an ordinal variable instead of continuous. If you leave year continuous, Fit Y by X will give you a bivariate fit instead of a one-way ANOVA on Score.

In the example below, I created a fake data table with a column called Score, which is a random normal number with a mean of 12 and a standard deviation of 3, and a Pub Year column, which is a random integer from 2000 to 2020, just to give an idea. Then you do a Fit Y by X, cast Score as Y and Pub Year as X, and you can run an ANOVA of Score by Year (click on the red hot button next to Oneway... to get your statistical options to review). Make sure you understand which tests are relevant for the ANOVA, and don't just take the output as telling the whole story.
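For readers without JMP at hand, the same fake table and a one-way ANOVA can be sketched in plain Python. This is only an illustration under assumptions: the row count (500), the seed, and the helper name `one_way_anova` are made up here, and the code computes the F statistic and its degrees of freedom by hand, not via the Oneway platform.

```python
import random
from collections import defaultdict

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of {level: [values]}."""
    all_values = [v for vals in groups.values() for v in vals]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    ss_between = 0.0
    ss_within = 0.0
    for vals in groups.values():
        group_mean = sum(vals) / len(vals)
        # Variation of group means around the grand mean.
        ss_between += len(vals) * (group_mean - grand_mean) ** 2
        # Variation of observations around their own group mean.
        ss_within += sum((v - group_mean) ** 2 for v in vals)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

random.seed(1)  # arbitrary seed, only so the sketch is repeatable
score_by_year = defaultdict(list)
for _ in range(500):  # assumed row count
    pub_year = random.randint(2000, 2020)            # random integer year
    score_by_year[pub_year].append(random.gauss(12, 3))  # mean 12, stdev 3

f_stat, df_b, df_w = one_way_anova(score_by_year)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A p-value would come from the F distribution with those degrees of freedom; that step is left out to keep the sketch free of external libraries.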

Hope this helps!

DS

dweinber · Jan 21, 2022 04:22 PM

Thanks for the great replies. My data are structured the same as your "fake" table. The main difference between your data and mine is that in mine "score" ("count" in my table) is an integer count with a defined range of 0 to 24. The counts are closely clustered near the top of the range. I didn't think ANOVA would be a good choice because the data are discrete and not normally distributed.

dale_lehman · Jan 22, 2022 10:06 AM

I don't worry too much about the data being normally distributed, but if your count data is clustered near the top, then ANOVA, which compares the means across years, could be misleading. I'd suggest transforming your count data so that the differences over time emerge more clearly. Depending on how the count data is distributed, you could even try a discrete binning of the count data. I would always recommend looking at the distribution of the count data to inform whether a transformation is appropriate and, if so, what sort of transformation (logs, bins, etc.) makes sense. I agree you should treat year as an ordinal variable, unless the mean count seems monotonic over time; if so, then continuous might work better.
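As a rough sketch of the "look at the distribution, then bin" idea, here is a small plain-Python example on made-up counts. The cut points (20 and 23), the data, and the helper names `bin_count` and `crosstab` are hypothetical; sensible cut points depend on what the real distribution looks like.

```python
from collections import Counter, defaultdict

def bin_count(count, cuts=(20, 23)):
    """Map a 0-24 checklist count to an ordered bin label.

    The cut points are hypothetical; pick them after inspecting the data.
    """
    low, high = cuts
    if count < low:
        return f"0-{low - 1}"
    if count < high:
        return f"{low}-{high - 1}"
    return f"{high}-24"

def crosstab(years, counts, cuts=(20, 23)):
    """Tabulate how many publications fall in each bin, per year."""
    table = defaultdict(Counter)
    for year, count in zip(years, counts):
        table[year][bin_count(count, cuts)] += 1
    return {year: dict(table[year]) for year in sorted(table)}

# Hypothetical counts clustered near the top of the 0-24 range.
years = [2017, 2017, 2017, 2018, 2018, 2018, 2019, 2019, 2019]
counts = [18, 21, 22, 20, 23, 24, 22, 24, 24]

# Step 1: frequency of each raw count, to see where the values pile up.
print(sorted(Counter(counts).items()))

# Step 2: binned counts by year, to see whether the mix shifts over time.
for year, row in crosstab(years, counts).items():
    print(year, row)
```

The resulting year-by-bin table can then be compared across years instead of comparing raw means.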
