Choosing the RNA-Seq modeling parameter

LouisAltamura · Jun 8, 2023 5:45 PM

JMP Genomics provides three options for the ANOVA modeling parameter in the RNA-Seq Basic Analysis workflow: Poisson, Generalized Poisson, and Negative Binomial. How does one decide among these options? Also, I can never seem to get the Negative Binomial to work. I get the error below on various data sets:

ERROR: NEWRAP Optimization cannot be completed

Chris_Kirchberg · Feb 8, 2022 12:30 PM

Hi @LouisAltamura ,

I take it that you are choosing Count for the Model Data As: option. The default is Poisson. Take a quick look at the distribution of the data; that will help guide which choice to make.

If the data are overdispersed (more variance than a Poisson model would expect, i.e. variance greater than the mean), then one of the other two is more appropriate.
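
As a rough way to take that "quick look", here is a minimal sketch in Python (not something JMP Genomics runs; the function name and the toy counts are made up for illustration). It computes the per-gene variance-to-mean ratio, which should sit near 1 for Poisson-like counts and well above 1 for overdispersed ones. It assumes the columns are replicates from a single experimental group, since real differences between groups would inflate the variance.

```python
import numpy as np

def dispersion_index(counts):
    """Per-gene variance-to-mean ratio; rows are genes, columns are samples."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)  # sample variance across replicates
    # Genes with a mean of zero have no defined ratio, so report NaN for them.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(mean > 0, var / mean, np.nan)

# Hypothetical counts for three genes across four replicates of one group.
counts = np.array([
    [10, 12, 9, 11],   # tight counts, variance below the mean
    [5, 40, 2, 60],    # highly variable counts, variance far above the mean
    [0, 0, 0, 0],      # never detected
])
ratio = dispersion_index(counts)
overdispersed = ratio > 1  # NaN compares as False, so undetected genes are not flagged
print(ratio, overdispersed)
```

If most expressed genes come out well above 1, that would point away from plain Poisson.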

Here is an article that describes when Negative Binomial is a better choice for count data: Poisson or Negative Binomial? Using Count Model Diagnostics to Select a Model - The Analysis Factor

Generalized Poisson might be a better fit for overdispersed data with many (excessive) zeros. Here is a quick review on this topic: Can Generalized Poisson model replace any other count data models? An evaluation - ScienceDirect

If you think there are a lot of genes that have zeros for counts (no expression or undetectable expression), then Generalized Poisson is probably a better choice. Otherwise Poisson would be OK.

As for the error you get when choosing Negative Binomial, I get the same error. Please report this to support@jmp.com.

Hope that helps a little,

Chris Kirchberg, M.S.²
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com

LouisAltamura · Feb 8, 2022 12:41 PM

Thanks Chris ...

I definitely have many genes with zero or no counts in my RNA-Seq data set. When using the Poisson, I typically check the Filter Data for Modeling option with a cutoff of 0.2. With this approach, I am losing maybe 25% of mapped genes across my sample set.

Would you recommend not filtering zero count genes and then using the Generalized Poisson?

-Lou

Chris_Kirchberg · Feb 8, 2022 12:53 PM

Hi Lou,

Hmm... not sure which is most appropriate. Filtering the row (gene) removes it completely. Leaving it in allows it to be used for the samples that do have counts. It may be important to keep those genes: if expression changes from 0 to some number, it has changed and should be kept in the analysis, even if the fold change looks enormous (0 to 20 is a huge fold change on paper, since a ratio against zero is undefined, but it may not be a large change in the number of transcripts produced). Those genes are more likely a case of off vs. on.
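
To make the off vs. on point concrete, here is a small sketch of a log2 fold change with a pseudocount. The pseudocount of 1 is an assumption chosen for illustration (it is not necessarily what JMP Genomics does); it just keeps the ratio defined when one side is zero.

```python
import math

def log2_fold_change(before, after, pseudocount=1.0):
    """log2 ratio of two counts; the pseudocount (an assumed value) avoids dividing by zero."""
    return math.log2((after + pseudocount) / (before + pseudocount))

# Same absolute change of 20 counts, very different fold changes.
off_to_on = log2_fold_change(0, 20)        # gene switching on from zero
high_to_high = log2_fold_change(1000, 1020)  # already highly expressed gene
print(off_to_on, high_to_high)
```

The first value is large and the second is close to zero, even though both genes gained the same 20 counts.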

I think I would leave the zeros in and use Generalized Poisson, but others may have more appropriate recommendations.


LouisAltamura · Feb 8, 2022 12:57 PM

Ok thanks. I've definitely been wrestling with this, and others have questioned the approach as well. Let me try to run the Generalized Poisson +/- filtering and see how it looks.

LouisAltamura · Feb 8, 2022 02:06 PM

So I took one of my data sets, and these are the numbers of differentially expressed genes that I obtained:

Generalized Poisson, no filtering = 127
Generalized Poisson, with filtering = 226
Poisson, no filtering = 316
Poisson, with filtering = 380

As you move down the bulleted list above, each method includes all genes in the previous method.

It does not appear that including the many thousands of mostly zero count genes helps anything. The Generalized Poisson did not capture any of those not expressed to highly expressed genes as being significant. I can see them in the volcano plots, but they essentially have adjusted p values of zero.

Chris_Kirchberg · Feb 8, 2022 03:19 PM

Thanks Lou,

I would expect the switch from Generalized Poisson to Poisson to yield more differentially expressed genes, given that Generalized Poisson is a "correction" of sorts for overdispersion (more variance than "expected"). The filtering result does surprise me a little, but as I think about it, there are fewer genes (fewer statistical tests), so the FDR correction (adjusted p-value) may not be as stringent, and thus more genes are called significant at that particular cutoff.
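
Here is a toy sketch of that FDR effect, assuming a Benjamini-Hochberg adjustment (I have not confirmed which FDR method the workflow applies, and the p-values below are invented). The same four small p-values are adjusted once alone and once alongside 96 uninformative tests standing in for the mostly-zero genes.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: each adjusted value is the minimum over all larger ranks.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

signal = np.array([0.001, 0.004, 0.012, 0.03])  # hypothetical genes with real changes
uninformative = np.full(96, 0.9)                # hypothetical mostly-zero genes

filtered = bh_adjust(signal)
unfiltered = bh_adjust(np.concatenate([signal, uninformative]))

n_filtered = int((filtered < 0.05).sum())
n_unfiltered = int((unfiltered[:4] < 0.05).sum())
print(n_filtered, n_unfiltered)
```

In this toy case all four genes pass at 0.05 when tested alone and none pass when the extra 96 tests are included, which is the direction of the change Lou saw with filtering.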

As you have noted, each method's gene list is included in the next method's gene list (it is a subset of the list from the method below it). This is due to the progressive nature of the filtering and corrections.

On a more practical note, if you are screening through a list and looking for subsets of genes to follow up on, the largest list is likely to include those genes that are on the border of being "statistically significant" for that p-value threshold and still might be of interest if the differences are large enough to be useful/meaningful. That is, something with a p-value of 0.08 (or 0.1) might not make the cutoff of 0.05 but still have a 2-fold difference in expression, which could change a cell's behavior. It may be that there is too much unexplained variability between the samples to be very confident of that difference, or that something unknown is competing with the experimental design (you would need to look at the variation of expression within each group for that gene to know more).

I would definitely include any genes that are on the border in my list of ones to investigate further, if it can be afforded. It all depends on the purpose of the experiment.


Chris_Kirchberg · Sep 2, 2022 01:52 PM

Reply Deleted, see previous reply


LouisAltamura · Feb 8, 2022 12:50 PM

How do you recommend assessing dispersion on an RNA-Seq data set?

Chris_Kirchberg · Feb 8, 2022 03:33 PM

That question has been asked for years. I did find a journal article from 2013 that discusses this topic, but it is rich in equations, which might be a little much to read.

Zhang_TOBIOIJ (openbioinformaticsjournal.com)
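
For a quick first pass that avoids the equations, one common simple option is a method-of-moments estimate under the negative binomial assumption that variance = mean + alpha * mean^2. The sketch below is not taken from the article above and is not JMP's method; the function name and counts are hypothetical, and it assumes the columns are replicates within one group.

```python
import numpy as np

def moment_dispersion(counts):
    """Per-gene method-of-moments estimate of alpha in var = mean + alpha * mean**2."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)
    alpha = np.full(mean.shape, np.nan)  # undefined for genes with no counts
    expressed = mean > 0
    alpha[expressed] = (var[expressed] - mean[expressed]) / mean[expressed] ** 2
    # A negative estimate means no evidence of extra-Poisson variance; floor it at 0.
    return np.maximum(alpha, 0.0)

# Hypothetical counts: rows are genes, columns are replicates of one group.
counts = np.array([
    [10, 12, 9, 11],
    [5, 40, 2, 60],
    [0, 0, 0, 0],
])
alpha = moment_dispersion(counts)
print(alpha)
```

An alpha near 0 is consistent with Poisson, while clearly positive values suggest overdispersion. With only a few replicates per group these per-gene estimates are very noisy, so treat them as a rough guide rather than a test.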

