Choosing the RNA-Seq modeling parameter

JMP Genomics provides three options for the ANOVA modeling parameter in the RNA-Seq Basic Analysis workflow: Poisson, Generalized Poisson, and Negative Binomial. How does one decide among these options? Also, I can never seem to get the Negative Binomial to work. I get the error below on various data sets:

 

ERROR: NEWRAP Optimization cannot be completed


Re: Choosing the RNA-Seq modeling parameter

Hi @LouisAltamura ,

I take it that you are choosing Count for the Model Data As: option. The default is Poisson. Take a quick look at the distribution of the data; that will help guide which choice to make.

 

If the data is overdispersed then one of the other two is more appropriate.

 

Here is an article that describes when Negative Binomial is a better choice for count data: Poisson or Negative Binomial? Using Count Model Diagnostics to Select a Model - The Analysis Factor

 

Generalized Poisson might be a better fit for overdispersed data with many (excessive) zeros. Here is a quick review on this topic: Can Generalized Poisson model replace any other count data models? An evaluation - ScienceDirect

 

If you think there are a lot of genes that have zeros for counts (no expression or undetectable expression), then Generalized Poisson is probably a better choice. Otherwise Poisson would be OK.
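
For a quick look at the two properties mentioned here (overdispersion and excess zeros), a rough per-gene summary can be sketched as below. This is only an illustration in plain Python, not something JMP Genomics runs: the gene names and counts are made up, and it ignores library-size normalization and treatment groups. Under a Poisson model the variance-to-mean ratio should be near 1; values well above 1 suggest overdispersion.

```python
from statistics import mean, variance

def describe_gene(counts):
    """Return (mean, variance-to-mean ratio, fraction of zeros) for one gene."""
    m = mean(counts)
    # Sample variance; under a Poisson model it should be close to the mean.
    v = variance(counts)
    ratio = v / m if m > 0 else float("nan")
    zero_fraction = sum(1 for c in counts if c == 0) / len(counts)
    return m, ratio, zero_fraction

# Hypothetical counts for three genes across six samples (illustrative only).
genes = {
    "gene_a": [10, 14, 7, 12, 6, 11],   # variance close to the mean
    "gene_b": [0, 50, 3, 120, 0, 7],    # variance far above the mean
    "gene_c": [0, 0, 0, 0, 0, 20],      # mostly zeros
}

summary = {name: describe_gene(c) for name, c in genes.items()}
for name, (m, ratio, zf) in summary.items():
    print(f"{name}: mean={m:.1f} var/mean={ratio:.2f} zeros={zf:.0%}")
```

If most genes look like gene_a, Poisson is probably fine; many genes like gene_b or gene_c point toward one of the other two options.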

 

As for the error you get when choosing Negative Binomial, I get the same error. Please report this to support@jmp.com.

 

Hope that helps a little,

Chris Kirchberg, M.S.2
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com

Re: Choosing the RNA-Seq modeling parameter

Thanks Chris ...

 

I definitely have many genes with zero or no counts in my RNA-Seq data set. When using Poisson, I typically check the Filter Data for Modeling option with a cutoff of 0.2. With this approach, I am losing maybe 25% of mapped genes across my sample set.

 

Would you recommend not filtering zero count genes and then using the Generalized Poisson?

-Lou

Re: Choosing the RNA-Seq modeling parameter

Hi Lou,

Hmm... I'm not sure which is most appropriate. Filtering a row (gene) removes it completely. Leaving it in allows it to be used for the samples that do have counts. It may be important to keep those genes: if expression changes from 0 to some number, the gene has changed and should be kept in the analysis, even though the fold change would look enormous (strictly, 0 to 20 has no finite fold change, yet it may not be a large change in the number of transcripts produced). So those genes are more likely off vs. on.
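
To make the off-vs-on point concrete, here is a small sketch of a log2 fold change with a pseudocount. The pseudocount of 1 is an assumption chosen for illustration (a common convention, not necessarily what JMP Genomics does), and the function name is hypothetical. It shows that the same absolute change of 20 counts looks huge when starting from zero and modest when starting from 100.

```python
import math

def log2_fold_change(before, after, pseudocount=1.0):
    """log2 ratio of two counts, with a pseudocount so zeros stay finite."""
    return math.log2((after + pseudocount) / (before + pseudocount))

# Same absolute change (20 counts), very different fold changes.
off_to_on = log2_fold_change(0, 20)      # log2(21 / 1)
expressed = log2_fold_change(100, 120)   # log2(121 / 101)

print(f"0 -> 20:    log2FC = {off_to_on:.2f}")
print(f"100 -> 120: log2FC = {expressed:.2f}")
```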

 

I think I would leave the zeros in and use Generalized Poisson, but others may have more appropriate recommendations.

Chris Kirchberg, M.S.2

Re: Choosing the RNA-Seq modeling parameter

OK, thanks. I've definitely been wrestling with this, and others have questioned the approach as well. Let me try to run the Generalized Poisson with and without filtering and see how it looks.

Re: Choosing the RNA-Seq modeling parameter

So I took one of my data sets, and these are the numbers of differentially expressed genes that I obtained:

  • Generalized Poisson, no filtering = 127
  • Generalized Poisson, with filtering = 226
  • Poisson, no filtering = 316
  • Poisson, with filtering = 380

As you move down the bulleted list above, each method includes all genes in the previous method.

It does not appear that including the many thousands of mostly zero count genes helps anything. The Generalized Poisson did not capture any of those not expressed to highly expressed genes as being significant. I can see them in the volcano plots, but they essentially have adjusted p values of zero.

Re: Choosing the RNA-Seq modeling parameter

Thanks Lou,

 

I would expect the switch from Generalized Poisson to Poisson to yield more differentially expressed genes, given that Generalized Poisson is a "correction" of sorts for overdispersion (more variance than "expected"). The filtering result does surprise me a little, but as I think about it, there are fewer genes (fewer statistical tests), so the FDR correction (adjusted p-value) may not be as stringent, and thus more genes are called significant at that particular cutoff.
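
The effect of the number of tests on the adjusted p-values can be illustrated with a small sketch. It assumes a Benjamini-Hochberg style FDR adjustment (the thread does not say which correction was used), and the p-values are invented: three promising genes plus a block of uninformative ones, with filtering assumed to remove mostly the uninformative genes.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values: 3 promising genes plus uninformative ones.
promising = [0.001, 0.004, 0.012]
filtered_set = promising + [0.5] * 7      # 10 tests after filtering
unfiltered_set = promising + [0.5] * 97   # 100 tests without filtering

adj_filtered = bh_adjust(filtered_set)[:3]
adj_unfiltered = bh_adjust(unfiltered_set)[:3]

print("filtered:  ", [round(p, 3) for p in adj_filtered])
print("unfiltered:", [round(p, 3) for p in adj_unfiltered])
```

With the same raw p-values, the three genes pass a 0.05 cutoff in the 10-test set but not in the 100-test set, which matches the direction of the filtering result.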

 

As you have noted, each method's gene list is contained in the next method's gene list (it is a subset of the list below it). This is due to the progressive nature of the filtering and corrections.

 

On a more practical note, if you are screening through a list and looking for subsets of genes to follow up on, the largest list is likely to include genes that are on the border of being "statistically significant" for that p-value threshold and might still be of interest if the differences are large enough to be useful or meaningful. That is, something with a p-value of 0.08 (or 0.1) might not make the cutoff of 0.05 but still have a 2-fold difference in expression, which could change a cell's behavior. It may be that there is too much unexplained variability between the samples to be very confident of that difference, or that something unknown is competing with the experimental design (you would need to look at the variation of expression within each group for that gene to know more).

 

I would definitely include any genes that are on the border in my list of ones to investigate further, if it can be afforded. It all depends on the purpose of the experiment.

Chris Kirchberg, M.S.2

Re: Choosing the RNA-Seq modeling parameter

How do you recommend assessing dispersion on an RNA-Seq data set?

Re: Choosing the RNA-Seq modeling parameter

That question has been asked for years. I did find a journal article from 2013 that discusses this topic, but it is rich in equations, which might be a little much to read.

 

Zhang_TOBIOIJ (openbioinformaticsjournal.com)
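
As a much simpler starting point than the article, one rough way to gauge dispersion is a per-gene method-of-moments estimate, sketched below. It assumes the common negative binomial parameterization var = mu + alpha * mu^2, uses made-up counts, and skips library-size normalization. Ideally it would be run within a single treatment group so that real expression differences are not counted as extra variance. The function name and data are hypothetical.

```python
from statistics import mean, median, variance

def nb_dispersion_mom(counts):
    """Method-of-moments alpha in var = mu + alpha * mu**2; None if mean is 0."""
    m = mean(counts)
    if m == 0:
        return None
    # Negative estimates mean no evidence of overdispersion; clamp to 0.
    return max(0.0, (variance(counts) - m) / m**2)

# Hypothetical count matrix: one list of per-sample counts per gene.
count_matrix = [
    [10, 14, 7, 12, 6, 11],
    [0, 50, 3, 120, 0, 7],
    [200, 340, 150, 410, 275, 125],
    [0, 0, 0, 0, 0, 0],   # all-zero gene, skipped
]

alphas = [a for a in map(nb_dispersion_mom, count_matrix) if a is not None]
print("per-gene alpha:", [round(a, 3) for a in alphas])
print("median alpha:", round(median(alphas), 3))
```

An alpha near 0 for most genes is consistent with Poisson; a clearly positive typical alpha suggests an overdispersed model is worth trying.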

Chris Kirchberg, M.S.2