Choosing the RNA-Seq modeling parameter
JMP Genomics provides three options for the ANOVA modeling parameter in the RNA-Seq Basic Analysis workflow: Poisson, Generalized Poisson, and Negative Binomial. How does one decide among these options? Also, I can never seem to get the Negative Binomial option to work. I get the error below on various data sets:
ERROR: NEWRAP Optimization cannot be completed
Re: Choosing the RNA-Seq modeling parameter
Hi @LouisAltamura ,
I take it that you are choosing Count for the Model Data As: option. The default is Poisson. Take a quick look at the distribution of the data; that will help guide which choice to make.
If the data are overdispersed (the variance is larger than the mean, which a Poisson model does not allow for), then one of the other two is more appropriate.
Here is an article that describes when Negative Binomial is a better choice for count data: Poisson or Negative Binomial? Using Count Model Diagnostics to Select a Model - The Analysis Factor
Generalized Poisson might be a better fit for overdispersed data with many (excessive) zeros. Here is a quick review on this topic: Can Generalized Poisson model replace any other count data models? An evaluation - ScienceDirect
If you think there are a lot of genes that have zeros for counts (no expression or undetectable expression), then Generalized Poisson is probably a better choice. Otherwise Poisson would be OK.
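One rough way to judge whether a data set has "excessive" zeros is to compare the fraction of zero counts observed for each gene with the fraction a Poisson model with the same mean would predict, exp(-mean). The sketch below is only an illustration in Python, outside of JMP Genomics; the function name, the genes-by-samples layout, and the toy counts are all assumptions, and it ignores library-size normalization and treatment groups.

```python
import numpy as np

def zero_excess(counts):
    """Return (observed, expected) zero fractions per gene.

    counts: 2-D array, rows = genes, columns = samples (assumed layout).
    expected is P(X = 0) = exp(-mean) under a Poisson with the gene's mean.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    observed = (counts == 0).mean(axis=1)
    expected = np.exp(-mean)
    return observed, expected

# Toy counts (made up for illustration): both genes have a mean of 6.
toy = np.array([
    [0, 0, 0, 12, 15, 9],  # half the samples are zero
    [5, 7, 6, 4, 6, 8],    # no zeros
])
observed, expected = zero_excess(toy)
for obs, exp_zero in zip(observed, expected):
    print(f"observed zeros: {obs:.2f}, Poisson expectation: {exp_zero:.4f}")
```

A gene whose observed zero fraction is far above the Poisson expectation, like the first toy gene, is the kind of case where plain Poisson is likely to fit poorly.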
As for the error you get when choosing Negative Binomial, I get the same error. Please report this to support@jmp.com.
Hope that helps a little,
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com
Re: Choosing the RNA-Seq modeling parameter
Thanks Chris ...
I definitely have many genes with zero or no counts in my RNA-Seq data set. When using the Poisson, I typically have the Filter Data for Modeling option checked with a cutoff of 0.2. With this approach, I am losing maybe 25% of mapped genes across my sample set.
Would you recommend not filtering zero count genes and then using the Generalized Poisson?
-Lou
Re: Choosing the RNA-Seq modeling parameter
Hi Lou,
Hmm...not sure which is most appropriate. Filtering the row (gene) removes it completely. Leaving it in allows it to be used for the samples that do have counts. It may be important to keep those genes: if expression goes from 0 to some number, it has changed and should be kept in the analysis, even if the fold change would be enormous (strictly, a change from 0 to 20 counts has no defined fold change because the baseline is zero, yet it may not be a large change in the number of transcripts produced). Those genes are more likely a case of off vs. on.
I think I would leave the zeros in and use Generalized Poisson, but others may have more appropriate recommendations.
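To illustrate the off vs. on point: a common workaround for a zero baseline is to add a small pseudocount before taking the ratio. The Python sketch below is only an illustration; the pseudocount of 1 and the function name are assumptions, not something JMP Genomics is known to do.

```python
import math

def log2_fold_change(before, after, pseudocount=1.0):
    """log2 ratio of two counts, with a pseudocount so a zero baseline is defined."""
    return math.log2((after + pseudocount) / (before + pseudocount))

# A gene switching on from 0 to 20 counts vs. a gene doubling from 1000 to 2000.
lfc_off_on = log2_fold_change(0, 20)
lfc_doubling = log2_fold_change(1000, 2000)

print(f"0 -> 20:      log2FC = {lfc_off_on:.2f}, absolute change = {20 - 0}")
print(f"1000 -> 2000: log2FC = {lfc_doubling:.2f}, absolute change = {2000 - 1000}")
```

The off-to-on gene gets a much larger log2 fold change than the doubling gene even though its absolute change in counts is far smaller, which is why such genes deserve a separate look rather than a ranking by fold change alone.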
Re: Choosing the RNA-Seq modeling parameter
OK, thanks. I've definitely been wrestling with this, and others have questioned the approach as well. Let me try running the Generalized Poisson with and without filtering and see how it looks.
Re: Choosing the RNA-Seq modeling parameter
So I took one of my data sets, and these are the numbers of differentially expressed genes that I obtained:
- Generalized Poisson, no filtering = 127
- Generalized Poisson, with filtering = 226
- Poisson, no filtering = 316
- Poisson, with filtering = 380
As you move down the bulleted list above, each method includes all genes in the previous method.
It does not appear that including the many thousands of mostly zero-count genes helps anything. The Generalized Poisson did not flag any of the genes that go from not expressed to highly expressed as significant. I can see them in the volcano plots, but they essentially have adjusted p values of zero.
Re: Choosing the RNA-Seq modeling parameter
Thanks Lou,
I would expect the switch from Generalized Poisson to Poisson to yield more differentially expressed genes, given that Generalized Poisson is a "correction" of sorts for overdispersion (more variance than "expected"). The effect of filtering does surprise me a little, but as I think about it, there are fewer genes (fewer statistical tests), so the FDR correction (adjusted p-value) may not be as stringent, and thus more genes are called significant at that particular cutoff.
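The effect of the number of tests on the adjusted p-value can be seen with a small Python sketch. It assumes the FDR correction is the Benjamini-Hochberg procedure, which may not be exactly what JMP Genomics applies, and the p-values are invented for illustration: the same three small p-values are adjusted once among 20 tests and once among 100.

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted

signal = np.array([0.001, 0.004, 0.012])               # made-up small p-values
filtered = np.concatenate([signal, np.full(17, 0.9)])    # 20 tests in total
unfiltered = np.concatenate([signal, np.full(97, 0.9)])  # 100 tests in total

n_sig_filtered = int((bh_adjust(filtered) < 0.05).sum())
n_sig_unfiltered = int((bh_adjust(unfiltered) < 0.05).sum())
print(n_sig_filtered, n_sig_unfiltered)
```

With these toy numbers, two of the three genes pass a 0.05 cutoff among 20 tests and none pass among 100, even though their raw p-values are identical in both runs.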
As you have noted, each method's gene list is included in the next method's gene list (it is a subset of the list below it). This is due to the progressive nature of the filtering and corrections.
On a more practical note, if you are screening through a list and looking for subsets of genes to follow up on, the largest list is likely to include genes that are on the border of being "statistically significant" for that p-value threshold and might still be of interest if the differences are large enough to be useful or meaningful. That is, something with a p-value of 0.08 (or 0.1) might not make the cutoff of 0.05 but could have a 2-fold difference in expression, which could change a cell's behavior. It may be that there is too much unexplained variability between the samples to be very confident of that difference, or that something unknown is competing with the experimental design (you would need to look at the variation of expression within each group for that gene to know more).
I would definitely include any genes that are on the border in my list of ones to investigate further, if it can be afforded. It all depends on the purpose of the experiment.
Re: Choosing the RNA-Seq modeling parameter
How do you recommend assessing dispersion on an RNA-Seq data set?
Re: Choosing the RNA-Seq modeling parameter
That question has been asked for years. I did find a journal article from 2013 that discusses this topic, but it is rich in equations, which might be a little much to read.
Zhang_TOBIOIJ (openbioinformaticsjournal.com)
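As a quick first look that avoids the equations, one can compare each gene's variance with its mean across replicate samples. Under a Poisson model the two are equal, so a variance-to-mean ratio well above 1 suggests overdispersion. The Python sketch below is a rough illustration, not the method from the article or from JMP Genomics: the function name and toy counts are assumptions, the samples are assumed to come from a single treatment group with comparable library sizes, and the dispersion value is a simple method-of-moments estimate from the negative binomial relation variance = mean + alpha * mean^2.

```python
import numpy as np

def dispersion_summary(counts):
    """Per-gene mean, variance, variance-to-mean ratio, and moment-based alpha.

    counts: 2-D array, rows = genes, columns = replicate samples (assumed layout).
    Genes with a mean of zero get NaN for the ratio and for alpha.
    """
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=1)
    var = counts.var(axis=1, ddof=1)  # sample variance
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(mean > 0, var / mean, np.nan)
        alpha = np.where(mean > 0, (var - mean) / mean**2, np.nan)
    return mean, var, ratio, alpha

# Toy counts (made up for illustration).
toy = np.array([
    [10, 14, 6, 13, 7, 10],  # variance equals the mean
    [2, 40, 0, 35, 1, 42],   # variance far above the mean
    [0, 0, 0, 0, 0, 0],      # never detected
])
mean, var, ratio, alpha = dispersion_summary(toy)
for m, r, a in zip(mean, ratio, alpha):
    print(f"mean = {m:5.1f}  variance/mean = {r:6.2f}  alpha = {a:5.2f}")
```

Plotting the ratio (or the variance) against the mean for all genes gives an overall picture. If most genes sit well above a ratio of 1, plain Poisson is probably too optimistic for the data set.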