cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Choose Language Hide Translation Bar
l_yampolsky
Level III

Generalized linear model or Nominal Logistic for RNAseq data

Colleagues,

Imagine I have RNAseq data in the form of reads count measured in 2 environments, with 3 replicates each, in a single sequencing experiment. Like this:

Env reads

A         9

A       10

A       11

B        4

B        5

B        6

Of course I can treat each reads count as an independent variable and do a t test and I will have very low power. Instead, I want to be able to treat each read as an observation and then notice that in a single given sequencing effort I got 15 reads in samples from environment B, but 30 reads in samples from environment A. This is highly unlikely to be observed by chance and my power of detecting differences between environments A and B is way, way higher and will increase with the sequencing effort, unlike that of a t-test. (In a typical RAseq experiment number of reads is much higher, for some genes often in hundreds per sample; 2fold count difference will be very significant).

       I want to be able to do this contingency test with a more complex model, perhaps with 3 different independent variables (and replicate as a block effect).

      The easy solution is then this: I create an additional categorical variable, lets call it "Model" with values Observed and Expected, with "Observed" corresponding to the lines above and "Expected" in equal number of lines, but with the column :reads containing the expected count if no effect is significant, i.e., in, this case, average read count (assuming equal effort in each replicate).

     And then run this:

Fit Model( Freq( :reads ), Y( :Model ), Effects( :Env, :rep ), Personality( Nominal Logistic ), Run( Likelihood Ratio Tests( 1 ), Wald Tests( 0 ) ) )

...or this (which gives me identical results, I don't quite understand why, but I am happy with that):

Fit Model( Freq( :reads ), Y( :Model ), Effects( :Env, :rep ), Personality( Generalized Linear Model ), GLM Distribution( Binomial ), Link Function( Logit ), Overdispersion Tests and Intervals( 0 ), Name( "Firth Bias-adjusted Estimates" )(0), Run )

 

Am I doing something ridiculously wrong or does this make sense as a valid test?

If it makes sense, is there perhaps a way to achieve the same without the need to create additional lines with expected counts?

 

Thanks in advance!!

 

L.Yampolsky

8 REPLIES 8
Phil_Kay
Staff

Re: Generalized linear model or Nominal Logistic for RNAseq data

It might help if you attach an example data table with your different analyses. I find that much easier to understand than trying to imagine a data table from a text description.
l_yampolsky
Level III

Re: Generalized linear model or Nominal Logistic for RNAseq data

Here is an example:

Generalized linear model with nominal response example

Data file: Example1Fp52.43_cH8.jmp

 

Expression of a crustacean homolog of human mitochondrial carrier protein Aralar1 was studied in 4 clones of Daphnia exposed to acute hypoxia (3 hypoxia and 3 control replicates). 4 clones x 2 treatments x 3 replicates = 24 rows.

 

First, a 2-way anova was done using RPKMs as the response variable and hypoxia and clone as factors:

 

Fit Model(

            Y( :RPKM ),

            Effects( :Clone, :acuteHypoxia, :Clone * :acuteHypoxia ),

            Personality( Standard Least Squares ),

            Emphasis( Minimal Report ),

            Run(

                        :RPKM << {Lack of Fit( 0 ), Plot Actual by Predicted( 0 ),

                        Plot Regression( 0 ), Plot Residual by Predicted( 0 ),

                        Plot Effect Leverage( 0 )}

            )

)

 

Which results in:

 

Response RPKM

 

Effect Tests

 

Source              Nparm          DF       Sum of Squares         F Ratio            Prob > F        

Clone                          3          3          10222.900     3.6649                        0.0349           

acuteHypoxia             1          1          42430.209     45.6339         <.0001           

Clone*acuteHypoxia 3          3          12083.132     4.3318                        0.0205           

 

OK, things are significant, but the effects of clone and the interaction will not ever survive any kind of multiple test correction (this is just one example of ~6K transcripts in the RNAseq data).

 

Then 24 dummy rows were created with the reads column containing the expected number of reads in case there is no effect of clones or hypoxia and also no difference among the 3 replicates in each combination of clones/hypoxia – i.e. the average number of reads in the actual 24 rows. Dummy rows were labeled “Exp” in the :ObsExp column and the original actual observations were labeled “Obs”. Now we want to know whether, and, if so, due to which independent variables do Obs differ from Exp. To achieve this I run (observe than now the raw :reads, not RPKM’s are used as the Freq’s:

 

Fit Model(

            Freq( :reads ),

            Y( :ObsExp ),

            Effects(

                        :Clone,

                        :acuteHypoxia,

                        :acuteHypoxia * :Clone,

                        :rep[:Clone, :acuteHypoxia]

            ),

            Personality( Nominal Logistic ),

            Run( Likelihood Ratio Tests( 1 ), Wald Tests( 0 ) ),

            SendToReport( Dispatch( {}, "Parameter Estimates", OutlineBox, {Close( 1 )} ) )

)

 

Which results, as expected in way, way more power:

 

Effect Likelihood Ratio Tests

 

Source                             Nparm       DF       L-R ChiSquare           Prob>ChiSq  

Clone                                        3        3          208.563245  <.0001           

acuteHypoxia                          1        1          114.216936  <.0001           

acuteHypoxia*Clone               3        3          54.2042677  <.0001           

rep[Clone,acuteHypoxia]      16       16       92.4764669  <.0001           

 

 

Note: Exact same results are obtained using this:

            Personality( Generalized Linear Model ),

            GLM Distribution( Binomial ),

            Link Function( Logit )

           

Re: Generalized linear model or Nominal Logistic for RNAseq data

@l_yampolsky   Thanks for the question.   This same issue arises in generalized linear mixed models when converting responses from binomial to binary.   It is imperative in that situation to declare the original sample to be a random effect, otherwise results are completely misleading.     

 

Unfortunately this is also the case for your second analysis, as you are effectively assuming every observation is independent when in fact they are clustered per the original sample.    To correct that second analysis, declare the rep effect to be random and you'll also need to change the personality to Standard Least Squares or to Mixed Model if you have JMP Pro and make the response continuous (I would suggest log2 transforming it).    Alternatively, Meichen Dong has an add-in for GLMMs downloadable from here in the community if you want to keep the response Poisson, and she even has an RNASeq example in the doc:   

 

Bottom line, you really only have 16 degrees of freedom for error and you cannot cheat it.

 

You are right to be concerned about multiple testing over the 6K transcripts.  If certain transcripts are known and pre-specified upfront, you can potentially show a single transcript analysis as your first one or modified second one.    For the full set, make a volcano plot and see what you've got.    Another idea is to construct one or more gene or pathway scores (averaging over relevant transcripts) and analyze them.  

Re: Generalized linear model or Nominal Logistic for RNAseq data

Hi @l_yampolsky,

 

To be clear, for each environment you have three samples (replicates). But instead of combining the replicates for each environment and do a t-test like analysis, you want to treat each sample as though the count represents the number of observations of that particular transcript (transcript A) in the population of transcripts from that sample. The next sample from that same environment is like drawing a new set of observations from the same population of transcripts and observing how many are of the same transcript A. Did I understand that correctly?

 

Question: Is each sample from the same individual (i.e. a sample of the population of cells within that individual, thus technical replicates) or are they from different individuals in the same environment (i.e. a sample of each individual and thus representing a sample different populations of cells that have experienced the same environment)?

 

If the samples are from different individuals, I don't think this is valid approach. The counts are the response to some change in environment. It is like counting the number of photons of a particular wavelength from a light source in a test environment and then pooling it with the count of photons of the same wavelength from another, yet different, device of similar light source and also in the same environment. They are not the same light source (or device), but representatives of the same light source and must be treated independently of each other. This would represent device to device variability and must be tracked separately since each device is not an exact copy of the other. If they were, then multiple devices would not be needed for the test since all photons would come from a single source and each sample would be a representative sample of that source.

 

RNAseq is similar in that it is an effort of quantification of entity, or the abundance of a transcript, in a sample. The next sample is not typically a sample from the same source population. It is usually a sample from a different individual and thus must be treated that way. Individual to individual variability. Even samples from the same tissue from the same individual have variability (cells in different states of division, stratification of functionality, etc.) and should not be pooled as though they came from the same population of cells.

 

Ultimately, this all really depends on what the scientific question that is being asked and the experimental design used to discover the answer to that question. You might be able to pool them and do what you want or maybe not. It depends....

Chris Kirchberg, M.S.2
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com
l_yampolsky
Level III

Re: Generalized linear model or Nominal Logistic for RNAseq data

Replicates are from different individuals; biological replicates. And you are absolutely right, pooling them would not be a good idea. However, I can include replicate as a random independent variable into my model, nested within environment treatments, essentially treating this design as a block design - the same experiment repeated 3 times.

 

I am more worrying about the general approach - creating dummy observations with expected "counts" under null hypothesis (i.e., equal counts of reads everywhere) and using a contingency table /nominal logistics model to see if the actual counts differ.

 

I will, as another user suggested, post an example with the actual data.

Re: Generalized linear model or Nominal Logistic for RNAseq data

Thanks for the answers.

 

I don't think you can get around doing the t-test like analysis. Replicates are the source of the variability estimation and degrees of freedom to perform the analysis and does not need to be nested. Nested in JMP terms is this:

 

If the levels of one effect (B) occur only within a single level of another effect (A), then B is said to be nested within A (B[A]).

 

That means you have to have a factor (B) such that a certain level(s) of B level can only be applied within a certain level(s) of Env. So. levels 1 and 2 of factor B would only occur in Env A and levels 3 and 4 of factor B would only occur in Env. B. Replicates are implied to be within the associated level of Env if Env is the model effect. So no need to explicitly assign them into each Env level.

 

JMP will let you do this, but I don't think it is statistically valid or meaningful.

 

Also, putting in replicate as random is also not necessary since it is also implied (the three replicates should be random samples to begin with and it is not allowed in Generalized Linear Model or Generalized Regression).

 

Reads is the Y in this case and I don't think the Freq option is not going to do what you expect it to do (generally avoid that option). Poisson would be the distribution to use Generalized Linear Model or Negative Binomial or Poisson for Generalize Regression (JMP Pro).

 

I take it you will need to do this for 1000s of genes?

 

Thanks for sending an example table. I have created a table with saved scripts for the analysis to get us going on this discussion. I created an additional columns (Replicate and Replicate ID) for use in this discussion.

Chris Kirchberg, M.S.2
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com
l_yampolsky
Level III

Re: Generalized linear model or Nominal Logistic for RNAseq data

I posted an example in the thread above.

Thanks in advance for your valuable input!

 

L.Yampolsky

Re: Generalized linear model or Nominal Logistic for RNAseq data

Thanks, this helps a lot and now makes a little more sense. I will take a look (as well as others, I hope) to do the sanity check of the approach.

Chris Kirchberg, M.S.2
Data Scientist, Life Sciences - Global Technical Enablement
JMP Statistical Discovery, LLC. - Denver, CO
Tel: +1-919-531-9927 ▪ Mobile: +1-303-378-7419 ▪ E-mail: chris.kirchberg@jmp.com
www.jmp.com