Hello, everyone.
Today I'll be presenting about the application of JMP
to analyze gene expression single- cell RNA data in melanoma cells.
My name is Catherine Zhou,
and I'm a high school student at Lynbrook High School.
As introduction,
melanoma develops in the cells or melanocytes that produce melanin
and is the most aggressive type of skin cancer,
representing 65% of all deaths from skin cancer.
T cells are a type of white blood cells that develop from stem cells
in the bone marrow and mature in the thymus,
where they multiply and differentiate into helper, regulatory, or cytotoxic T cells
or become memory T cells.
Cytotoxic T cells, which are activated by various cytokines,
bind to and kill infected cells and cancer cells.
In melanoma patients, T cells mount an immune response against the tumor,
but at some point, the responder T cells become ineffective
due to a local immunosuppressive process occurring at the tumor sites.
We'd like to identify the causes behind T cell dysfunction.
Alternative splicing is a regulatory process
essential to generate transcriptome diversity.
Misregulation contributes to disease and cancer.
Eukaryotic genes are composed of int ronic and exonic sequences,
as you can see in this diagram.
In the alternative splicing process,
non-coding introns of a gene are selectively removed
and the included exons are combined in the final messenger RNA,
and translation of these different isoforms, where different combinations
result in different proteins and different cellular functions.
Next, traditional bulk sequencing examines the genome of a cell population
such as the cell culture or tissue,
and its output is the average gene expression of the cell population.
On the other hand, single cell sequencing
measures the genomes of individual cells from the population.
Single-cell RNA sequencing, or scRNA-seq,
measures the transcriptomes of each cell in the sample,
which reveals the heterogeneity of thousands of cells
and provides insight into cellular differences at high resolution.
In this study, I'll be analyzing alternative splicing in scRNA-seq data,
which has rarely been studied due to several issues.
ScRNA-seq results in a very large number of cells and sparse data.
However, with the proper statistical tools like JMP and R,
we can take steps to solving these issues.
In this presentation,
I'll be going over my method to find the most significant alternative spicing
or AS events differentiating cancerous T cells from healthy lymph node T cells.
I'll explain my data processing pipeline, and then I'll go over how to use JMP Pro
for predictive modeling, clustering, and visualization.
And then I'll go over the results of the analysis of scRNA-seq dataset
of T cells in murine melanoma.
This is a diagram of the presentation.
First, I'd like to explain my dataset preparation and processing process.
First, I prepared and processed the dataset.
I took the scRNA-seq dataset of cells in lymph nodes with melanoma in mice
from this paper over here,
and then performed read alignment of the F ASTQ files with STAR,
and detected AS events with the pipeline derived from rMATS,
which uses a generalized linear mixed model.
Instead of using the pairwise comparison,
I ran it on a single sample with each individual cell,
effectively quantifying AS in each cell.
Then I used R to quantify exon skipping events with IJC,
which stands for Inclusion Junction Count,
and SJC, which stands for Skipping Junction Count.
As you can see below in this diagram, an inclusion junction is detected
when the read includes both the flanking sequence,
which is this black sequence over here,
and the exon sequence, which is this E sequence over here,
while a skipping junction count is detected
when the read only includes the two flanking sequences
and just entirely skips the exon.
Then I create a matrix of this data.
Next, I subsetted the T cells
using cell labels previously defined by the authors of the dataset.
To filter AS events, I applied the following rules,
and I did this in R.
There must be at least ten reads per junction.
The event has to be detected in more than ten cells,
and the event cannot have no variability across the cells.
This moved around 75% of the exons.
This shows that filtering is important because it removed exons
with low coverage and were basically unimportant.
This reduced the feature size, which will improve the performance
of predictive modeling and analysis further in the future.
Then I calculate the PSI or percent spliced in value
with this equation: IJC over IJC plus SJC.
Before I go into the next step of predictive modeling,
I wanted to explain the reasons I decided to apply predictive modeling in JMP
for this specific problem.
There have been several studies that successfully used machine learning models
for analyzing RNA-seq data, which achieved high accuracy.
I hypothesized that it would also be accurate for scRNA-seq data.
There are many advantages of using predictive modeling as well.
It can extract meaningful features from huge datasets,
classify data and predict outcomes with supervised learning,
and recognize underlying relationships of data, like in neural networks.
JMP Pro also provides a great interface for exploring these different models,
allowing you to easily apply sophisticated algorithms
to large datasets quickly and accurately.
It also allows for model comparison and screening.
Finally, it generates some visual and interactive reports
that reduce the black box effect of machine learning models,
which means that you don't really know what happens in the model,
you just know the output, which is not very useful.
Our dataset had 8,044 variables or exons, and 1,014 rows or cells.
It was difficult to run analysis with such a large number of columns,
so I used a bootstrap forest for variable selection.
A bootstrap forest is very computationally efficient
and it can operate quickly over wide datasets,
and it uses random sampling with replacement,
causing it to be very robust and accurate.
I could have also used predictor screening,
which is another function in JMP,
but I found that manually running it using a tuning design table
was more computationally efficient
and allowed me to tune the number of trees as well.
But I found that the number of trees doesn't make a huge difference.
In general, the more trees you use leads to better results,
but their improvement decreases as the number of trees increases.
At a certain point, the benefit in prediction performance
from learning more trees will be lower than the cost in computation time.
For the 8,044 exons, I tuned the number of trees,
and I found that around 50 trees resulted in the highest accuracy.
And then I took the top 100 exons from the column contributions,
and then I ran it again with the top 50 exons,
and then again with the 50 exons, comparing the accuracy as you go.
This shows the JMP reports for tuning trees and variable selection.
As you can see, I use the tuning design table and change the number of trees.
I looked at the entropy R square and these different metrics.
I believe 50 trees was the best result.
I took the top 100 exons and then I ran the bootstrap forest with 30 trees
because I did another tuning design table.
For each iteration,
I did a separate tuning design table to find the best number of trees.
As you can see, the entropy R square increases
as the number of exons was decreased.
As I selected the top three different exons,
it shows that it doesn't really matter if it's the top 10 or top 50 or top 100,
it will still result in a high accuracy.
The misclassification rate varied.
The RMS error also decreased,
and the average absolute error also decreased.
With these metrics,
I decided to use the top 10 exons to run the next steps of my analysis.
I used the Model Screening function in JMP Pro
to run the top 10 exons on all the different models that were possible.
It resulted in a high accuracy across the board.
I looked at the prediction profiler to determine how the exons
caused tumor or cancerous cells to be different.
Now, I'll be performing a demo of this process.
Let me open the top 10 exons.
In JMP Pro, you can go to Analyze, Predictive Modeling,
and then Model Screening.
I'll just hit Recall because I used this before
and it automatically updates with your previous settings.
And then it also did K-fold cross v alidation.
This process is very quick, it only takes around 20 seconds or 10 seconds.
All right.
As you can see, the neural boosted had the highest accuracy.
Next was generalized regression lasso.
It split the dataset into training and validation.
You can see that boosted tree had the highest accuracy for training,
which makes sense because it basically fits the data really well
when you use a tree.
But for validation, let's see, the most accurate was in neural boosted.
With this information,
I decided to use a neural boosted to do the final classification of cells
for tumor cells or normal cells,
because it would result in the highest accuracy.
I clicked Run Selected,
and then I used the validation column that I previously created
with the Make V alidation function in the Predictive Modeling tab.
Here are the results of the neural network.
It was very quick, again.
You can see that there are the ROC graphs.
It basically has a very high AUC.
If you take a look at the prediction profiling, the profiler,
this shows the different ways...
Let me explain this better.
As you move your cursor along, if you look at A POBR, this gene,
when the inclusion of this exon increases,
then there's a higher probability that the cell will be a tumor cell.
Over here, for this, this is also the same trend.
For this one, this is the inverse trend.
For PCDH9, as the inclusion of this exon increases,
the probability that it will result in a tumor cell decreases.
Sorry.
This allows you to look at the role of these exons
in causing the function of tumor and normal cells to be different.
This is a very powerful visualizer,
and it can show the different relationships between these cells.
You can also just save the formulas to the...
Sorry, one second.
You do Publish Prediction Formula, and that saves it in the formula depot.
Okay.
Next, I'd like to look at the difference in expression
as it goes from 5 days to 8 days to 11 days.
This dataset also provided the cells at different time points.
If you look at here,
this column shows that the different times are available for us to analyze.
I subsetted the tumor cells.
Let me find that over here.
Tumor only.
I compared the distribution of the different exons.
Let me show you the graph.
As you can see,
there are several exons that increase in expression over time.
Over here , SLAMF9 or T OMM20 increases dramatically,
and these also increase.
There are also exons that increase and then decrease,
which show that they have an optimal period of time
that they're the most active.
I'll be going over these specific genes and functions later in my presentation.
Let me go back to my presentation.
All right.
Here, I also ran clustering on the top 10 exons dataset.
I did K- means clustering,
and this more accurately classified the tumor versus normal cells.
As you can see here,
the blue is normal cell and then the red is the tumor cell.
This shows that there are possibly different subsets of tumor cells
that have different functions.
I also ran UMAP on all the cells.
UMAP is another dimensional reduction formula.
This was ran with a JMP add- in that was created by this author.
I will link it in my presentation in the community.
Here is the gene enrichment analysis that I performed.
As you can see, the top function of all of these genes
was RNA import into the mitochondrion.
The second most important one was regulation of leukocyte degranulation.
T cells are a type of leukocyte.
It can show that possibly these leukocytes or T cells are degranulated,
and that causes the cells to become tumors.
Also, malignant tumors selectively retain mitochondrial genome and ETC function.
That can also explain why RNA import into the mitochondrion
is an important function.
This is another graph of the top GO biological processes.
Over here, the top one is cellular process.
There are also biological adhesion, regulation, and immune system processes
that can be further explored.
I'd like to explain the most notable genes.
These genes are basically the top genes that were in the column contributions.
S TPBX2 is involved in intracellular trafficking,
control of SNARE complex assembly,
and the release of cytotoxic granules by natural killer cells.
This can show that the T cells are involved in some sort of trafficking.
Or the cytotoxic T cells can use this exon
to regulate the tumor versus normal functions.
SLAMF9 nine encodes a member
of the signaling lymphocytic activation molecule family and its transmembrane.
WARS is a tryptophanyl- tRNA synthetase
that catalyzes the aminoacylation of tRNA with tryptophan
and is induced by interferons.
All of these genes have a role in the immune system and T cell function.
We can study these further
and determine if they actually have a important function with in vitro tests.
Next, I explained this earlier,
but I analyzed the tumor gene expression over time
and T OMM20, which was the one that had the most dramatic inclusion,
is actually implicated in the translocase of outer mitochondrial membrane complex
that facilitates cancer aggressiveness and therapeutic resistance
in chondrosarcoma, which is a type of cancer.
You can see that TOMM 20 causes cancers to be more aggressive.
It's really interesting how our basic analysis with JMP Pro
resulted in this exon being the most important.
This shows the effectiveness of JMP in these types of studies.
In conclusion, the exploration of this dataset facilitated by JMP
demonstrates the role of genes and exons in T cells in melanoma.
A good thing is that the code can be replicated in Python.
JMP allows that for more robust or detailed analysis.
These potential genes like TOMM 20 or STPX9
can aid in the discovery of novel and personalized approaches
to cancer treatment.
And then we can perform in vitro testing on the top genes.
Thank you for listening to my presentation.
I will provide all the code and R scripts in my presentation link.