cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
Application of JMP to Analyze Gene Expression Single-Cell RNA-seq Data in Melanoma Cells (2022-US-30MP-1113)

Catherine Zhou, Student, Lynbrook High School

 

This presentation explores the use of JMP Pro to analyze gene expression of single-cell RNA sequencing data from murine melanoma samples. Single-cell RNA sequencing is a next-generation sequencing technology that reveals the heterogeneity between individual cells and permits comparison of the transcriptores of different cell types. While there are many statistical tools to analyze scRNA-seq data, JMP provides many streamlined methods for initial analysis and visualization of data. The objectives for this study were to determine cell subtypes from a sample of T cells extracted from cancer-infiltrated lymph nodes and adjacent tissue and find differentially expressed genes. PCA and clustering methods were used for preliminary exploration of cell heterogeneity. Predictive modeling was used to determine if cells can be accurately classified as cancerous by gene expression profiles. Next, T cells taken from different time points were analyzed to study the trajectory of gene expression associated with functional changes in these cells. Significant gene signatures were explored with pathway enrichment analysis and other downstream tools.

 

 

 

Hello,  everyone.

Today  I'll  be  presenting about  the  application  of   JMP

to  analyze  gene  expression single- cell  RNA  data  in  melanoma  cells.

My  name  is  Catherine  Zhou,

and  I'm  a  high  school  student at  Lynbrook  High  School.

As  introduction,

melanoma  develops  in the  cells or  melanocytes  that  produce  melanin

and  is  the  most  aggressive  type of  skin  cancer,

representing  65%  of  all  deaths from  skin  cancer.

T  cells  are  a  type  of  white  blood  cells that  develop  from  stem  cells

in  the  bone  marrow and  mature  in  the  thymus,

where  they  multiply  and  differentiate  into helper,  regulatory,  or  cytotoxic  T  cells

or  become  memory  T  cells.

Cytotoxic  T  cells, which  are  activated  by  various  cytokines,

bind  to  and  kill infected  cells  and  cancer  cells.

In  melanoma  patients,  T  cells  mount an  immune  response  against  the  tumor,

but  at  some  point, the  responder  T   cells   become  ineffective

due  to  a  local  immunosuppressive  process occurring  at  the  tumor  sites.

We'd  like  to  identify  the  causes behind  T  cell  dysfunction.

Alternative  splicing is  a  regulatory  process

essential  to  generate transcriptome  diversity.

Misregulation  contributes to  disease  and  cancer.

Eukaryotic  genes  are  composed of  int ronic  and  exonic  sequences,

as  you  can  see  in  this  diagram.

In  the  alternative  splicing  process,

non-coding  introns  of  a  gene are  selectively  removed

and  the  included  exons are  combined  in  the  final  messenger  RNA,

and translation  of  these  different isoforms,  where  different  combinations

result  in  different  proteins and  different  cellular  functions.

Next,  traditional  bulk  sequencing examines  the  genome  of  a  cell  population

such  as  the  cell  culture  or  tissue,

and  its  output  is  the  average  gene expression  of  the  cell  population.

On  the  other  hand,  single  cell  sequencing

measures  the  genomes of  individual  cells  from  the  population.

Single-cell  RNA  sequencing, or  scRNA-seq,

measures  the  transcriptomes of  each  cell  in  the  sample,

which  reveals  the  heterogeneity of  thousands  of  cells

and  provides  insight  into cellular  differences  at  high  resolution.

In  this  study,  I'll  be  analyzing alternative  splicing in  scRNA-seq data,

which  has  rarely  been  studied due  to  several  issues.

ScRNA-seq  results  in  a  very  large number  of  cells  and  sparse  data.

However,  with  the  proper  statistical  tools like  JMP  and  R,

we  can  take  steps  to  solving  these  issues.

In  this  presentation,

I'll  be  going  over  my  method  to  find the  most  significant  alternative  spicing

or  AS  events  differentiating  cancerous T  cells  from  healthy  lymph  node  T cells.

I'll  explain  my  data  processing  pipeline, and  then  I'll  go  over  how  to  use  JMP  Pro

for  predictive  modeling, clustering,  and  visualization.

And  then  I'll  go  over  the  results of  the  analysis  of   scRNA-seq   dataset

of  T cells  in murine  melanoma.

This  is  a  diagram  of  the  presentation.

First,  I'd  like  to  explain  my   dataset preparation  and  processing  process.

First,  I  prepared and  processed  the   dataset.

I  took  the   scRNA-seq  dataset  of  cells in  lymph nodes  with  melanoma  in  mice

from  this  paper  over  here,

and  then  performed  read  alignment of  the  F ASTQ  files  with  STAR,

and  detected  AS  events with  the  pipeline  derived  from  rMATS,

which  uses  a  generalized linear  mixed  model.

Instead  of  using  the  pairwise  comparison,

I  ran  it  on  a  single  sample with  each  individual  cell,

effectively  quantifying  AS  in  each  cell.

Then  I  used  R  to  quantify exon  skipping  events  with  IJC,

which  stands  for  Inclusion  Junction  Count,

and  SJC,  which  stands for  Skipping  Junction  Count.

As  you  can  see  below  in  this  diagram, an  inclusion  junction  is  detected

when  the  read  includes both  the  flanking  sequence,

which  is  this  black  sequence  over  here,

and  the  exon  sequence, which  is  this  E  sequence  over  here,

while  a  skipping  junction  count is  detected

when  the  read  only  includes the  two  flanking  sequences

and  just  entirely  skips  the  exon.

Then  I  create  a  matrix  of  this  data.

Next,  I  subsetted  the  T  cells

using  cell  labels  previously  defined by  the  authors  of  the   dataset.

To  filter  AS  events, I  applied  the  following  rules,

and  I  did  this  in  R.

There  must  be  at  least ten  reads  per  junction.

The  event  has  to  be  detected in  more  than  ten  cells,

and  the  event  cannot  have no  variability  across  the  cells.

This  moved  around  75%  of  the  exons.

This  shows  that  filtering  is  important because  it  removed  exons

with  low  coverage and  were  basically  unimportant.

This  reduced  the  feature  size, which  will  improve  the  performance

of  predictive  modeling  and  analysis further  in  the  future.

Then  I  calculate  the  PSI or  percent  spliced  in  value

with  this  equation: IJC  over  IJC  plus  SJC.

Before  I  go  into  the  next  step of  predictive  modeling,

I  wanted  to  explain  the  reasons  I  decided to  apply  predictive  modeling in  JMP

for  this  specific  problem.

There  have  been  several  studies  that successfully  used  machine  learning  models

for  analyzing  RNA-seq  data, which  achieved  high  accuracy.

I  hypothesized  that  it  would  also be  accurate  for   scRNA-seq  data.

There  are  many  advantages of  using  predictive  modeling  as  well.

It  can  extract  meaningful  features from  huge   datasets,

classify  data  and  predict  outcomes with  supervised  learning,

and  recognize  underlying  relationships of  data,  like  in  neural  networks.

JMP Pro  also  provides  a  great  interface for  exploring  these  different  models,

allowing  you  to  easily apply  sophisticated  algorithms

to  large   datasets  quickly  and  accurately.

It  also  allows for  model  comparison  and  screening.

Finally,  it  generates some  visual  and  interactive  reports

that  reduce  the  black  box  effect of  machine  learning  models,

which  means  that  you  don't  really  know what  happens  in  the  model,

you  just  know  the  output, which  is  not  very  useful.

Our   dataset  had  8,044  variables  or  exons, and  1,014  rows  or  cells.

It  was  difficult  to  run  analysis with  such  a  large  number  of  columns,

so  I  used  a  bootstrap  forest for  variable  selection.

A  bootstrap  forest is  very  computationally  efficient

and  it  can  operate quickly  over  wide   datasets,

and  it  uses  random  sampling with  replacement,

causing  it  to  be  very  robust  and  accurate.

I  could  have  also  used predictor  screening,

which  is  another  function  in  JMP,

but  I  found  that  manually  running  it using  a  tuning  design  table

was  more  computationally  efficient

and  allowed  me  to  tune the  number  of  trees  as  well.

But  I  found  that  the  number  of  trees doesn't  make  a  huge  difference.

In  general,  the  more  trees  you  use leads  to  better  results,

but  their  improvement  decreases as  the  number  of  trees  increases.

At  a  certain  point, the  benefit  in  prediction  performance

from  learning  more  trees  will  be  lower than  the  cost  in  computation  time.

For  the  8,044  exons, I   tuned the  number  of  trees,

and  I  found  that  around  50  trees resulted  in  the  highest  accuracy.

And  then  I  took  the  top  100  exons from  the  column  contributions,

and  then  I  ran  it  again with  the  top  50  exons,

and  then  again  with  the  50  exons, comparing  the  accuracy  as  you  go.

This  shows  the  JMP  reports for  tuning  trees  and  variable  selection.

As  you  can  see,  I  use  the  tuning  design table  and  change  the  number  of  trees.

I  looked  at  the  entropy R  square and  these  different  metrics.

I  believe  50  trees  was  the  best  result.

I  took  the  top  100  exons  and  then I  ran  the   bootstrap  forest  with  30  trees

because  I  did  another  tuning  design  table.

For  each  iteration,

I  did  a  separate  tuning  design  table to  find  the  best  number  of  trees.

As  you  can  see, the  entropy  R  square  increases

as  the  number  of  exons  was  decreased.

As  I  selected the  top  three  different  exons,

it  shows  that  it  doesn't  really  matter if  it's   the  top  10  or  top  50  or  top  100,

it  will  still  result  in  a  high  accuracy.

The  misclassification  rate   varied.

The  RMS  error  also  decreased,

and  the  average  absolute  error also  decreased.

With  these  metrics,

I  decided  to  use  the  top  10  exons to  run  the  next  steps  of  my  analysis.

I  used  the  Model  Screening function  in  JMP  Pro

to  run  the  top  10  exons  on  all the  different  models  that  were  possible.

It  resulted  in  a  high  accuracy across  the  board.

I  looked  at  the  prediction  profiler to  determine  how  the  exons

caused  tumor  or  cancerous  cells to  be  different.

Now, I'll  be  performing a  demo  of  this  process.

Let  me  open  the  top  10  exons.

In JMP  Pro,  you  can  go  to  Analyze, Predictive  Modeling,

and  then  Model  Screening.

I'll just hit  Recall because  I  used  this  before

and  it  automatically  updates with  your  previous  settings.

And  then  it also  did K-fold  cross v alidation.

This  process  is  very  quick,  it  only  takes around  20  seconds  or  10  seconds.

All  right.

As  you  can  see,  the  neural  boosted had  the  highest  accuracy.

Next  was  generalized  regression lasso.

It  split  the   dataset into  training  and  validation.

You  can  see  that  boosted  tree had  the  highest  accuracy  for  training,

which  makes  sense  because  it basically  fits  the  data  really  well

when  you  use  a  tree.

But  for  validation,  let's  see, the  most  accurate  was  in  neural  boosted.

With  this  information,

I  decided  to  use  a  neural  boosted to  do  the  final  classification  of  cells

for  tumor  cells  or  normal  cells,

because  it  would  result in  the  highest  accuracy.

I  clicked  Run  Selected,

and  then  I  used  the  validation  column that  I  previously  created

with  the  Make V alidation  function in  the  Predictive  Modeling  tab.

Here  are  the  results of  the  neural  network.

It  was  very  quick,  again.

You  can  see  that  there  are  the  ROC  graphs.

It  basically  has  a  very  high  AUC.

If  you  take  a  look  at the  prediction  profiling, the profiler,

this  shows  the  different  ways...

Let  me  explain  this  better.

As  you  move  your  cursor  along, if  you  look  at  A POBR,  this  gene,

when  the  inclusion  of  this  exon  increases,

then  there's  a  higher  probability that  the  cell  will  be  a  tumor  cell.

Over  here,   for  this, this is  also  the  same  trend.

For  this  one,  this  is  the  inverse  trend.

For   PCDH9,  as  the  inclusion of  this  exon  increases,

the  probability  that  it  will  result in  a  tumor  cell  decreases.

Sorry.

This  allows  you  to  look at  the  role  of  these  exons

in  causing  the  function of  tumor  and   normal   cells  to  be  different.

This  is  a  very  powerful  visualizer,

and  it  can  show  the  different relationships  between  these  cells.

You  can  also  just  save the  formulas  to  the...

Sorry,  one  second.

You  do  Publish  Prediction  Formula, and  that  saves  it  in  the  formula  depot.

Okay.

Next,  I'd  like  to  look at  the  difference  in  expression

as  it  goes  from  5  days to  8  days  to  11  days.

This   dataset  also  provided the  cells  at  different  time  points.

If  you  look  at  here,

this  column  shows  that  the  different times  are  available  for  us  to  analyze.

I  subsetted  the  tumor  cells.

Let  me  find  that  over  here.

Tumor  only.

I  compared  the  distribution of  the  different  exons.

Let  me  show  you  the  graph.

As  you  can  see,

there  are  several  exons that  increase  in  expression  over  time.

Over   here ,  SLAMF9   or  T OMM20 increases  dramatically,

and  these  also  increase.

There  are  also  exons that  increase  and  then  decrease,

which  show  that  they  have an  optimal  period  of  time

that  they're  the  most  active.

I'll  be  going  over  these  specific  genes and  functions  later  in  my  presentation.

Let  me  go  back  to  my  presentation.

All  right.

Here,  I  also  ran  clustering on  the  top  10  exons  dataset.

I  did  K- means  clustering,

and  this  more  accurately  classified the  tumor  versus  normal  cells.

As  you  can  see  here,

the  blue  is  normal  cell and  then  the  red  is  the  tumor  cell.

This  shows  that  there  are  possibly different  subsets  of  tumor  cells

that  have  different  functions.

I  also  ran  UMAP  on  all  the  cells.

UMAP  is  another dimensional  reduction  formula.

This  was  ran  with  a  JMP  add- in that  was  created  by  this  author.

I  will  link  it  in  my  presentation in  the  community.

Here  is  the  gene  enrichment analysis  that  I  performed.

As  you  can  see,  the  top  function of  all  of  these  genes

was  RNA  import into   the  mitochondrion.

The  second  most  important  one  was regulation  of  leukocyte  degranulation.

T cells  are  a  type  of  leukocyte.

It  can  show  that  possibly  these  leukocytes or  T  cells  are  degranulated,

and  that  causes  the  cells to  become  tumors.

Also,  malignant  tumors  selectively  retain mitochondrial  genome  and  ETC  function.

That  can  also  explain why  RNA  import into  the  mitochondrion

is  an  important  function.

This  is  another  graph of  the  top  GO  biological  processes.

Over  here, the  top  one  is  cellular  process.

There  are  also  biological  adhesion, regulation,  and  immune  system  processes

that  can  be  further  explored.

I'd  like  to  explain the  most  notable  genes.

These  genes  are  basically  the  top  genes that  were  in   the   column  contributions.

S TPBX2  is  involved in  intracellular  trafficking,

control  of  SNARE  complex  assembly,

and  the  release  of  cytotoxic  granules by  natural  killer  cells.

This  can  show  that  the  T  cells are  involved  in  some  sort  of  trafficking.

Or  the  cytotoxic  T  cells  can  use  this  exon

to  regulate  the  tumor versus  normal  functions.

SLAMF9  nine  encodes  a  member

of  the  signaling   lymphocytic  activation molecule  family  and  its  transmembrane.

WARS is  a  tryptophanyl- tRNA  synthetase

that  catalyzes  the  aminoacylation of  tRNA  with  tryptophan

and  is  induced  by  interferons.

All  of  these  genes  have  a  role in  the  immune  system  and  T  cell  function.

We  can  study  these  further

and  determine  if  they  actually  have a  important  function  with  in vitro  tests.

Next,  I  explained  this  earlier,

but  I  analyzed the  tumor  gene  expression  over  time

and  T OMM20,  which  was  the  one that  had  the  most  dramatic  inclusion,

is  actually  implicated  in  the  translocase of  outer  mitochondrial  membrane  complex

that  facilitates  cancer  aggressiveness and  therapeutic  resistance

in   chondrosarcoma, which  is  a  type  of  cancer.

You  can  see  that  TOMM 20  causes cancers  to  be  more  aggressive.

It's  really  interesting how  our  basic  analysis  with   JMP Pro

resulted  in  this  exon being  the  most  important.

This  shows  the  effectiveness of  JMP  in  these  types  of  studies.

In  conclusion,  the  exploration of  this   dataset  facilitated  by  JMP

demonstrates  the  role  of  genes and  exons  in  T  cells  in  melanoma.

A  good  thing  is  that  the  code can  be  replicated  in  Python.

JMP  allows  that for  more  robust  or  detailed  analysis.

These  potential  genes like  TOMM 20  or   STPX9

can  aid  in  the discovery of  novel  and  personalized  approaches

to  cancer  treatment.

And  then  we  can  perform in  vitro  testing  on  the  top  genes.

Thank  you  for  listening to  my  presentation.

I  will  provide  all  the  code and  R  scripts  in  my  presentation  link.