Choose Language Hide Translation Bar
View Original Published Thread

Uncovering Gene Patterns in 3D Matrices of Single Cell Data: An Application of Two New JSL Add-ins (2023-EU-30MP-1327)

Matrices are second-order tensors. In some applications, the input data are third- or higher-order tensors. For example, gene expression data from different cell types found in a single donor form a second-order tensor of the form {cell type × gene}, while the collection of similar cell types on different donors results in a third-order tensor of the form {cell type × gene × donor}. We present two JMP Add-ins, the first of which performs multi-way data analysis of a third-order tensor using non-negative tensor factorization. The second Add-in allows visualization of the major expression patterns found in the tensor in the form of parallel heatmaps. Using a public transcriptome dataset, we show how these tools enable the discovery and visualization of tri-clusters of cell types, genes, and donors that share similar transcriptomic patterns, resulting in cell type signatures that can be used to deconvolute bulk mixtures of different cell types. We also show that the Python package used to perform the core computations can be integrated seamlessly into JSL scripts.

 

 

Hello.  Today  I'm  glad  to  present  you   scA2SIGN  method  to  define  the  signatures  for  the  convolution  starting  from  single- cell  data.  We  will  make  also  a  short  demonstration  of  it.   This  work  was  done  in  collaboration  with   Paul Fogel  and   Alexey Fedorov.

Before  to  do  the  deep  dive  about  our  method,  let's  do  the  short  introduction  in  order  to  explain  you  what  is  the  cell  deconvolution.

The  cell  type  deconvolution  is  an  silica  method  that  allows  to  identify  the  cell  proportions  from  bulk  RNA  sequencing  data.  This  method  requires  to  have  specific  signature  matrixes  containing  the  marker  genes  for  each  type  of  the  interest.

Our  method  proposed  to  identify  this  signature  matrix  and  start  from  the   single-cell data  because  in   single-cell data,  as  you  can  see  here,  we  do  have  the  transcriptomic  profiles  for  each  single  cell  from  the  data  present  in  the  tissue  in  the  samples.   Each  dot  here  corresponds  to  the  cell.  Cells  with  the  closed  transcriptomic  profiles,  they  are  grouped  together  and  form  clusters.

We  can  go  to  the  next  slide,  please.

Here  you  have  a  brief  overview  of  our  method.  Our  method  is  called   scA2SIGN,  which  stands  for  single  cell  agnostic  algorithm  for  the  convolution.  As  was  mentioned  previously,  it  starts  from  the   single-cell data   ideally  coming  from  several  subjects,  individuals,  in  order  to  cover  biological  variability  that  can  be  present  between  individuals.

The   single-cell data  need  to  be  preprocessed  and  assembled  in  one  data  set.   Here  the  cells  with  a  close  transcriptomic  profile  so  they  are  assembled  into  the  clusters.   This  representation  in  two  dimensions  is  called  UMAP.

From  this   single-cell data,  we  need  to  retrieve  the  matrices  which  are  going  for  each  individual.  They  are  going  to  contain  dreams  on  the  rows  and  clusters  on  the  columns.   It  means  we  need  to  aggregate  the  data  and  several  matrices  corresponding  to  each  subject  represents  a  tensor.

If  we  try  to  identify  the  signatures  right  away  at  this  step,  it's  going  to  be  difficult  because  some  clusters  of   single-cell data,  they  close  from   transcriptomic al point  of  view.  It  can  raise  a  problem  of  multicollinearity.

In  order  to  deal  with  this,  we  run  several  rounds  of  non- negative  tensor  factorization  in  order  to  get  the  reduced  tensor  and  assemble  close  clusters  together  in  order  to  define  the  cell  types  with  distant  transcriptomic  profiles  present  in  the  sample.

Once  again,  the  structure  of  the  data  is  going  to  be  the  same.  We  are  going  to  show  it  in  Jump  application  with  dreams  on  rows,  cell  types  on  columns,  and  several  matrices  corresponding  to  several  subjects.  From  this,  our  method  proposed  to  two  simple  criteria  to  define  the  marker  genes.  This  part  is  going  to  be  covered  by  Alexey.

Last  but  not  least,  the  same  genes  can  be  used  in  order  to  label  cell  types  are  defined  by  our  method  based  on  groups  of  the  clusters.  Last  but  not  least,  we  can  use  a  non- negative  regression  in  order  to  perform  the  convolution  on  bulk   RNA  sequencing  data.

With  this  we  are  going  to  talk  a  little  bit  more  about  non- negative  tensor factorization  in  order  to  explain  what  is  it  exactly.   Non-negative  tensor factorization  is  a  dimensionality  reduction  technique.   As  it  was  mentioned,  the  advantage  of  this  method  is  that  it  allows  to  take  into  account  the  several  matrices  coming  from  several  subjects  and  to  do  it  at  one  step,  comparing  to  other  dimensionality  reduction  techniques  like  ICA,  when  you  need  to  run  several  rounds  of  ICA  for  each  matrice  and  next  to  find  a  consensus  in  the  data.

As  output  is  an  input  of  the  NTF,  we  use  the  matrices  previously  described  as  the  output  of  the  NTF.  We  get  the  components .  We  get  the  fingerprints  for  clusters  here  represented  by  W,  the  fingerprints  for  the  subject  represented  by  Q.  In  this  fingerprint,  they  have  the  genes or  the  signature.  Fingerprints  means  a  set  of  genes  with  lower   [inaudible 00:05:19]  defining  by  H.

Each  component  corresponds  a  strong  biological  signal  present  in  the  data.   If  we  make  the  sum  of  these  components  of  these  strong  biological  signals  present  in  the  data  plus  a  small  error,  we  can  approximate  the  data  which  is  represented  by  this  equation.

With  this  we  can  go  to  the  next  step  and  to  see  exactly  in  the  application  how  the  data  looks  like.

Right  now,  we  are  in  the  Jump  application  and  you  can  see  the  input  data  required  for  our  method.   On  the  left,  you  do  see  the  eleven  matrices  coming  from  different  donors.   The  matrices  have  different  names  and  we  used  the  idea  of  the  donors  as  their  names.  In  the  tables,  we  have  17  columns  corresponding  to  17  clusters  identifying  in   single-cell data.  On  the  rows  we  do  have  the  gene  names  preceded  by  the  idea  of  the  individual  in  order  to  distinguish  different  transcriptomic  data  coming  from  different  donors.

Once  we  loaded  the  data  into  the  system,  we  can  launch  our  add-in  and  we  are  going  to  see  how  it  looks  like.   It  looks  like  a  window  with  several  parameters.  These  parameters  are  written  in  Python  because  the  code  behind   use  Python  language.  But  you  do  not  need  to  be  afraid  because  we  are  going  to  change  only  one  important  parameter,  which  is  number  of  components  in  here.

We  suggest  to  start  with  the  number  of  components  equal  to  number  of  clusters  present  in  the  data  and  it's  going  to  be  equal  to  17.  Thank  you.   With  this,  we  can  simply  click  on  the  submit  button  and  the  application  will  run  the  algorithm  behind.   Next  with  this,  I'm  going  to  leave  the  floor  to  Paul  in  order  to  explain  the  output  of  NTF  and  to  guide  you  following  the  next  step  of  our  method.   Thank  you  for  your  attention.

Thank  you,  Garria.  The  output  of  the  a dd-in consists  of  three  tables  with  NTF  components  along  each  dimension  of  the  tensor:  cell  clusters,  subjects,  and  genes.  First,  let  us  look  at  the  cell  cluster  component  stage.

I  open  the  table  and  each  row  on  this  table  corresponds  to  a  cell  cluster  in  each  column  to  a  component.   There  are  17  rows  and  17  components  as  specified  in  the  add-in  input  parameter  by  Galina.  Each  entry  in  the  table  corresponds  to  the  loading  of  a  cell  cluster  onto  a  component.

Now,  scrolling  through  the  columns  of  the  table,  we  see  that  some  of  the  components  contain  many  zeros  like  components  16,  which  is  good  because  it  indicates  that  these  components  approximate  the  expression  data  coming  from  the  few  cell  clusters  with  nonzero  loadings.   Here  we  have  exactly  one  cell  cluster  with  a  one,  and  all  of  us  are  zeros,  or  almost  zero.

The  fact  that  cell  clusters  have   nonzero  loadings  on  the  same  component  indicates  a  high  degree  of  collinearity, so  regrouping  of  these  clusters  is  advisable.   We  have  an  example  here  with  component  12,  where  three  actually  cell  clusters  have   nonzero  loadings.  Now,  let  us  look  at  the  subject  components  table.

I  open  it  and  scrolling  through  the  columns  of  the  tables,  we  see  now  that  unlike  the  site  clusters  components  table,  the  columns  here  tend  to  be  dense,  with  only  a  few  zeros  here  and  there.  This  is  good  because  it  shows  that  most  subjects  contribute  to  each  component.  In  other  words,  the  expression  data  approximated  by  each  component,  or  we  call  it  fingerprint,  applies  to  most  subjects.  In  a  sense,  the  fingerprint  is  universal.

Now,  Jump  has  an  excellent  feature  that  allows  you  to  view  the  histograms  of  the  columns  at  the  top  of  the  table.   Let's  click  on  this  icon  here,  and  we  now  see  all  the  histograms.   We  see  that  the  columns  are  more  or  less  dense  anyway,  and  it  will  be  good  to  be  able  to  quantify  the  degree  of  sparsities  of  each  column.   How  do  we  do  that?  To  quantify  the  degree  of  falsity  involves  the  cell  cluster  and  subject  components.  We  use  the  index  represented  by  this  formula.

This  formula  was  introduced  by   Herfindahl and  Hirschmann  to  evaluate  the  degree  of  concentration  achieved  by  fields.  For  example,  to  avoid  monopoly  situations  that  can  result  from  mergers.  Originally,  the  inverse  formula  was  published,  but  for  simplicity,  we  will  continue  to  refer  to  this  index  as  the   HHI index.

Similar  indexes  have  been  proposed  more  recently.  They  are  mathematically  equivalent,  but  this  index  has  the  advantage  of  being  very  interpretable  as  it  gives  the  number  of   non-negligible  entries  in  the  vector  which  are  highlighted  here.  In  the  example,  we  have  a  number  of  values,  but  only  the  highlighted  values  are   non-negligible.   The   HHI index  of  this  vector  is  five.

Returning  to  our  workflow,  we  see  how  this  index  allows  us  to  identify  components  that  are   sparse along  the  cell  cluster  dimension  and  dense  along  the  subject  dimension.   We  call  these  components   sparse and robust They   are red  crosses  here  with  high  subject   HHI and  low  cluster  side  cluster   HHI on  the  top  of  this  field.

Based  on  these  four  components,  we  can  make  the  necessary  groupings  of  sight  clusters  to  eliminate  collinearities  and  the  remaining  cell  clusters  here... Sorry for this interruption.  [inaudible 00:13:22]. I just lost  [inaudible 00:13:22], I'll put it here.

The  blue   crosses here  correspond  to  the  remaining  cell  clusters  that  will  be  reintroduced  in  a  second  iteration  of  the  analysis.   Note  well  that  this  addresses  with  the  strongest  signal,  after  being  identified  by  the  NTF,  components  are  removed  from  the  data  set  and  the  next  NTF  starts  over  with  the  reduced  data  set.

To  illustrate  this  process...  I'm  sorry  but  I  have  a  problem  with  here  with  the...  It keeps  reappearing.

To  illustrate  this  process,  let's  now  look  at  how  these  iterations  of  the  recursive  NTF  play  out.  We  see  here  the   HHI table  from  the  first  iteration  which  will  help  us  decide  which  NTF  components  to  keep  and  which  to  discover.  As  mentioned  earlier,  good  components  are  those  that  are  specific  of  only  a  few  cell clus tures  which  translates  in  the  low   cell cluster  HHI index.  These  are  labeled  rows  here  at  the  bottom  of  the  table.

We  also  run  one  high  degree  of  robustness  to  subject  diversity  and  this  translates  into  a  high   HHI index.   How  do  we  decide  whether  an   HHI value  is  low  or  high?  We  can  use  that  the  Jump   [inaudible 00:15:47]  platform  which  is  available  in  the  specialist  modeling  platform.   We  run  this  platform  and  we  apply  a  normal  niche  model  to  distinguish  between  low  and  high   HHI values  and  stop  the  resulting  classification.

When  we  do  that,  we  see  here  the  red  crosses  corresponding  to  the  components  that  are  selected  because  they  have  a  high   HHI subject  index  and  a  low   HHI  cell cluster  index  and  they  are  automatically  identified  as  such  by  the  Jump  Normal  platform.

Let  us  look  now  at  the  loadings  of  the  cell clusters  on  these  particular  components .  We  see  on  component  15  and  component  16  that   cell cluster 15  and   cell cluster 17  has  exclusively  associated  these  components  so  they  can  be  well  approximated  by  these  components.   In  contrast  to  these   cell clusters, on  component  13,  we  see  three- side  clusters  regrouped  along  this  component.  This  means  that  they  are  correlated  somehow,  it  will  be  hard  to  distinguish  them,  so  it's  better  to  regroup  them.  Similarly  on  component  12,  with  the  cell clusters   5, 2, and 14.

We  can  now  look  at  a  second  iteration  of  the  Recursive  NTF.   Similarly  we  look  at  the  components   HHI on  the  subjects  and  on  the   cell clusters  and  we  see  now  that  only  one  component,  component  four,  satisfies  these  retrainments  requirements  for  being  retained  during  this  iteration.   If  we  look  now  at  the  components  audience,  we  see  that  cell cluster  seven  is  exclusively  associated  with  this  component.  Good.

Now  that  we  have  done  that,  we  are  not  going  to  show  you  all  of  the  iterations.  Actually,  there  are  four  iterations  and  we  end  up  now  with  these  groupings  that  we  see  here.   We  have  actually  nine  cell  clusters  after  regrouping.  Some  of  them  are  groupings  of  three  or  four  original   cell clusters.  However,  five  cell clusters  did  not  need  to  be  regrouped.   What  is  the  biological  meaning  of  these  groupings?

By  matching  the  most  highly  expressed  genes  in  each  cell  cluster  with  bioinformatic  databases,  it  is  possible  to  associate  the   cell clusters  with   known biological  cell  types.  For  example,  the  NTF  groupings  2- 5-14  was  identified  as  a  natural killer  CD8  cell  type.

We  found  that  there  were  several  groupings  that  actually  pointed  to  the  same  major  key  cell  type.  Therefore,  we  decided  to  merge  them  and  came  up  with  five  main   cell types.  However,  it  is  still  possible  to  extract  signatures  for   [inaudible 00:20:12]  cell typ es  that  can  be  used  in  a  second  stage  of  the  convolution.  But  we  will  come  back  to  this  point  at  the  end  of  this  presentation.

Let  us  now  look  at  the  resulting  reduced  tensor.   I'm  just  opening  the  table.  You  see  here  the  table,  and  as  expected,  there  are  now  five  rows  corresponding  to  the  regrouped  cell  clusters.

Now  that  we  have  introduced  cluster  grouping  to  minimize  collinearity,  I  will  have  Alexey  explain  how  we  proceed  to  create  signature  genes  that  are  most  effective  for  decomposition.

Once  the  cell  clusters  are  grouped,  we  do  still  need  to  create  gene  signatures  for  each  of  the  groupings  to  ensure  accurate  convolutions  of  the  mixed  samples.   So far,  we  have  considered  the  cell  and  subject  component  vectors.  Let  us  now  consider  the  gene  component  vector.

The  table  we  see  here  is  not  really  informative.  We  can  only  see  zeros  everywhere  due  to  the  fact  that  only  small  minority  of  genes  have  actually   nonzero  loadings  on  the  NTF  components.   If  you  can  go  to  the  next  iteration,  please,  Paul.

Let  us  now  examine  the  loadings  of  the  genes  on  this  representation  of  the  components  which  we've  called  sparse  and  robust.

These  were  selected  during  the  first  iteration  of  the  recursive  NTF.  We  see  here  in  red  that  some  of  the  distinctive  genes  are  shared  by  all  components.  We  know  well  that  each  component  is  specific  to  a  particular  cell  cluster  or  a  group  of  closed  cell clusters.  This  means  that  some  of  the  genes  are  actually  strongly  expressed  in  several  clusters  with  some  degree  of  linearity  between  the  cell  clusters.

This  is  exactly  how  niche  works.  Biological  parts  are  not  necessarily  orthogonal  to  each  other.  Therefore,  we  cannot  rely  solely  on  NTF  gene  components  to  identify  gene  markers  that  are  unique  to  a  cell  cluster  or  a  group  of  this,  we  must  return  to  the  actual  expression  data.

We  are  getting  back  to  expression  data  and  we  will  introduce  two  criteria.  The  first  one  for  a  gene  to  be  considered  the  marker  gene  of  a  particular   cell cluster,  we  say  that  it  should  always  be  more  expressed  in  the  cluster  than  in  other  clusters,  regardless  of  which  subject  is  considered.  This  is  shown  in  figure  bottom  left,  where  we  have  depicted  two  genes.

With  some   [inaudible 00:23:20],  we  see  that  these  points  on  the  left  correspond  to  a  unique  marker  gene  for  B  cells.  We  can  see  only  B  cells  everywhere.  On  the  right,  a  non-marker  gene  is  highly  expressed  in  both  B  cells,  non classical  monocytess  and  natural  killer  cells.

The  second  criteria  will  be  that  the  associated  cell  cluster  should  not  only  be  at  the  top  of  the  other  clusters.  When  we  look  at  the  expression  of  the  market  gene  over   [inaudible 00:23:55]  subjects  and  cell  types,  but  also  it  should  be  well  ahead  of  others.

For  this,  we  use  this  score  with  a  formal  issue  on  the  right  that  were  proposed  by   [inaudible 00:24:09]  in  their  original  paper.  This  score  reflects  the  specificity  of  a  gene  marker.  This  score  is  based  solely  on  the  distribution  of  gene  expression  across  subjects  and  cell  types.  Thus  applying  this  to  criteria,  we  obtained  a  list  of  173  marker  genes  that  can  subsequently  be  used  for  the  convolution.

In  this  table,  they  are  listed  together  with  the  values  that  meet  the  two  criteria  I've  talked  about.

Let  us  now  look  at  the  table  of  signature  genes  for  each  of  the  subject  and  major  cell  types.  Here  we  see  the  list  of  genes and  the  columns  correspond  to  each  other's  subjects  and  the  same  list  of  cell  types  for  each  of  the  subject.  In  contrast  to  the  table  of  NTF  loadings  that  we  showed  previously,  this  table  is  not  sparse,  so  it  is  limited  only  to  173  marker  genes  that  we  selected  according  to  our  criteria  with   [inaudible 00:25:26]  gene.  As  I  said,   the  columns  are  for  subject  cell  types.

We  will  now  visualize  how  the  signature  genes  are  expressed  across  subjects.  For  that,  we  will  use  the  second  plug-in  we  developed  which  is  called   adrubix.  This  will  allow  us  to  easily  assess  the  robustness  of  the  gene  signatures  we  got.

We  open  a  small  window  with  a  list  of  parameters  in  Python  syntax  as  for  the  first  plugin.  The  default  values  have  been  already  filled  in  and  we  will  now  only  fill  in  for  demonstration,  the  value  for  the  parameter  which  is  called  scale  along.

Since  the  expression  ranges  differ  significantly  between  genes,  we  will  set  this  parameter  to  zero  to  have  all  the  columns  scaled  and  we  submit.   Just  a  few  seconds  and  we  see  a  picture  with  distinct  signatures  for  each  major  cell  type.  However,  we  see  that  this  heat  map  is  glued  together  in  between  subjects  and  we  can  actually , with  this  plug-in,  very  easily  add  separators  in  between  the  columns  for  each  of  the  subjects.

For  this,  we  change  the  value  of  the  parameter  metadata   cols sep  to  subject  and  we  resubmit.  Now  we  get  the  separate  heat  maps  for  each  of  the  subjects.  This  is  much  clearer  and  as  you've  seen,  this  small  window  contains  numerous  parameters.   You  can  refer  to  the  documentation  for  the  underlying  Python  package  which  is  available  clicking  on  this  link.

Conclude,  we  see  here  actually  that  the  signature  genes  are  very  robust.  We  see  very  little  difference  in  different  subjects.   Like  that  we  can  visually  very  easily  confirm  the  robustness  of   [inaudible 00:28:09] .

Now  we  are  nearing  the  end  of  our  presentation,  I  will  let  Galina  present  some  deconvolution  results  that  were  obtained  with  the  gene  signature  [inaudible 00:28:22] . Galina.

Yes,  thank  you.   Last  but  not  least,  we  would  like  to  show  you  the  performance  of  our  method.   To  evaluate  the  performance  of  our  method,  we  use  the  published  data  coming  from  White  paper  so  mentioned  in  hints  his  reference.   This  data  set  containing  21  different  cell  mixtures.  There  are  the  mixtures  of  different  T  cell  subtypes,  NK  cells,  monocytes  and  B  cells.

In  these  mixtures,  we  have  also  three  different  cell  types  not  covered  by  our  signatures  corresponding  to  colorector  or  breast  cancer  cells  and  two  cell  lines  like  fibroblaster  and  the  tail  cells.

We  decided  to  use  two  criteria  evaluates  overall  performance  of  our  method  Pearson  correlation  coefficient  or   [inaudible 00:29:18]  a  concordance  coefficient  because  we  can  get  very  significant  correlation  between  estimated  and  ground  truth  proportions,  but  we  can  have  the  change  in  the  scale  identifying  the  quantity  of  cells  present  in  the  sample.   That  is  why  we  use  two  criteria.

Overall,  we  obtain  good  results  for  cell  types  covered  by  our  signatures.  The  results  are  slightly  not  so  beautiful  for  T  cells  because  we  don't  have  exact  correspondence  between  T  cell  subset  present  in  the  mixtures  and  T  cells  group  presented  in  our  signatures.  We  have  also  effecto r  cells  and  in  the  mixtures,  this  cell  type  was  not  added.

Last  but  not  least,  we  would  like  also  to  mention  with  our  method  you  can  go  beyond  and  identify  the  subtypes  of  T  cells,  the  monocytes  running  recursive  NTF  and  defining  the  marker  genes  for  each  big  group  of  immune  cells.   Also  that  these  add-ins  that  have  been  presented  and  demonstrated  today,  they  are  available  in  Jump  application.

With  this,  thank  you  for  your  attention.