Pushing the Boundaries of JMP: Introducing JMP Pro for Genomics (2022-US-45MP-1097)

Sam Gardner, Senior Product Manager, Health and Life Sciences, JMP Statistical Discovery LLC
Russ Wolfinger, Distinguished Research Fellow, JMP

 

JMP Pro 17 is a new standalone platform of choice for modern molecular-level data arising in such fields as genomics, metabolomics, and proteomics. Our previous product, JMP Genomics, relied on SAS for data import, processing, and analysis of the large data tables that are associated with -omic problems. New improvements in JMP Pro 17 provide an advanced level of capability and performance that allows it to stand on its own without the need for SAS.

 

However, the move from JMP Genomics to JMP Pro for Genomics revealed many aspects of JMP Pro that needed to improve. These improvements have pushed the boundaries of what the product can do so that it can now handle these large problems. As a result, JMP Pro 17 is one of the few advanced analytics software packages to provide a combination of an interactive and engaging user experience that allows for rapid point-and-click exploration of -omics data, advanced multivariate and predictive modeling tools, and a flexible and adaptive platform (through JMP Scripting and integration with other data science tools).

 

After defining -omics, this presentation examines the types of data used for these problems, the technical challenges that come with preparing and analyzing large wide data tables, and how JMP Pro 17 addresses these challenges. Examples of just how easy it is to do -omic data analysis in JMP Pro 17 are also demonstrated.

 

 

Hi,  this  is  Sam  Gardner  with  JMP.

I'm  a  Product  Manager  at   JMP.

We're  here  to  talk  today about  introducing  JMP Pro  for  Genomics,

pushing  the  boundaries  of   JMP Pro to  enable  data  science  on  the  desktop.

I  am  one  of  the  presenters.

I'll  be  doing  the  introduction to  this  topic.

I'm Senior Product Manager for Health and Life Sciences

in the  Product  Management  team  at  JMP.

Our co-presenter  today is  Russ   Wolfinger,

who's  a  Distinguished  Research  Fellow

and  our  Director  of  Scientific Discovery  and  Genomics  at  JMP.

We'll  talk  a  little  bit about  the  background

of  genetics  and  genomics, functional  genomics,

and  then  talk  about  what  we're  doing to  transition  from  our  former  product,

JMP Genomics, to using JMP Pro for genomics.

Russ  will  demonstrate  some of  the  new  capabilities  in  the  product.

A  little  bit  about  classical  genetics.

This  is  where  a  lot  of  this  got  started.

People  have  been  doing classical  genetics  for  a  long  time.

They've  been  breeding  plants  and  animals

to  get  desired  traits for  those  plants  and  animals.

They've seen that they can do that to get stronger animals, better plants,

plants with  desired  properties  and  so  on.

You  probably  studied  a  long  time  ago, when  you  were  young  in  school,

about  Gregor  Mendel,  the  monk, who  spent  many  years  studying  garden  peas.

He  actually  measured  seven  distinct characteristics  of  these  peas—

their  height,  their  pod  shape  and  color, seed  shape  and  color,

flower  position  and  color—

and  observed  that  as  these  peas were  crossbred  with  each  other,

that  the  traits  were  passed  on

from  the  parent  plants to  the  progeny  plants

following  some  rather specific  mathematical  ratios

which made it probabilistically possible

to  make  predictions about  what  the  progeny  would  look  like

based  on  the  traits  of  the parents.

His and  later  work  established the  principles  of  genetic  inheritance.
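The ratios Mendel observed can be reproduced with a small enumeration. What follows is a minimal sketch in Python, assuming a single gene with a dominant allele `A` and a recessive allele `a`; the function name `cross` is made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Return genotype probabilities for the offspring of two parents.

    Each parent is a two-letter genotype such as "Aa" and passes on one of
    its two alleles with equal probability.
    """
    counts = Counter(
        "".join(sorted(a1 + a2))  # sort so "aA" and "Aa" count as one genotype
        for a1, a2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


offspring = cross("Aa", "Aa")
# Probability of showing the dominant trait: any genotype with at least one "A"
dominant = sum(p for genotype, p in offspring.items() if "A" in genotype)
```

Crossing two heterozygous parents gives genotypes in a 1:2:1 ratio, so three out of four progeny are expected to show the dominant trait.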

What  is  genomics?

Genomics is  more than  just  classical  genetics.

Genomics  uses  a  combination of  DNA  measurement  methods

and  recombinant  DNA  methods

to  sequence  and  assemble  and  analyze the  structure  and  function  of  genomes.

It  differs  from  classical  genetics in that  it  looks

at  the  organism's  full  complement of  genetic  or  hereditary  material.

It  focuses  on  the  interactions between  the  loci  or  the  location

of  different  genes  on  the  genome,

and  the  alleles, the  variation  in  the  genes  in  the  genome,

so that you can understand things like epistasis, pleiotropy, and heterosis,

which  are  things  like,  okay, one  gene  affects  many  things.

That's  pleiotropy.

Epistasis is  that  sometimes,

one  gene  impacts  the  output or  the  effect  of  another  gene.

Heterosis  is  sometimes  you  get synergistic  effects  by  combining  the  genes

from  two  different  parents or  two  different  organisms.

This  all  relies  upon  the  use of  the  central  dogma  of  genomics.

That  dogma  is  that  DNA,  which is  the  code  for  our  biological  systems,

is  transcribed  into  RNA,

which  is  the  code that's  used  to  make  things

and  make  proteins  in  the  body.

The  proteins are  the  little  chemical  engines

that  do  things  inside  the  body and  give it  its  function.

From  that,  you  can  actually  then measure  things  like  metabolites,

what  actually  happens, what  do  those  proteins  actually  do

inside  the  cells  and  inside  the  body.

The  path  is  DNA  creates  RNA creates  protein,

and  the  protein  regulates how  things  function  in  the  body,

and  that  produces  metabolites.

Data  is  really  enabling a  genomics  revolution.

Modern  measurement  techniques are  really  helping  us  understand

the  structure  and  function of  the  genome

and how it works inside the cells in biological systems.

We  can  sequence  the  genome now.

We've got next-generation sequencing.

Many  years  ago, when  JMP  first moved  into  this  area,

helping  customers to be able to  analyze  this  type  of  data,

the way to measure it was microarrays,

which  was  much  more  focused on  very  specific  parts  of  the  genome,

and  oftentimes  a  very  limited  set of  genes  in  the  genome.

Now,  you  can  sequence the  whole  genome  of  an  organism.

Also,  you  can  look  at  things like  expression  and  regulation.

We're  talking  about  the  metabolites.

What  is the  output  into  the  biological system  that  you  can  measure?

You  can  look  at how  the  proteins  are  produced

or  what   those  proteins  are  doing.

You  can  also  look  at how  the  structure  of  the  DNA  itself,

what's  called  epigenetics,

impacts  the  function  of how  DNA  works and  how  the  genes  work  inside  the  body.

There  are  typically three  main  stages  of  analysis  that  happen

when  you're  doing  this  type  of  work.

One  is  you  just  generate  the  raw  data.

You do the sequencing work, generate the genome-sequencing data,

or  measure  the  metabolites or  the  protein  expression

or  the  RNA  expression.

And  then  that  generates pretty  large  data  sets

that have to be filtered and demultiplexed

and  trimmed  and  scored and  cleaned  up.

This is typically handled in an automated or semiautomated workflow

on  computer  systems  that  can process  very  large  data  files.

Then  it  typically  goes  into  a  second  stage where  you  start  to  do  sequence  alignment

and  basically  lining  things  up, and  being  able  to  do  things  like  counts.

How  many  times  did  I  see  the  expression

of  a  particular  RNA  fragment or  RNA  sequence?

Or how many times did I see a particular protein?

Or  all  this  raw  data,  how  does  it  line  up to  actually  make  a  picture

of  what  the  structure of  the  whole  genome  is?

That's  a  pretty  big mathematical  computational  process.

That  typically  also  gets  done on  pretty  large  computational  systems

with  a  lot  of  computational  resources.

And  then  the  third stage,

which is the stage where JMP has really played,

and where JMP Pro will continue to play,

is determining genotype associations and genotype-to-phenotype relationships.

A phenotype is just a trait of an organism,

so this is the relationship between the genes and the traits.

And  also  looking at  correlations  and  associations

of  the  different  genetic  markers inside  the  genome,

or the variants of the genetic markers.

Oftentimes,  what you   want to  do is  you  want  to  characterize  those

and  then  correlate  them to  physical,  biological,

or  maybe  disease  state  characteristics.

All  of  this  can  actually  be  done with  desktop  software.

JMP  Pro  is  our  solution  to  do  that going  forward  in  the  future.

We've  had  a  product  called  JMP  Genomics for  14  years,  up  until  this  year,

that  we  were  providing  the  customers.

It  was  a  combination  product of  JMP  and  SAS.

SAS was  really  needed  back  early when  we  first put  this  out

to  do  a  lot  of  the  data  processing,

because  the  size and  the  types  of  data  we  looked  at

were very difficult to handle with a desktop software package like JMP.

SAS did  the  data  processing, some  of  the  statistical  methods,

but  JMP  was  used for  further  statistical  analysis

and visualizing the results of those analyses.

JMP  Genomics  has  been used in research  and  industry

for  a  wide  variety  of  genomics  problems for  many  years.

But  we  made  a  strategic  decision  this  year

to discontinue selling products that contain SAS.

That's  part  of  the  decision  that  was  made for  JMP  to  become  an independent  company.

We're  a  wholly-owned  subsidiary of  SAS  now,

and  are  moving  down that  road  of  independence.

We  are  not  going  to  be  selling  anything but  JMP  products  going  forward.

Because  of  that, we  have  looked  now

to  move  the  functions for  genomic  data  analysis  into  JMP  Pro.

JMP Pro 17, which will be available this fall in 2022,

has  been  and  will  be  optimized for   big  and  wide  data  problems.

It's  going  to  have  capabilities to  meet  the  needs

of  genomic  data  science and  genomic  data  scientists.

It's  going  to  utilize  the  strength

of   JMP Pro's  predictive  analytics and  interactive  visualization

to  help  enable  discoveries in  this  area  of  work.

Some  of  the  enhancements  that  we've  made to  push  the  boundaries  of  JMP  Pro

include  just  removing  barriers and  bottlenecks  in  the  software.

It's  one  thing  to  do  analysis on  tens  or  hundreds  or  even  thousands

of  columns  in  a  data  table.

But  when  you  have  a  data  table

which  maybe  has  many  thousands or  hundreds  of  thousands  of  columns,

you  start  to  reveal  limitations sometimes  in  your  software.

By  doing  this  work, we've  uncovered  places

where  we  just  need  to  streamline how  operations  happen  inside  the  program.

We've  done  that.

An  example  would  be if  I  wanted  to  do  a  transformation

on  hundreds  of  thousands  of  columns, we've  significantly  improved  that  process.

It happens  much  faster on  the  data  tables.

Also  being  able  to  do  very  fast  and efficient   multivariate  analysis  methods

like  principal  component  analysis and  clustering,

when  you  have  these really  wide  genomic  data  tables.

And  then  being  able  to  do  models over  and  over  again

on  thousands  and  thousands of  response  columns,

and  to  do  that efficiently  and  effectively.

The  second  goal that  we  have  in  this  transition

is to bring in some capabilities in JMP Pro

that  are  very  specific for  genetic  and  genomic  analysis.

For  instance,  being  able to  import  different  formats

that  are  commonly  used  in  this  area.

Also,  being  able  to  do genetic  marker  analysis  and  simulation,

as  well  as  bringing  in  some newer  popular  data  reduction  methods

such as t-SNE and UMAP.

Overall,  what  we're  getting  to is  a  product  that's  going  to  be  lean.

It installs  very  quickly.

You  can  use  it  on  your  desktop,

but  you  can  use  it  to  do this  very  powerful  analysis

on  these  large, complex,  wide  data  tables.

To  illustrate  that, I'm  going  to  turn  it  over  to  Russ.

Russ  is  going  to  show  us  actually  how you  can  do  some  realistic  analysis

and  some  real  study  analysis  here on  some  genomic  and  genetic  data.

Well,  thank  you,  Sam.

It's  a  real  exciting  time  for  us.

I  know  I've  actually  been with   the  genomics  analysis  revolution

within  SAS  for  over  20  years  now.

We  actually   [inaudible 00:11:46]  in  the  early  2000s called  Scientific  Solutions,

where we were starting to look at some of the early microarray data.

It's  been  a  really  fun  20  years.

Now,  I  would  say,  almost  one of  the  most  exciting  times  ever  for  us,

where  we're  now  able  to  code some  of  these  routines

directly in JMP Pro using C++.

A  lot  of  them  are  running much  faster  than  we  had

in  the  previous  JMP  Genomics  product.

I want to give you a little flavor of that today with an example.

This  is  a  data  set  on   loblolly pines,

which  for  those  of  you from  the  Southeast

might  know  it  as  probably  one of  the  most  popular  species  of  pine.

Typically,  if  you  go into  Home Depot  or Lowe's

and buy some two-by-fours or plywood, it's going to be made of loblolly.

When  you  fly  into  the  area, you   happen to see  a  lot  of  tree  cover.

Many  of  those, I'd  say  a  good  chunk  of  those  trees,

especially towards the Eastern part of North Carolina, are loblollies.

It's  a  very  important  species,  one that  we  really  want  to  understand  well.

It's  been  studied  very  thoroughly, and  even  more so  now

that  we've  got  some  crunches  going  on with  home  building  and  what  have  you,

it's  critical  to  understand  it inside  and  out.

Genomic  technology  is  fantastic

for  revealing  some  things that  we  just  never  knew  before.

This  data  is  actually  still  10  years  old.

It was from a paper in the journal Genetics by Resende et al.

This  is  a  group  of  researchers from  the  University  of  Florida

and   Embrapa   in  Brazil

and  University  of  Iowa,  I  believe, if I  recall  correctly.

Here's  the  reference if  you  want  to  look  it  up.

The  data  are  also  freely  available.

I've  got  them.

I  went  ahead  and  downloaded  them from  the  supplemental  information

and  loaded  them into  a  JMP  table  that  you  see  here.

As  Sam  was  mentioning, the format   in JMP  Pro

is  what  we  typically like  to  call  a  wide  format,

where  we've  got  everything  in  one  table.

Here,  we've  got  some  genotype indicator  numbers  indicating  the  lines

as  well  as  the  mother  and  father that  the  trees  came  from.

And  then  this  specific  data  set that  I've  got  here,

we've  got  six  traits that  we've  measured.

I  believe  actually  there's  more.

I think  there's  17, if  you  want  to  see  the  reference.

Our  key  focus  of  interest are  these  genetic  markers.

This  data  set's  small by  today's  standards.

We've  only  got  4,800.

I  say  "Only  4,800" but that's  still  quite  a  few.

As  you  can  see, I'm  scrolling  through  here,

they're  all  coded as  either  zero,  one,  or two.

These  are  so-called  SNP  markers, single  nucleotide  polymorphisms,

where  we'll  have  either...

The  number  here  indicates

the number of copies of the major allele that we have in the data.

Zero  would  be  the  little A, little A,

if you're  familiar with  the  old  genetics  notation.

The  twos  would  be  the  big  A, big A.

The  ones  would  be  all  the  heterozygotes.

So 4,800 of these markers.
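To make the zero, one, two coding concrete, here is a minimal sketch in Python. The function names and the tiny genotype list are hypothetical; the sketch assumes each genotype arrives as a two-letter string and that missing values are stored as `None`.

```python
def encode_genotype(genotype, major_allele):
    """Count copies of the major allele in a two-letter genotype (0, 1, or 2)."""
    return sum(1 for allele in genotype if allele == major_allele)


def major_allele_frequency(codes):
    """Frequency of the major allele from 0/1/2 codes, skipping missing (None)."""
    observed = [c for c in codes if c is not None]
    # Each individual carries two alleles, so divide by twice the sample size.
    return sum(observed) / (2 * len(observed))


genotypes = ["AA", "Aa", "aa", "aA", "AA"]
codes = [encode_genotype(g, "A") for g in genotypes]
frequency = major_allele_frequency(codes)
```

Both heterozygote spellings (`Aa` and `aA`) map to the same code of one, which is why a single integer per tree and marker is enough for the analyses that follow.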

The  basic  goal  in  the  end,  typically...

In  fact,  that  was  what  the  paper that  this  was  from  was  about.

They  were  comparing  several of  the  popular  predictive  methods.

But  before   we  get  to  prediction,

there's  a  lot  of  really  good  things that  you  want  to  do

just  to  make  sure the  data  are  as  expected,

and  also  to  learn  and  discover  structure and  other  interesting  characteristics.

Let's  dive  in  and  see  what we can do with a  typical  workflow  here  in  JMP Pro.

I  would  typically  just  like to  look  at  the  data  in  JMP.

We  can  use  just  basic  platforms.

For example, here, let me bring up the Multivariate platform

and  just  check  out  basic  plots of  the  data  against  one  another.

You can see, for example, here, rootnum and rootnumbin

are  fairly  highly  correlated with  each  other.

Other ones,  not  so  much.

You   can  do  distributions.

For example, we can do it here with the Distribution platform.

These  traits  have  actually already  been  centered,  I  think.

I believe  all  of  them have  a  mean  of  around  zero.

They've  gone  through a  little  bit  of  pre-processing

that  we  won't  go  into  today.

That's the  way  they  came  from  the  paper.

Our  basic  goal  is  to  use  the  genetic information  to  predict  these  traits.

They  represent  various  characteristics of  the  loblolly  trees.

For example, CWAC,

I believe that's crown width across the planting beds.

It's  a  measure  of  the  tree  size.

We've got other measurements of density and characteristics of the roots, etc.

All  important  things  to  know  about and  when  studying  these  trees.

Let me walk you through what we might consider a basic workflow

once  you  have  your  data  set  up  like  this.

Now,  before  doing  that, though,

I  do  want  to  mention  too that  we  have  put  in  a  fair  bit  of  work

to  helping  and  aiding with  importing  such  data.

This particular data came as just standard comma-separated value files,

so  no  big  deal  to  import  it.

But  often,  genetic  data  like  this come  in  so-called  VCF  files.

We now  have  new  routines to  be  able  to  import  those  directly,

as  well  as  import  files from  the  popular  database,

and  then  a  few  other  formats, IDAT  and  what  have  you.

Trying  to  make  it  really  easy to  get  your  data  into  JMP.

As you know,  once  you've  got your  data  set  up  in  a  JMP  table,

there's  just  all  kinds of  great  things  you  can  do.

Many  of  the  things  that  you  hear  about...

Give  you  some  more  ideas,

as  well  as  some  new  things that  we've  put  into  place.

To  start  out,  we've  got a  brand  new  couple  of  platforms

under  the  Analyze  menu  here  at  the  bottom.

Genetics. Analyze,  Genetics.

We've  got  Marker  Statistics and  Marker  Simulation.

Let's  run  the  first one, Marker  Statistics.

This  is  just  a  basic  platform  for  looking at  characteristics  of  a  set  of  markers.

You  can  see  here,  I'm  loading.

We've  got  4,853  SNPs  organized in  a  group  here  in  the  JMP  table.

I  just  move  them  over  into  the  markers.

If everything else  is  okay, we'll just  click  OK.

It  runs  quite  quickly.

What  this  basically  does is  it takes  each  marker

and computes a variety of standard statistical genetics measures

that  you  can  look  across  here and  see  what's  going  on.

A key thing to check for is so-called Hardy-Weinberg equilibrium.

You can do a statistical test of that and get p-values from it,

and  even  plot  these  along in  a  graph  like  this.

On the Y axis, we actually use the negative log10 p-value,

which we also call the LogWorth.
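For readers who want to see the arithmetic behind such a test, here is a minimal sketch in Python of a chi-square test of Hardy-Weinberg equilibrium for one marker. It is an assumption that a chi-square test with one degree of freedom is a reasonable stand-in; the talk does not say which test the Marker Statistics platform uses, and the function name `hwe_logworth` is made up.

```python
import math


def hwe_logworth(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg equilibrium for one biallelic marker.

    n_aa, n_ab, n_bb are the observed counts of genotype codes 0, 1, and 2.
    Returns (chi_square, p_value, logworth), where logworth = -log10(p_value).
    """
    n = n_aa + n_ab + n_bb
    q = (2 * n_bb + n_ab) / (2 * n)  # frequency of the allele the code counts
    p = 1.0 - q
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi_square = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    logworth = -math.log10(p_value) if p_value > 0 else float("inf")
    return chi_square, p_value, logworth


in_equilibrium = hwe_logworth(25, 50, 25)
no_heterozygotes = hwe_logworth(50, 0, 50)
```

A marker whose genotype counts match the expected proportions gets a LogWorth near zero, while a marker with no heterozygotes at all gets a very large one.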

To go one step further, you can make a false discovery rate adjustment

to  avoid  the  multiple  testing  problem.

You  can  see  here, we've  actually  plotted  both:

the raw p-value, the raw LogWorth, as well as their FDR-adjusted p-value.

They  tend  to  be  quite  similar, especially  for  the  large  ones.
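A false discovery rate adjustment can be sketched in a few lines of Python. This assumes the Benjamini-Hochberg procedure, which is the most common FDR adjustment; the talk does not name the exact method, and the function name is made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-values are penalized the most relative to their size, and adjusted values never exceed one.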

These  markers  up  here  are  ones that  would  be  out  of  equilibrium,

very  likely  due to  the  cloning  of  the  trees.

These would be  markers that  might  tend  to  drift  or  stabilize

over  time  with  future  crosses.

It would be good  to  check  these  out

and  make  sure  the  distributions of  the  alleles  are  as  expected.

Harking all the way back to the Gregor Mendel days,

things  that  we  learned  about how   alleles  like this  should  behave.

That's  a  good  place  to  start, just  to  get  an  idea  for  the  markers.

Let's  move  next and  do  some  pattern  discovery.

Here,  there's  several  nice  things we  can  try.

A  very  basic  one  that's  also  been  popular for  decades  with  gene  expression  data

is  just  to  do  hierarchical  clustering.

Again,  I'm  just  going  to  put the  SNPs  in  here.

You  typically  will  want  to  use one  of  these  faster  methods.

Let's use Fast Ward.

We  do  have  some  missing  values, so  let's  do  imputation.

We'll  go  ahead  and  cluster it  two  ways.

Let's  click  OK  here.
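The imputation step can be pictured with a minimal Python sketch. It assumes simple mean imputation per marker column, which is only one of several possible schemes; the talk does not say which method the clustering platform applies, and the function name is hypothetical.

```python
def impute_column_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to impute from, leave the column as is
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


marker = [0, None, 2, 1]
filled = impute_column_mean(marker)
```

After imputation every row has a value for every marker, so distances between trees can be computed over all columns.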

I'm  going  to  go ahead. I'm running  everything  live  today.

A  few  of  these  things will  take  seconds  to  run.

I've got some analyses that will actually take a few minutes

that  I  won't  run  live just  for  sake  of  time.

But  you  can  see  here, this  scale  of  data,

JMP  Pro  can handle  fairly  readily.

This  one,  you  can  see that  the  progress  bar  here

will  take  probably   30 seconds to  a  minute  to  finish.

But not too bad for a medium-sized data set like this.

Again,  we're  clustering  around  926  rows and  4,800  columns.

But before the performance enhancements,

this  kind of analysis would  take  several  minutes.

In many cases, we've been able to achieve orders-of-magnitude speedups.

We're able, basically, to enable you to do analyses like this in close to real time.

A  little  bit  of  waiting  might  be  required as  here,  but  in  general,

it's pretty  nice  to  be  able  to  quickly get  answers  to  fairly  difficult  questions.

For  example,  here,  we're  trying  to  see

how  other  rows  of  our  data cluster  with  each  other.

Now  here,  a  very  interesting  thing  occurs.

You  can  see  I've  got  colorings that  I  did  to  the  data.

I  colored   the  mother  and  father, or maternal  and  paternal  alleles.

If  we  look  at  this  variable  here, there's  around  71  unique  levels.

And  then  within  each  cross, there's   up  to  17  or  20  individuals.

The  data  have  very  nice,  tight  clusters.

The  clustering  algorithm actually  found  those.

You can  see  the  colors indicate  the  coloring.

This  color  theme  is  a  bit  jarring.

Let's  move  it  to black  and  white.

We  can  see  the  structure a  little  more  cleanly.

Here,  we  can  see  the  areas  of  white

are where we've got some of those minor alleles starting to cluster

and  identifying  the  key  places in  the  genome

that  distinguish  these  unique  crosses.

This  is  a  nice  plot  just  to  get an  overall  feel  for  the  various  lines

and  how  they  compare  with  one  another.

But the main lesson is these tight clusters

that  are  mapping  up  exactly  like we  would  expect  with  the  initial  crosses,

basically  like  very  close  siblings to  one  another

compared to cousins, or second cousins, third cousins, etc.

Now,  another  way  to  go  about  this

would  be  more  of  a  dimension reduction  type  approach.

Here,  the  number  one  analysis is  principal  components.

Let's try that on our SNPs and see what that reveals.

Here,   let's  just   use  the  defaults.

Sorry,  actually, I  wanted  to  show  off...

There's  a  brand  new  method  for  wide  data that's  called  fast  approximate.

It's  a  nice   addition  in  software.

It actually uses, if you're familiar with it,

a method called randomized SVD.

You  can  see  a little  message.

Let's  see  what's  in  the  log.

It  turned  out  this  was  actually  one  case

where  an  error  message was  quite  beneficial.

The  software  actually  indicated which  markers...

There  were  some  markers, they  were  non-numeric  or  constant.

It  turned  out  that  a  handful  of  these markers  in  the  table  were  constant.

This  would  be  a  case  where  we  could go  back  and  actually  clean  those  out,

since  they're  not  really  contributing  much to  the  analysis,  they're  just  constant.

But  the  PCA  platform found  them  as  a  byproduct.
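Cleaning out constant markers is easy to sketch in Python. The table layout (a dictionary of column name to values, with `None` for missing) and the names used here are assumptions for illustration, not how JMP stores or reports them.

```python
def constant_columns(table):
    """Return names of columns with fewer than two distinct non-missing values."""
    flagged = []
    for name, values in table.items():
        distinct = {v for v in values if v is not None}
        if len(distinct) < 2:
            flagged.append(name)
    return flagged


markers = {
    "snp_1": [0, 1, 2, 1],
    "snp_2": [2, 2, 2, 2],     # constant
    "snp_3": [0, None, 0, 0],  # constant once missing values are ignored
}
to_drop = constant_columns(markers)
cleaned = {name: col for name, col in markers.items() if name not in to_drop}
```

Constant columns have zero variance, so they cannot contribute to principal components or to any regression slope.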

But  if  you  look  at  the  scores, first two  principal  components,

we  again  have  this nice  clustering  of  families.

As  usual  with  JMP,

all  these  plots  are  interactive and  connected  to  one  another.

We  can,  for  example,  click  on one  of  the  branches  of  the  tree  over  here,

and  it  will  highlight that  cluster  in  the  PCA.

We  can  map  these  two  graphs to  one  another.

In  fact,  well,  let's  do  that. We  can  add  a  third one.

This  is  another  brand  new  platform that's  just  coming  out  in  JMP  17,

called  Multivariate  Embedding.

Here,  we're  going  to  compute the  popular  t-SNE  algorithm,

which stands for t-distributed stochastic neighbor embedding.

This  has  actually   been  quite  popular in  the  machine  learning  world,

and   it has  trickled   its  way into  the  genomics  field,

especially with single-cell RNA.

It  does  a  little  bit  different dimensional  projection  than  PCA.

It  tries  to  identify  local  structure,

whereas  PCA  is  looking  for  dimensions of  largest  variability  across  all  markers.

T-SNE's  trying  to  find tight  local  clusters.

It's  actually  perfect for this kind of data,

just  to  reveal  these  families.

You  can  see  the  nice  little  groups of  clusters,  and  maybe  more  importantly,

which  clusters  themselves are  near  each  other.

You  can  take  a  picture  here.

Kind of looks like a butterfly, something t-SNE will often produce.

I'd  encourage  you  to  try  it  on  your  data once  you  get  your  hands  on  JMP  17.0.

That's  revealing  some nice  structure  in  the  data.

Let's move on now to add some more statistically-oriented modeling.

For this, the basic thing to usually start out with

is what we would call a genome-wide association study,

where we'll basically take our trait, or our traits, in this case,

and  screen  them  against  all  the  markers.

The  workhorse  platform  here is  Response  Screening.

I'm  going  to  Analyze,  Screening, Response  Screening.

We've done quite a bit of work on this thanks especially to John Sall,

who has  implemented  some  nice performance  improvements.

What  this  does  is  basically a  big  Y  by  X  analysis.

I'm  going  to  move  our  six  targets or  responses  into  the  Y  field,

our  SNPs  into  X.

And  then  all  you  do  is  hit  Go.

What  this  will  do...

I  think  we  do  imputation.

I  think  it  might  do  that  automatically. Let's  see.

Yeah.

This  one  runs  lightning  fast.

It basically just did six times 4,800 quick regressions and plotted them all.

This  is  a  plot  of  all the   p-values  at  once.
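The idea of a big Y by X analysis can be sketched in plain Python as one simple regression per trait and marker pair. This is an illustration rather than the Response Screening implementation: the names are made up, the data are invented, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples.

```python
import math


def simple_regression(x, y):
    """Least-squares fit of y on x; returns (slope, approx_p_value, logworth)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("marker is constant; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    if std_err == 0:
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    else:
        t_stat = slope / std_err
    # Assumption: normal approximation to the t distribution (large n)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    logworth = -math.log10(p_value) if p_value > 0 else float("inf")
    return slope, p_value, logworth


def screen(responses, markers):
    """Run one regression per (trait, marker) pair."""
    return {
        (trait, marker): simple_regression(x, y)
        for trait, y in responses.items()
        for marker, x in markers.items()
    }


traits = {"trait_a": [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]}
snps = {"snp_1": [0, 1, 2, 0, 1, 2], "snp_2": [0, 2, 1, 2, 0, 1]}
results = screen(traits, snps)
```

Each entry in `results` holds a slope and a LogWorth, which are the two quantities plotted against each other in a volcano plot.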

Again,  focusing  on  false  discovery  rate.

You've got to be very careful about overfishing data like this.

You  want  to  make  sure any  lead  that  you  chase  is  significant,

even  after  a  false  discovery  adjustment.

Here,  we  see  now that  this  crown width  feature

is  the  one  that's  popping  out with  the  most  hits.

Then  there's  one  for  rustbin.

These  are  sorted  by  significance,  and  then some  of  the  other  traits  start  to  pop  in.

But  clearly,  it  looks  like we've  got  the  most  genetic  action

with  this  crown width  trait.

Now,  to  go  a  little  further and  illustrate  the  things  we  can  do.

This  is  very  JMP  Pro like.

Let's  save  the  table  out  of   p-values.

We've  got  everything  now in  a  new  JMP  table,

which  is  effectively  all  the  results, and  they're  nicely  colored  for  us.

Just  want  to  browse  the  table.

But I'm going to go ahead and use Graph Builder now.

Let's make some volcano plots by hand.

For these, we want to put the slope on the X-axis,

and then the LogWorth on the Y.

Let's  go  ahead.

We'll  make  a  separate  one for  each  of  our  traits.

I'm  dragging  that  onto  the  wrap.

You can see here, this is the kind of thing that JMP is really good at.

It often will find outliers in the data.

Here's  one  that's  way  out  here.

We've  got  a  slope  estimate of  nearly  negative  2,000.

It  turns  out  that  this  variable is  nearly  constant.

The regression just blows up with a nearly vertical,

highly negative slope.

It  turns  out  this  is  more  of  an  anomaly than  an  actual  significant  hit.

It  would  actually  make  sense just  to  ignore  it.

But  it's  actually  nice  to  find that  it's  in  the  table

and  be  able  to  identify  it.

This  is  the  kind of thing that  JMP  is  often  really  good  at,

finding   weird  patterns.

But  to  hone  in  on  the  key  results, let's  go  ahead  and  narrow  our  axes  down.

I  just  hit  the  axis  button, and  we're  going  to  just  zoom  in.

Let's  go  minus   10 to 10.

You  can  see  here, you  get  this  characteristic  V  shape,

where   again, we're  plotting  the  slope  of  the  regression

versus  its  negative  log  p-value.

For CWAC,  we  actually  got,

again,  as  we  expected  before, more  hits  than  anywhere  else.

A  bunch  of  markers for  positive  and  negative  slope,

which would indicate an additive genetic relationship

going  one  way  to  the  other.

For  the  other  traits, these  are  also  V  shape,

and  many  of  them are  just  really  a  lot  less  significant

and often squished in with one another.

The  slope  also  depends on  the  scale  of  the  measurement.

It's maybe  not  quite  as  meaningful if we  put  all  these  on  the  same  exact  scale.

But  I  just  wanted  to  show  this for  illustration,

as  a  way  to  compare  everything side  by  side.

That's a GWAS.

Moving  forward,  let's  get  to  probably what  our  main  objective  would  be,

which  would  be  to  predict  these traits  as  a  function  of  the  markers.

Here,  we  do  have  access  to  all the  great  predictive  modeling  platforms

that  are  in  JMP.

Some  of  these,  you  have to  be  a  little  careful  to  use.

With  missing  data, you  may  need  to  do  the  imputation  first.

Some  might  become  quite  slow given  the  size  of  the  problem.

For  today,  I  just  want  to  show probably  my  favorite  one,

which is XGBoost, using the XGBoost platform.

This is a case where I actually ran this beforehand,

because it takes a while to run...

But I loaded all six traits into XGBoost and did ten-fold cross-validation.

It automatically left out each of the ten folds.

Here,  you  can  see  the  results  of  that  run,

where  we've  got  the  solid  lines here  in  these  graphs,

are  the  validation  curves over  the  iterations

and the dotted lines are the training.

You  can  see  with  these  wide  problems, there's  a  severe  risk  of  overfitting,

especially with a powerful approach like XGBoost.

You  have  to  be  very  careful.

As you  can  see,  I actually [inaudible 00:32:18]  parameters.

I  could  tweak  them  down  for  one,  and  you can  see  the  other  parameters  here.

Within  each  model  fit, we've  got  both  the  training,

observed  versus  predicted, and  the  validation.

You can see here for CWAC, we got a correlation of around 0.43.

Correlation  is  a  typical  measure used  to  assess  performance.

This  is  competitive  with  what  was published  in  the  paper  before,

with hardly any tuning at all.
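The way a cross-validated correlation is computed can be sketched in Python. This sketch uses a one-predictor least-squares line as a stand-in model, not XGBoost, and simulated data; all names here are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def k_fold_predictions(x, y, k, fit, seed=0):
    """Predict every row with a model trained on the other k - 1 folds."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    predictions = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = model(x[i])
    return predictions


def fit_line(xs, ys):
    """Stand-in model (not XGBoost): least-squares line on one predictor."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((v - mean_x) ** 2 for v in xs)
    slope = sum((v - mean_x) * (w - mean_y) for v, w in zip(xs, ys)) / sxx
    return lambda v: mean_y + slope * (v - mean_x)


rng = random.Random(1)
x = [rng.choice([0, 1, 2]) for _ in range(100)]       # simulated marker codes
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]          # simulated trait
predictions = k_fold_predictions(x, y, k=10, fit=fit_line)
r = pearson(predictions, y)
```

Because every prediction comes from a model that never saw that row, the resulting correlation is an honest estimate of predictive ability, unlike the correlation on the training data.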

But  then  there's  a  lot  of  other interesting  things  you  can  dive  into,

the  most  important  features,  etc.

We've even got some new things, for instance, one thing called Shapley values

that  I'd  encourage  you  to  check  out.

There's  going  to  be  another  talk on  this  topic  by  Peter  Hirsch,

Florian  Laura  Lancaster and myself  on  that  here  at  the   conference,

I  would  encourage  you  to  check  that  out.

It's  a  way  to  break  down predictions  into  their  components.
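To show what breaking a prediction into components means, here is a minimal Python sketch that computes exact Shapley values by averaging over every ordering of the features. It is a brute-force illustration with a made-up toy model, and it assumes absent features take a baseline value; it is not the algorithm used by the XGBoost platform, which the talk does not describe.

```python
from itertools import permutations


def shapley_values(predict, row, baseline):
    """Exact Shapley values for one prediction.

    Features that are not yet "switched on" keep their baseline value.
    Practical only for a handful of features, since the number of orderings
    grows factorially.
    """
    n = len(row)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = row[j]
            value = predict(current)
            totals[j] += value - previous  # marginal contribution of feature j
            previous = value
    return [t / len(orderings) for t in totals]


def toy_model(m):
    # Hypothetical predictor: two additive marker effects plus an interaction
    return 1.0 + 2.0 * m[0] - 1.0 * m[1] + 0.5 * m[0] * m[2]


row, baseline = [2, 1, 2], [0, 0, 0]
phi = shapley_values(toy_model, row, baseline)
```

The contributions add up to the difference between the prediction for the row and the prediction at the baseline, and the interaction effect is split evenly between the two markers involved.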

That  gets  another  level you  can  go  into  with  predicting.

That's  just  one  example  of  some nice  predictive  modeling  you  can  do.

To  wrap  up  the  demo, I  wanted  to  return  back

where  we  started  here in  this  Genetics  menu.

We've  got  a  marker, a  brand  new  marker  simulation  platform.

This  is  some  pretty advanced  genetic  modeling

carried  out  by  our  internal  expert, Luciano  Silva.

What  this  does  is  it  actually  will  do virtual  crossing  by  the  genotypes.

The  idea  is  you'd  load  the  markers  in.

The  really  interesting  thing is  you  can  put  a  predictor  formula  here.

For example, I saved the predictor formula from the XGBoost model of CWAC.

What  this  will  do  is  both simulate  the  crosses

and  predict  their  performance.

This  is  what  modern virtual  breeding  does.

You  can  actually  virtually  cross different  loblolly  pine  trees

and  predict  what  will  happen  with  them

without  having  to  wait  10,  20,  30  years to  grow  them  in  the  field.
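A virtual cross can be sketched in Python under strong simplifying assumptions: markers are coded zero, one, or two, they segregate independently (no linkage, which real genomes violate), and the prediction formula is a made-up additive model standing in for a saved predictor. None of this reflects how the Marker Simulation platform is implemented.

```python
import random


def gamete(genotype, rng):
    """Draw one parental gamete from 0/1/2 marker codes.

    A homozygote (0 or 2) always passes on the same allele; a heterozygote (1)
    passes on the counted allele with probability 1/2.
    """
    return [g // 2 if g != 1 else rng.randint(0, 1) for g in genotype]


def simulate_cross(mother, father, n_offspring, predict, seed=0):
    """Simulate offspring genotypes and predict a trait for each one."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_offspring):
        child = [a + b for a, b in zip(gamete(mother, rng), gamete(father, rng))]
        results.append((child, predict(child)))
    return results


def additive_predictor(genotype, effects=(0.4, -0.2, 0.1, 0.3)):
    # Hypothetical stand-in for a saved prediction formula
    return sum(e * g for e, g in zip(effects, genotype))


mother = [2, 1, 0, 1]
father = [0, 1, 2, 2]
progeny = simulate_cross(mother, father, 5, additive_predictor)
```

Ranking many simulated crosses by their predicted trait values is what lets a breeder choose promising parents before planting anything.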

Extremely  powerful,  interesting  approach that  revolutionized  the  way

modern  breeding  is  done, and  why  so-called  genomic  selection,

or  predictive  modeling with  genetic  markers  is  so  popular.

I'll  go  ahead  and  conclude  there.

I  hope  that  whetted  your  appetite

with  some  of  the  new  things we've  got  going.

A  lot  of  the  things  I  showed  today would  also  work  with  gene  expression  data,

although that's a little bit of a different ballgame

in  terms  of  what  you're  trying  to  do.

But  for  sake  of  time,  I  thought  it  would be  good  just  to  look  at  this  one  example

and   dive  somewhat  deep.

Thank  you  very  much  for  your  attention.

Let  us  know  if  you've  got  questions as  you  have  them.

We're really excited about the new things coming in JMP Pro 17.

We've  got  a  lot  more  things coming  in  the  works.

Thank  you very much.

We recognize that for a lot of people that come to Discovery,

this  may  not  be  their  area  of  expertise.

But you  may  know  somebody who's  doing  this  work,

and  we  would  love  to  get  them  connected with  what  we're  doing  here  at  JMP  Pro,

because  we  are  going  to  continue to  invest  in  adding  capabilities

and  improving  the  software  so  it  can do  work  like  this  better  and  better

to  meet  the  needs  of  scientists across  the  life  sciences

and  this  industry.

Thanks  for  listening  in.