Hi, this is Sam Gardner with JMP.
I'm a Product Manager at JMP.
We're here to talk today about introducing JMP Pro for Genomics,
pushing the boundaries of JMP Pro to enable data science on the desktop.
I am one of the presenters.
I'll be doing the introduction to this topic.
I'm a Senior Product Manager for Health and Life Sciences
in the Product Management team at JMP.
Our co-presenter today is Russ Wolfinger,
who's a Distinguished Research Fellow
and our Director of Scientific Discovery and Genomics at JMP.
We'll talk a little bit about the background
of genetics and genomics, functional genomics,
and then talk about what we're doing to transition from our former product,
JMP Genomics, to using JMP Pro for genomics.
Russ will demonstrate some of the new capabilities in the product.
A little bit about classical genetics.
This is where a lot of this got started.
People have been doing classical genetics for a long time.
They've been breeding plants and animals
to get desired traits for those plants and animals.
They've seen that they can do that to get stronger animals, better plants,
plants with desired properties and so on.
You probably studied a long time ago, when you were young in school,
about Gregor Mendel, the monk, who spent many years studying garden peas.
He actually measured seven distinct characteristics of these peas—
their height, their pod shape and color, seed shape and color,
flower position and color—
and observed that as these peas were crossbred with each other,
the traits were passed on
from the parent plants to the progeny plants
following some rather specific mathematical ratios
that made it probabilistically possible
to make predictions about what the progeny would look like
based on the traits of the parents.
His and later work established the principles of genetic inheritance.
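As a rough illustration of those ratios, here is a minimal Python sketch of a single-gene cross. It assumes one gene with a dominant allele written "A" and a recessive allele written "a", and that each parent passes either of its two alleles with equal probability. The function name `cross` is made up for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict:
    """Return offspring genotype probabilities for a single-gene cross.

    Each parent is a two-letter genotype such as "Aa"; every combination of
    one allele from each parent is assumed to be equally likely.
    """
    counts = Counter(
        "".join(sorted(a + b))  # sorted() puts "A" before "a", so "aA" becomes "Aa"
        for a, b in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Cross two heterozygous parents.
offspring = cross("Aa", "Aa")

# The dominant trait shows whenever at least one "A" allele is present.
dominant = sum(p for genotype, p in offspring.items() if "A" in genotype)

print(offspring)
print(dominant)
```

Crossing two heterozygotes this way gives the 1:2:1 genotype split and the 3:1 split between the dominant and recessive trait.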
What is genomics?
Genomics is more than just classical genetics.
Genomics uses a combination of DNA measurement methods
and recombinant DNA methods
to sequence and assemble and analyze the structure and function of genomes.
It differs from classical genetics in that it looks
at the organism's full complement of genetic or hereditary material.
It focuses on the interactions between the loci or the location
of different genes on the genome,
and the alleles, the variation in the genes in the genome,
so that you can understand things like epistasis, pleiotropy, and heterosis.
Sometimes one gene affects many things.
That's pleiotropy.
Epistasis is that sometimes,
one gene impacts the output or the effect of another gene.
Heterosis is when you get synergistic effects by combining the genes
from two different parents or two different organisms.
This all relies upon the use of the central dogma of genomics.
That dogma is that DNA, which is the code for our biological systems,
is transcribed into RNA,
which is the code that's used to make things
and make proteins in the body.
The proteins are the little chemical engines
that do things inside the body and give it its function.
From that, you can actually then measure things like metabolites,
what actually happens, what do those proteins actually do
inside the cells and inside the body.
The path is DNA creates RNA creates protein,
and the protein regulates how things function in the body,
and that produces metabolites.
Data is really enabling a genomics revolution.
Modern measurement techniques are really helping us understand
the structure and function of the genome
and how it works inside the cells in biological systems.
We can sequence the genome now.
We've got next-generation sequencing.
Many years ago, when JMP first moved into this area,
helping customers to be able to analyze this type of data,
the way to measure it was microarrays,
which was much more focused on very specific parts of the genome,
and oftentimes a very limited set of genes in the genome.
Now, you can sequence the whole genome of an organism.
Also, you can look at things like expression and regulation.
We're talking about the metabolites.
What is the output into the biological system that you can measure?
You can look at how the proteins are produced
or what those proteins are doing.
You can also look at how the structure of the DNA itself,
what's called epigenetics,
impacts the function of how DNA works and how the genes work inside the body.
There are typically three main stages of analysis that happen
when you're doing this type of work.
One is you just generate the raw data.
You do the sequencing work, generate the genome-sequencing data,
or measure the metabolites or the protein expression
or the RNA expression.
And then that generates pretty large data sets
that have to be filtered and demultiplexed
and trimmed and scored and cleaned up.
This is typically handled in an automated or semiautomated workflow
on computer systems that can process very large data files.
Then it typically goes into a second stage where you start to do sequence alignment
and basically lining things up, and being able to do things like counts.
How many times did I see the expression
of a particular RNA fragment or RNA sequence?
Or how many times did I see a particular protein?
Or all this raw data, how does it line up to actually make a picture
of what the structure of the whole genome is?
That's a pretty big mathematical computational process.
That typically also gets done on pretty large computational systems
with a lot of computational resources.
And then the third stage,
which is the stage where JMP really has played,
and where JMP Pro will continue to play,
is determining genotype associations and genotype-to-phenotype relationships.
A phenotype is just a trait of organisms,
the relationship between the genes and the traits.
And also looking at correlations and associations
of the different genetic markers inside the genome,
or the variants of the genetic markers.
Oftentimes, what you want to do is you want to characterize those
and then correlate them to physical, biological,
or maybe disease state characteristics.
All of this can actually be done with desktop software.
JMP Pro is our solution to do that going forward in the future.
We've had a product called JMP Genomics for 14 years, up until this year,
that we were providing the customers.
It was a combination product of JMP and SAS.
SAS was really needed back early when we first put this out
to do a lot of the data processing,
because the size and the types of data we looked at
were very difficult to handle with a desktop software package like JMP.
SAS did the data processing, some of the statistical methods,
but JMP was used for further statistical analysis
and visualizing the results of those analyses.
JMP Genomics has been used in research and industry
for a wide variety of genomics problems for many years.
But we made a strategic decision this year
to discontinue selling products that contain SAS with them.
That's part of the decision that was made for JMP to become an independent company.
We're a wholly-owned subsidiary of SAS now,
and are moving down that road of independence.
We are not going to be selling anything but JMP products going forward.
Because of that, we have looked now
to move the functions for genomic data analysis into JMP Pro.
JMP Pro 17, which will be available this fall in 2022,
has been and will be optimized for big and wide data problems.
It's going to have capabilities to meet the needs
of genomic data science and genomic data scientists.
It's going to utilize the strength
of JMP Pro's predictive analytics and interactive visualization
to help enable discoveries in this area of work.
Some of the enhancements that we've made to push the boundaries of JMP Pro
include just removing barriers and bottlenecks in the software.
It's one thing to do analysis on tens or hundreds or even thousands
of columns in a data table.
But when you have a data table
which maybe has many thousands or hundreds of thousands of columns,
you start to reveal limitations sometimes in your software.
By doing this work, we've uncovered places
where we just need to streamline how operations happen inside the program.
We've done that.
An example would be if I wanted to do a transformation
on hundreds of thousands of columns, we've significantly improved that process.
It happens much faster on the data tables.
Also being able to do very fast and efficient multivariate analysis methods
like principal component analysis and clustering,
when you have these really wide genomic data tables.
And then being able to do models over and over again
on thousands and thousands of response columns,
and to do that efficiently and effectively.
The second goal that we have in this transition
is to bring some capabilities into JMP Pro
that are very specific for genetic and genomic analysis.
For instance, being able to import different formats
that are commonly used in this area.
Also, being able to do genetic marker analysis and simulation,
as well as bringing in some newer popular data reduction methods
such as t-SNE and UMAP.
Overall, what we're getting to is a product that's going to be lean.
It installs very quickly.
You can use it on your desktop,
but you can use it to do this very powerful analysis
on these large, complex, wide data tables.
To illustrate that, I'm going to turn it over to Russ.
Russ is going to show us actually how you can do some realistic analysis
and some real study analysis here on some genomic and genetic data.
Well, thank you, Sam.
It's a real exciting time for us.
I know I've actually been with the genomics analysis revolution
within SAS for over 20 years now.
We actually [inaudible 00:11:46] in the early 2000s called Scientific Solutions,
where we were starting to look at some of the early microarray data.
It's been a really fun 20 years.
Now, I would say, almost one of the most exciting times ever for us,
where we're now able to code some of these routines
directly in JMP Pro using C++.
A lot of them are running much faster than we had
in the previous JMP Genomics product.
I want to give you a little flavor of that today with an example.
This is a data set on loblolly pines,
which for those of you from the Southeast
might know it as probably one of the most popular species of pine.
Typically, if you go into Home Depot or Lowe's
and buy some two-by-fours or plywood, it's going to be made of loblolly.
When you fly into the area, you happen to see a lot of tree cover.
Many of those, I'd say a good chunk of those trees,
especially towards the eastern part of North Carolina, are loblollies.
It's a very important species, one that we really want to understand well.
It's been studied very thoroughly, and even more so now
that we've got some crunches going on with home building and what have you,
it's critical to understand it inside and out.
Genomic technology is fantastic
for revealing some things that we just never knew before.
This data is actually still 10 years old.
It was from a paper in the journal Genetics by Resende et al.
This is a group of researchers from the University of Florida
and Embrapa in Brazil
and University of Iowa, I believe, if I recall correctly.
Here's the reference if you want to look it up.
The data are also freely available.
I've got them.
I went ahead and downloaded them from the supplemental information
and loaded them into a JMP table that you see here.
As Sam was mentioning, the format in JMP Pro
is what we typically like to call a wide format,
where we've got everything in one table.
Here, we've got some genotype indicator numbers indicating the lines
as well as the mother and father that the trees came from.
And then this specific data set that I've got here,
we've got six traits that we've measured.
I believe actually there's more.
I think there's 17, if you want to see the reference.
Our key focus of interest are these genetic markers.
This data set's small by today's standards.
We've only got 4,800.
I say "Only 4,800" but that's still quite a few.
As you can see, I'm scrolling through here,
they're all coded as either zero, one, or two.
These are so-called SNP markers, single nucleotide polymorphisms,
where we'll have either...
The number here indicates
the number of copies of the major allele that we have in the data.
Zero would be the little A, little A,
if you're familiar with the old genetics notation.
The twos would be the big A, big A.
The ones would be all the heterozygotes.
So 4,800 of these markers.
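To make the zero, one, two coding concrete, here is a small Python sketch. It assumes genotype calls arrive as two-letter strings and that "A" is the major allele. The helper name `encode_genotype` and the example calls are made up for illustration and do not come from the loblolly data.

```python
def encode_genotype(genotype: str, major: str) -> int:
    """Count copies of the major allele in a two-letter genotype call."""
    if len(genotype) != 2:
        raise ValueError(f"expected two alleles, got {genotype!r}")
    return sum(1 for allele in genotype if allele == major)


# Hypothetical calls for one marker across four trees.
calls = ["AA", "Aa", "aA", "aa"]
codes = [encode_genotype(call, major="A") for call in calls]

# Major allele frequency: copies observed over total alleles (two per tree).
major_freq = sum(codes) / (2 * len(codes))

print(codes)
print(major_freq)
```

Both heterozygote orderings, "Aa" and "aA", map to the same code of one, which is why the ones in the table cover all the heterozygotes.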
The basic goal in the end, typically...
In fact, that was what the paper that this was from was about.
They were comparing several of the popular predictive methods.
But before we get to prediction,
there's a lot of really good things that you want to do
just to make sure the data are as expected,
and also to learn and discover structure and other interesting characteristics.
Let's dive in and see what we can do with a typical workflow here in JMP Pro.
I would typically just like to look at the data in JMP.
We can use just basic platforms.
For example, here, let me bring up the Multivariate platform
and just check out basic plots of the data against one another.
You can see, for example, here, Rootnum and Rootnum_bin
are fairly highly correlated with each other.
Other ones, not so much.
You can do distributions.
For example, we can do it here with the Distribution platform.
These traits have actually already been centered, I think.
I believe all of them have a mean of around zero.
They've gone through a little bit of pre-processing
that we won't go into today.
That's the way they came from the paper.
Our basic goal is to use the genetic information to predict these traits.
They represent various characteristics of the loblolly trees.
For example, CWAC,
I believe that's crown width across the planting beds.
It's a measure of the tree size.
We've got other measurements of density and characteristic of the roots, etc.
All important things to know about when studying these trees.
Let me walk you through what we might consider a basic workflow
once you have your data set up like this.
Now, before doing that, though,
I do want to mention too that we have put in a fair bit of work
to helping and aiding with importing such data.
This particular data came as just standard comma-separated value files,
so no big deal to import it.
But often, genetic data like this come in so-called VCF files.
We now have new routines to be able to import those directly,
as well as import files from the popular database,
and then a few other formats, IDAT and what have you.
Trying to make it really easy to get your data into JMP.
As you know, once you've got your data set up in a JMP table,
there's just all kinds of great things you can do.
Many of the things that you hear about...
Give you some more ideas,
as well as some new things that we've put into place.
To start out, we've got a brand new couple of platforms
under the Analyze menu here at the bottom.
Genetics. Analyze, Genetics.
We've got Marker Statistics and Marker Simulation.
Let's run the first one, Marker Statistics.
This is just a basic platform for looking at characteristics of a set of markers.
You can see here, I'm loading.
We've got 4,853 SNPs organized in a group here in the JMP table.
I just move them over into the markers.
If everything else is okay, we'll just click OK.
It runs quite quickly.
What this basically does is it takes each marker
and computes a variety of standard statistical genetic statistics
that you can look across here and see what's going on.
A key thing to check for is so-called Hardy-Weinberg equilibrium.
You can do a statistical test of that and get p-values from it,
and even plot these along in a graph like this.
On the Y axis, we actually use the negative log10 p-value,
which we also call the LogWorth.
To go one step further, you can make a false discovery rate adjustment
to avoid the multiple testing problem.
You can see here, we've actually plotted both:
the raw p-value, the raw LogWorth, as well as the FDR-adjusted p-value.
They tend to be quite similar, especially for the large ones.
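For readers who want to see the arithmetic behind these statistics, here is a minimal Python sketch. It is not how the Marker Statistics platform is implemented. It assumes a one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium and the Benjamini-Hochberg procedure for the false discovery rate adjustment. The function names and the genotype counts are made up for illustration.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P-value of a 1-df chi-square test of Hardy-Weinberg equilibrium.

    Arguments are observed counts of one homozygote, the heterozygote,
    and the other homozygote for a single marker.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the first allele
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def logworth(p: float) -> float:
    """Negative log10 of a p-value."""
    return -math.log10(p)


# Hypothetical genotype counts for three markers.
counts = [(25, 50, 25), (50, 0, 50), (30, 40, 30)]
raw = [hwe_chi_square(*c) for c in counts]
fdr = benjamini_hochberg(raw)

for c, p_raw, p_fdr in zip(counts, raw, fdr):
    print(c, round(logworth(p_raw), 2), round(logworth(p_fdr), 2))
```

The second marker has no heterozygotes at all, so it lands far out of equilibrium and gets a large LogWorth both before and after adjustment.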
These markers up here are ones that would be out of equilibrium,
very likely due to the cloning of the trees.
These would be markers that might tend to drift or stabilize
over time with future crosses.
It would be good to check these out
and make sure the distributions of the alleles are as expected.
Harking all the way back to the Gregor Mendel days,
things that we learned about how alleles like this should behave.
That's a good place to start, just to get an idea for the markers.
Let's move next and do some pattern discovery.
Here, there's several nice things we can try.
A very basic one that's also been popular for decades with gene expression data
is just to do hierarchical clustering.
Again, I'm just going to put the SNPs in here.
You typically will want to use one of these faster methods.
Let's use Fast Ward.
We do have some missing values, so let's do imputation.
We'll go ahead and cluster it two ways.
Let's click OK here.
I'm going to go ahead. I'm running everything live today.
A few of these things will take seconds to run.
I've got analyses that will actually take a few minutes
that I won't run live just for sake of time.
But you can see here, this scale of data,
JMP Pro can handle fairly readily.
This one, you can see that the progress bar here
will take probably 30 seconds to a minute to finish.
But not too bad for a medium-sized data set like this.
Again, we're clustering around 926 rows and 4,800 columns.
But before actually the performance enhancements,
this kind of analysis would take several minutes.
In many cases, we've been able to achieve orders-of-magnitude speedups.
We're able, basically, to enable you to do analyses like this close to real time.
A little bit of waiting might be required as here, but in general,
it's pretty nice to be able to quickly get answers to fairly difficult questions.
For example, here, we're trying to see
how the rows of our data cluster with each other.
Now here, a very interesting thing occurs.
You can see I've got colorings that I did to the data.
I colored the mother and father, or maternal and paternal alleles.
If we look at this variable here, there's around 71 unique levels.
And then within each cross, there's up to 17 or 20 individuals.
The data have very nice, tight clusters.
The clustering algorithm actually found those.
You can see the colors indicate the coloring.
This color theme is a bit jarring.
Let's move it to black and white.
We can see the structure a little more cleanly.
Here, we can see the areas of white
are where we've got some of those minor alleles starting to cluster
and identifying the key places in the genome
that distinguish these unique crosses.
This is a nice plot just to get an overall feel for the various lines
and how they compare with one another.
But the main lesson is these tight clusters
that are mapping up exactly like we would expect with the initial crosses,
basically like very close siblings to one another
compared to cousins, or second cousins, third cousins, etc.
Now, another way to go about this
would be more of a dimension reduction type approach.
Here, the number one analysis is principal components.
Let's try that on our SNPs and see what that reveals.
Here, let's just use the defaults.
Sorry, actually, I wanted to show off...
There's a brand new method for wide data that's called fast approximate.
It's a nice addition in software.
It actually uses, if you're familiar with it,
a method called a randomized SVD approach.
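For anyone curious what a randomized SVD looks like, here is a minimal NumPy sketch of the general technique. It is not JMP's implementation. It assumes a complete numeric matrix with no missing values, and the function name, the oversampling and power-iteration settings, and the synthetic test data are all choices made for this illustration.

```python
import numpy as np


def randomized_pca(X, n_components=2, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the leading principal components with a randomized SVD.

    X has one row per individual and one column per marker.
    Returns (scores, singular_values, loadings).
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)  # center each column
    k = min(n_components + n_oversamples, min(Xc.shape))
    # Sketch the column space with a random Gaussian test matrix.
    Q, _ = np.linalg.qr(Xc @ rng.standard_normal((Xc.shape[1], k)))
    # Power iterations sharpen the sketch when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    # Exact SVD of the small projected matrix (k rows instead of all rows).
    U_small, s, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    U = Q @ U_small
    scores = U[:, :n_components] * s[:n_components]
    return scores, s[:n_components], Vt[:n_components]


# Synthetic wide table: strong rank-2 structure plus a little noise.
rng = np.random.default_rng(1)
signal = 3 * rng.standard_normal((60, 2)) @ rng.standard_normal((2, 500))
X = signal + 0.1 * rng.standard_normal((60, 500))

scores, s, loadings = randomized_pca(X, n_components=2)
exact = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[:2]
print(s)
print(exact)
```

The appeal for wide tables is that the expensive exact decomposition only runs on a matrix with a dozen or so rows, while the full table is only touched through matrix multiplications.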
You can see a little message.
Let's see what's in the log.
It turned out this was actually one case
where an error message was quite beneficial.
The software actually indicated which markers...
There were some markers, they were non-numeric or constant.
It turned out that a handful of these markers in the table were constant.
This would be a case where we could go back and actually clean those out,
since they're not really contributing much to the analysis, they're just constant.
But the PCA platform found them as a byproduct.
But if you look at the scores, first two principal components,
we again have this nice clustering of families.
As usual with JMP,
all these plots are interactive and connected to one another.
We can, for example, click on one of the branches of the tree over here,
and it will highlight that cluster in the PCA.
We can map these two graphs to one another.
In fact, well, let's do that. We can add a third one.
This is another brand new platform that's just coming out in JMP 17,
called Multivariate Embedding.
Here, we're going to compute the popular t-SNE algorithm,
which stands for t-distributed stochastic neighbor embedding.
This has actually been quite popular in the machine learning world,
and it has trickled its way into the genomics field,
especially with single-cell RNA.
It does a little bit different dimensional projection than PCA.
It tries to identify local structure,
whereas PCA is looking for dimensions of largest variability across all markers.
T-SNE's trying to find tight local clusters.
It's actually perfect for this kind of data,
just to reveal these families.
You can see the nice little groups of clusters, and maybe more importantly,
which clusters themselves are near each other.
You can take a picture here.
Kind of looks like a butterfly, something t-SNE will often have.
I'd encourage you to try it on your data once you get your hands on JMP 17.0.
That's revealing some nice structure in the data.
Let's move on now to add some more statistically oriented modeling.
For it, the basic thing to usually start out with
is what we would call a genome-wide association study,
where we'll basically take our trait, or our traits, in this case,
and screen them against all the markers.
The workhorse platform here is Response Screening.
I'm going to Analyze, Screening, Response Screening.
We've done quite a bit of work on this thanks especially to John Sall,
who has implemented some nice performance improvements.
What this does is basically a big Y by X analysis.
I'm going to move our six targets or responses into the Y field,
our SNPs into X.
And then all you do is hit Go.
What this will do...
I think we do imputation.
I think it might do that automatically. Let's see.
Yeah.
This one runs lightning fast.
I basically just did six times 4,800 quick regressions and plotted all.
This is a plot of all the p-values at once.
Again, focusing on false discovery rate.
You've got to be very careful about overfishing data like this.
You want to make sure any lead that you chase is significant,
even after a false discovery adjustment.
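The "big Y by X" idea can be sketched in a few lines of NumPy. This is an illustration of per-marker simple regression, not the Response Screening platform itself. It assumes complete data with no constant marker columns, and it uses a normal approximation for the p-values, which is only reasonable when there are many rows. The function name and the simulated data are made up.

```python
import math

import numpy as np


def screen_markers(y, markers):
    """Regress one trait on each marker column separately.

    Returns (slopes, pvalues). P-values use a normal approximation to the
    t statistic, an assumption that is reasonable only for large samples.
    Constant marker columns are not handled and would divide by zero.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(markers, dtype=float)
    n = len(y)
    xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sxx = (xc ** 2).sum(axis=0)
    slopes = (xc * yc[:, None]).sum(axis=0) / sxx
    resid = yc[:, None] - xc * slopes
    se = np.sqrt((resid ** 2).sum(axis=0) / (n - 2) / sxx)
    z = np.abs(slopes / se)
    # Two-sided normal tail probability: erfc(|z| / sqrt(2)).
    pvalues = np.array([math.erfc(v / math.sqrt(2)) for v in z])
    return slopes, pvalues


# Simulated data: 500 individuals, 20 markers coded 0/1/2.
rng = np.random.default_rng(0)
markers = rng.integers(0, 3, size=(500, 20))
# Only marker 3 truly affects the trait, with a slope of 1.5.
trait = 1.5 * markers[:, 3] + rng.standard_normal(500)

slopes, pvalues = screen_markers(trait, markers)
top = int(np.argmin(pvalues))
print(top, slopes[top])
```

With several traits, the same function would simply be called once per trait, and the resulting p-values would then go through a false discovery rate adjustment before any lead is chased.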
Here, we see now that this crown width feature
is the one that's popping out with the most hits.
Then there's one for rustbin.
These are sorted by significance, and then some of the other traits start to pop in.
But clearly, it looks like we've got the most genetic action
with this crown width trait.
Now, to go a little further and illustrate the things we can do.
This is very JMP Pro-like.
Let's save the table out of p-values.
We've got everything now in a new JMP table,
which is effectively all the results, and they're nicely colored for us.
Just want to browse the table.
But I'm going to go ahead and use Graph Builder now.
Let's make some volcano plots by hand.
For these, we want to put the slope on the X-axis,
and then the LogWorth on the Y.
Let's go ahead.
We'll make a separate one for each of our traits.
I'm dragging that onto the wrap.
You can see here, this is the kind of thing that JMP is really good at.
It often will find outliers in the data.
Here's one that's way out here.
We've got a slope estimate of nearly negative 2,000.
It turns out that this variable is nearly constant.
The regression just blows up with a nearly vertical,
highly negative slope.
It turns out this is more of an anomaly than an actual significant hit.
It would actually make sense just to ignore it.
But it's actually nice to find that it's in the table
and be able to identify it.
This is the kind of thing that JMP is often really good at,
finding weird patterns.
But to hone in on the key results, let's go ahead and narrow our axes down.
I just hit the axis button, and we're going to just zoom in.
Let's go minus 10 to 10.
You can see here, you get this characteristic V shape,
where again, we're plotting the slope of the regression
versus its negative log p-value.
For CWAC, we actually got,
again, as we expected before, more hits than anywhere else.
A bunch of markers for positive and negative slope,
which would indicate an additive genetic relationship
going one way to the other.
For the other traits, these are also V shape,
and many of them are just really a lot less significant
and often squished in with one another.
The slope also depends on the scale of the measurement.
It's maybe not quite as meaningful if we put all these on the same exact scale.
But I just wanted to show this for illustration,
as a way to compare everything side by side.
That's a GWAS.
Moving forward, let's get to probably what our main objective would be,
which would be to predict these traits as a function of the markers.
Here, we do have access to all the great predictive modeling platforms
that are in JMP.
Some of these, you have to be a little careful to use.
With missing data, you may need to do the imputation first.
Some might become quite slow given the size of the problem.
For today, I just want to show probably my favorite one,
which is XGBoost, using the XGBoost platform.
This is a case where I actually ran this beforehand,
because it takes a while to run.
But I loaded all six traits into XGBoost and did ten-fold cross-validation.
It automatically left out each of the ten folds.
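As a reminder of what leaving out each fold means, here is a small Python sketch of ten-fold splitting. It assumes rows are assigned to folds at random, which may differ from how the XGBoost platform builds its folds. The function name is made up, and the row count of 926 is taken from the clustering step earlier in the demo.

```python
import random


def kfold_indices(n_rows, k=10, seed=0):
    """Yield (train_rows, validation_rows) for each of k folds.

    Rows are shuffled once, dealt into k folds, and each fold is held
    out in turn while the remaining rows are used for training.
    """
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = sorted(r for r in rows if r not in held)
        yield train, sorted(held_out)


splits = list(kfold_indices(926, k=10))
for train, validation in splits:
    # A model would be fit on `train` and scored on `validation` here.
    print(len(train), len(validation))
```

Every row ends up in the validation set exactly once, so the held-out predictions across the ten folds cover the whole table.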
Here, you can see the results of that run,
where we've got the solid lines here in these graphs,
are the validation curves over the iterations
and the dotted lines are the training.
You can see with these wide problems, there's a severe risk of overfitting,
especially with a powerful approach like XGBoost.
You have to be very careful.
As you can see, I actually [inaudible 00:32:18] parameters.
I could tweak them down for one, and you can see the other parameters here.
Within each model fit, we've got both the training,
observed versus predicted, and the validation.
You can see here for CWAC, we got a correlation of around 0.43.
Correlation is a typical measure used to assess performance.
This is competitive with what was published in the paper before,
with hardly any tuning at all.
But then there's a lot of other interesting things you can dive into,
the most important features, etc.
We've even got some new things, for instance, one thing called Shapley values
that I'd encourage you to check out.
There's going to be another talk on this topic by Peter Hirsch,
Florian Laura Lancaster and myself here at the conference.
I would encourage you to check that out.
It's a way to break down predictions into their components.
That's another level you can go into with predicting.
That's just one example of some nice predictive modeling you can do.
To wrap up the demo, I wanted to return back
where we started here in this Genetics menu.
We've got a brand new Marker Simulation platform.
This is some pretty advanced genetic modeling
carried out by our internal expert, Luciano Silva.
What this does is it actually will do virtual crossing by the genotypes.
The idea is you'd load the markers in.
The really interesting thing is you can put a predictor formula here.
For example, I saved the predictor formula from the XGBoost model of CWAC.
What this will do is both simulate the crosses
and predict their performance.
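To give a feel for the idea, here is a deliberately simplified Python sketch of virtual crossing. It is not the Marker Simulation platform's algorithm. It assumes markers are inherited independently, ignoring linkage and recombination along chromosomes, and it swaps in a made-up additive formula for the saved XGBoost predictor. All names, parent genotypes, and marker effects are hypothetical.

```python
import random


def simulate_cross(mother, father, n_progeny, rng):
    """Simulate progeny marker codes (0/1/2) from two parents.

    A parent with code c passes the major allele with probability c / 2.
    Markers are treated as independent, a simplifying assumption that
    ignores linkage between neighboring markers.
    """
    def gamete(parent):
        return [1 if rng.random() < code / 2 else 0 for code in parent]

    return [
        [m + f for m, f in zip(gamete(mother), gamete(father))]
        for _ in range(n_progeny)
    ]


def predict(genotype, effects, intercept=0.0):
    """Hypothetical additive predictor: intercept + sum(effect * code)."""
    return intercept + sum(e * g for e, g in zip(effects, genotype))


rng = random.Random(42)
mother = [2, 1, 0, 1]            # made-up parent genotypes at four markers
father = [2, 1, 2, 0]
effects = [0.5, -0.2, 0.1, 0.3]  # made-up marker effects

progeny = simulate_cross(mother, father, n_progeny=1000, rng=rng)
predicted = [predict(g, effects) for g in progeny]
mean_pred = sum(predicted) / len(predicted)
print(mean_pred)
```

Comparing the predicted trait distributions across many candidate crosses is what lets a breeder rank pairings before planting anything.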
This is what modern virtual breeding does.
You can actually virtually cross different loblolly pine trees
and predict what will happen with them
without having to wait 10, 20, 30 years to grow them in the field.
Extremely powerful, interesting approach that revolutionized the way
modern breeding is done, and why so-called genomic selection,
or predictive modeling with genetic markers is so popular.
I'll go ahead and conclude there.
I hope that whetted your appetite
with some of the new things we've got going.
A lot of the things I showed today would also work with gene expression data,
although that's a little bit different ballgame
in terms of what you're trying to do.
But for sake of time, I thought it would be good just to look at this one example
and dive somewhat deep.
Thank you very much for your attention.
Let us know if you've got questions as you have them.
We're really excited about the new things coming in JMP Pro 17.
We've got a lot more things coming in the works.
Thank you very much.
We recognize that for a lot of people that come to Discovery,
this may not be their area of expertise.
But you may know somebody who's doing this work,
and we would love to get them connected with what we're doing here at JMP Pro,
because we are going to continue to invest in adding capabilities
and improving the software so it can do work like this better and better
to meet the needs of scientists across the life sciences
and this industry.
Thanks for listening in.