Building A Single-Cell RNA-Sequencing Workflow with JMP® Project (2020-US-30MP-603)

Level: Intermediate

Meijian Guan, JMP Research Statistician Developer, SAS

Single-cell RNA-sequencing technology (scRNA-seq) provides a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment. Recently, it has been used to combat COVID-19 by characterizing transcriptional changes in individual immune cells. However, it also poses new challenges in data visualization and analysis due to its high dimensionality, sparsity, and varying heterogeneity across cell populations. JMP Project is a new way to organize data tables, reports, and scripts, as well as external files. In this presentation, I will show how to create an integrated Basic scRNA-seq workflow using JMP Project that performs standard exploration on a scRNA-seq data set. It first selects a set of highly variable genes using a dispersion or a variance-stabilizing transformation (VST) method. Then it further reduces data dimensionality and sparsity by performing a sparse SVD analysis. It then generates an interactive report that consists of a data overview, variable gene plot, hierarchical clustering, feature importance screening, and a dynamic violin plot of individual gene expression levels. In addition, it uses the R integration feature in JMP to perform t-SNE or UMAP visualizations on the cell populations if the appropriate R packages are installed.

Auto-generated transcript...

Speaker	Transcript
Meijian Guan	All right. Hi, everyone. Thank you so much for attending this presentation. I'm so happy to have this opportunity to share the work I
	have been doing with the JMP Life Sciences group and SAS Institute. Today's topic is building a single-cell RNA-sequencing workflow with JMP Project. This is a new feature I developed for JMP Genomics 10. If you don't know what JMP Genomics is, I will give you a brief
	overview of it. JMP Project is a new feature, released in JMP 14, and it's a very nice tool that can help you organize your reports. So we took advantage of this new platform and organized a single-cell RNA-sequencing
	workflow into it. First of all, I want to give you a little background on JMP Genomics. JMP Genomics is
	one of the products in the JMP family, and it is built on top of SAS and JMP Pro. It takes advantage of both products, which makes it a very powerful analytical tool.
	It's designed for genomic data, so it can read in different types of genomic data, do preprocessing, and handle next-generation sequencing data analysis.
	It is really good at differential gene expression and biomarker discovery, and many scientists use it for crop and livestock breeding. So it's a very powerful tool. I encourage everyone to check it out if you are doing anything related to genomics.
	The next thing I want to share with you is single-cell RNA sequencing. Many of you may not be very familiar with it.
	This is a relatively new technology used to examine gene expression levels in individual cells.
	Compared to traditional RNA-sequencing technology, which surveys the average expression level of a group of cells,
	this new technology provides a higher resolution of cellular differences and gives you a better understanding of the function of an individual cell in the context of its microenvironment.
	It can help you do many things, such as uncover new and rare cell populations, track trajectories of cell development, and identify differentially expressed genes between cell types. So it has very wide application.
	One recent application is that scientists are using it to combat COVID-19, because it can be used to
	characterize transcriptional changes in immune cells and help develop vaccines and treatments. In addition to that, it's widely used in cancer research, immunology, and many other research fields. So it's a very powerful tool, but
	the data does pose some challenges for analysis, which is why we put together this workflow.
	I just want to give you an overview of the single-cell RNA-sequencing pipeline. The first thing you need is to get a sample,
	either from humans or from animals. It could be a tumor or a lab sample. Then you isolate individual cells from those samples.
	After isolation, you can do sequencing on every individual cell for all the genes you have. For example, in humans, we have about 30,000 genes. So the final product will look like this read count table. We have 30,000 genes in rows and we have
	sometimes about half a million cells as columns. So as you can see, this is a very large data set with very high dimensionality. You can also notice the zeros in the table.
	Because of technical or biological limitations, there's no way we can detect every single gene in every single cell. So it's not uncommon to see that 90% of the entries in the table are
	zeros. So it's very sparse. Sparsity is another challenge when you analyze single-cell RNA-sequencing data.
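To make the sparsity point concrete, here is a minimal Python sketch (not part of the JMP Genomics workflow) that measures the fraction of zeros in a toy read-count table and keeps only the nonzero entries. The table values and the function names `sparsity` and `to_sparse` are made up for illustration.

```python
# Toy read-count table: genes in rows, cells in columns (all values are made up).
counts = [
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [2, 0, 0, 0, 7],
]


def sparsity(matrix):
    """Return the fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def to_sparse(matrix):
    """Store only the nonzero entries as {(gene_index, cell_index): count}."""
    return {
        (i, j): value
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value != 0
    }


frac_zero = sparsity(counts)
nonzero = to_sparse(counts)
print(f"{frac_zero:.0%} zeros, {len(nonzero)} nonzero entries kept")
```

Storing only the nonzero entries is the usual reason sparse formats are attractive for this kind of data: at 90% zeros, roughly a tenth of the table carries all of the information.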
	But after you do preprocessing, cleanup, and dimension reduction, you can apply regular methods such as clustering, principal components, and differential gene expression analysis to this data. Those will be covered in my workflow.
	I already mentioned some of the challenges, including high dimensionality, high sparsity, and varying heterogeneity across cell populations.
	There are also technical noise and reproducibility. Since there are so many different sequencing protocols and so many different analytical packages
	in R, Python, or other tools, it's very hard to follow the exact steps to analyze your data. And if you mix up the steps and don't do things in the correct order,
	you may not be able to get reproducible results. So that's one of the problems that we tried to solve here.
	I just want to show you an example of single-cell RNA-sequencing data. This data will be used in my demonstration, and it's a reduced blood sample data set, or PBMC. Here we have cells in rows and genes in columns, so it's
	about 8000 columns and 100
	rows, which are the cells. You can see those zeros pretty much everywhere. I believe it's more than 90% sparsity in this specific data set.
	So what's in our new single-cell RNA-sequencing workflow in JMP Genomics 10? We tried to build this workflow
	for people who do not have a strong technical background or do not have time to learn how to code and all those statistics. In this workflow we put the steps in the right order, so they are executed automatically for users. We also
	provide very interactive reports to help users navigate the results, change the parameters, and check out different selections.
	The workflow includes the data import process and preprocessing, and we have variable gene selection, which is actually the backbone of this workflow.
	The goal of variable gene selection is to reduce the dimension of the genes,
	because for human samples we have 30,000 genes and not all of them are informative. So we try to pick the most informative ones.
	We provide two methods: a dispersion method and
	a variance-stabilizing transformation (VST) method based on LOESS regression. I will not go into the details, but these two methods are widely used in the research community, and I'm pretty happy that we were able to reproduce them.
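The talk does not go into the details of either method. As a rough illustration of the idea behind dispersion-based selection only, the following Python sketch ranks genes by their variance-to-mean ratio and keeps the top few. This is a simplified stand-in, not the JMP Genomics implementation: it skips the binning and normalization steps that published dispersion methods typically add, and it does not attempt the VST/LOESS method. The gene names, counts, and function names are hypothetical.

```python
from statistics import mean, pvariance

# Toy expression table: gene name -> counts across six cells (values are made up).
expression = {
    "GENE_A": [0, 0, 9, 0, 0, 9],  # switched on in a few cells only
    "GENE_B": [3, 3, 3, 3, 3, 3],  # constant across cells
    "GENE_C": [1, 2, 1, 2, 1, 2],  # mild variation
    "GENE_D": [0, 0, 0, 0, 0, 0],  # never detected
}


def dispersion(values):
    """Variance-to-mean ratio; genes with zero mean get a dispersion of 0."""
    m = mean(values)
    if m == 0:
        return 0.0
    return pvariance(values) / m


def select_variable_genes(table, n_top):
    """Rank genes by dispersion (highest first) and keep the top n_top."""
    scores = {gene: dispersion(values) for gene, values in table.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top], scores


selected, scores = select_variable_genes(expression, n_top=2)
print(selected)
```

Here `n_top` plays the role of the "number of genes to keep" setting mentioned in the talk (2000 or 3000 in practice).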
	We also apply sparse SVD to further reduce the dimensions, and then we apply hierarchical clustering and k-means clustering.
	We have feature importance screening using a bootstrap forest method in JMP, and if you have the R packages installed, we can directly call out to R for t-SNE and UMAP visualizations,
	which are very popular in single-cell RNA-sequencing analysis. We also provide some dynamic visualizations, including a violin plot, ridgeline plot, and dot plot; we also do differential gene expression. All the reports are organized into one integrated report with JMP Project.
	Next I will do a demo. There are two goals in this demo. The first is to classify the cell populations in this PBMC data set; we try to find what cell types are in this data set. The second goal is to identify differentially expressed genes across subtypes and conditions.
	First of all, let's go to the JMP Genomics Starter. The JMP Genomics interface looks quite different from regular JMP, but
	it's pretty easy to navigate. If you go to the workflows, under basic,
	the basic single-cell RNA-sequencing workflow lives here. Click that to bring up this interface. The interface is pretty intuitive, I'd say. You just provide a data set.
	Then you specify the QC options: what kinds of genes or cells you want to remove from your analysis. Then the variable gene selection options: which method you want to use. If you select VST, you can also specify the number of genes you want to keep, 2000 or 3000.
	Then the clustering options: how many principal components you want to use for the clustering, and
	whether you want the hierarchical or k-means clustering algorithm. Under more options, we have marker genes,
	which lets you add a list of marker genes you want to use to identify the cell populations, a very handy tool. You can also launch ANOVA and differential expression analysis. That is a separate report,
	and I will not discuss it in this talk. Another thing we have is the experimental design file. If you add that, you can provide any information related to the study design, like treatment information or
	sex information. This is
	simulated data here. I just want to show you how we can compare gene expression levels and different measurements between groups.
	Finally, we have embedding options, which can call out to the t-SNE or UMAP R packages if you have them installed. You can change different parameters for these two algorithms.
	After you specify all those options, just click Run, and then you have a report that looks like this one. This is a
	tabbed report. There are a total of seven tabs in this report. I organized them in the order that you would want to go through them. The first tab is a data overview:
	how many genes are detected
	in the cells, how many read counts there are, what percentage of the counts in your data come from mitochondrial genes, and the correlations between these three measurements.
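The three per-cell measurements in this overview can be sketched in a few lines of Python. This is an illustration only, not the workflow's code. It assumes mitochondrial genes can be recognized by the `MT-` prefix used in human gene symbols, and the thresholds `min_genes` and `max_pct_mito` are hypothetical QC cutoffs chosen for the toy data.

```python
# Gene order for each cell's count vector; counts are made up for illustration.
genes = ["MT-CO1", "MT-ND1", "LYZ", "GNLY", "PPBP"]
counts_by_cell = {
    "cell_1": [1, 0, 6, 0, 3],
    "cell_2": [10, 5, 0, 5, 0],
    "cell_3": [0, 0, 0, 4, 0],
}


def qc_metrics(gene_names, cell_counts, mito_prefix="MT-"):
    """Compute genes detected, total read count, and percent mitochondrial counts."""
    total = sum(cell_counts)
    detected = sum(1 for count in cell_counts if count > 0)
    mito = sum(
        count
        for gene, count in zip(gene_names, cell_counts)
        if gene.startswith(mito_prefix)  # assumption: human "MT-" naming
    )
    pct_mito = 100.0 * mito / total if total else 0.0
    return {"n_genes": detected, "n_counts": total, "pct_mito": pct_mito}


def passes_qc(metrics, min_genes=2, max_pct_mito=20.0):
    """Hypothetical filter: enough genes detected and not too much mitochondrial signal."""
    return metrics["n_genes"] >= min_genes and metrics["pct_mito"] <= max_pct_mito


metrics = {cell: qc_metrics(genes, counts) for cell, counts in counts_by_cell.items()}
kept = [cell for cell, m in metrics.items() if passes_qc(m)]
print(kept)
```

In this toy data, `cell_2` is dropped for a high mitochondrial percentage and `cell_3` for having only one detected gene, which mirrors the kind of cell filtering the QC options control.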
	You'll notice that on the left side we have the action box. You can expand it and find the options in this box; you can do many things with it.
	In this tab specifically, you can split the graph based on the conditions you provided in the experimental design file. For example, we can choose treatment and split by
	Drug1, Drug2, and placebo. Then you can see if there's any difference between the groups, and we can unsplit if we want to go back to the original plot.
	The second tab is variable gene selection, which is the backbone of this workflow. The red dots are the genes selected for subsequent analysis, and the
	gray dots are the genes that will be discarded from the analysis. If you expand the action box, you can see that, since we used VST, we specified 2000 genes
	in this analysis. But whenever you change your mind, you can type in a different number of genes and then click OK. All the tabs will then be refreshed based on the new number.
	After you have a list of variable genes, the next step is to further reduce the dimensions by performing sparse SVD analysis, which is equivalent to principal component analysis.
	After you apply SVD analysis, you can plot the top two SVD components, or principal components, to check the global structure of your data set.
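For readers who want to see what an SVD component is computationally, here is a minimal Python sketch that finds the leading singular triplet of a small cells-by-genes matrix with power iteration. It is a plain truncated SVD, not the sparse SVD algorithm used in JMP, and the matrix and function names are made up. The link to PCA is that the right singular vectors of a column-centered matrix are its principal axes; this toy matrix is not centered.

```python
import math


def matvec(A, x):
    """Compute A @ x for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A.T @ y for a list-of-rows matrix."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def top_singular_triplet(A, iterations=100):
    """Power iteration on A.T @ A; returns (u, sigma, v) for the largest singular value."""
    n = len(A[0])
    v = [1.0 / math.sqrt(n)] * n  # start from a normalized all-ones vector
    for _ in range(iterations):
        w = rmatvec(A, matvec(A, v))
        length = norm(w)
        if length == 0:
            break
        v = [x / length for x in w]
    Av = matvec(A, v)
    sigma = norm(Av)
    u = [x / sigma for x in Av] if sigma else Av
    return u, sigma, v


# Toy matrix: three cells (rows) by two genes (columns), values made up.
A = [
    [3.0, 0.0],
    [4.0, 0.0],
    [0.0, 1.0],
]
u, sigma, v = top_singular_triplet(A)
scores = [sigma * x for x in u]  # coordinates of each cell on the first component
print(round(sigma, 6), [round(s, 6) for s in scores])
```

The `scores` values are what would be plotted on one axis of the 2D or 3D component plots; repeating the procedure on the residual matrix would give the next component.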
	In this case, we can see there are two big groups in this data set, which is interesting. We also provide a 3D plot to help you further explore
	your data. Sometimes there are insights that you cannot identify in a 2D plot, and a 3D plot can really provide additional value.
	Once we have those SVD components, however many you selected (20 or 30), you can use them to
	perform clustering. In this case we selected hierarchical clustering and found nine clusters in the data set.
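As a sketch of what hierarchical clustering on the reduced coordinates does, the following Python code repeatedly merges the two closest clusters until a requested number remains. It uses average linkage for brevity, which is an assumption; the talk does not say which linkage the workflow uses. The points stand in for cells described by two SVD components and are made up.

```python
import math
from itertools import combinations


def average_linkage(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering: merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every point starts on its own
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]
    return clusters


# Toy cells described by two reduced dimensions (values made up).
cells = [
    (0.0, 0.0), (0.1, 0.2),
    (5.0, 5.0), (5.2, 4.9),
    (10.0, 0.0), (9.8, 0.3),
]
clusters = agglomerate(cells, n_clusters=3)
print(clusters)
```

The sequence of merges is what a dendrogram draws; cutting it at a chosen number of clusters (nine in the demo) gives the cluster labels used in the later tabs. This naive version recomputes all pairwise distances at each step, so it is only suitable for small examples.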
	In addition to the dendrogram, we also offer a constellation plot,
	which I really like because this plot is similar to t-SNE or UMAP. It gives you a better idea of the distance between different groups.
	For example, there are three big groups, three clusters,
	that are kind of distinct from the other groups. If you want to see where they are in the global structure, I select these three clusters. Looking at the 3D plot, we can see
	all the highlighted ones are over here. Then go to 2D again, and this is one of the two big clusters that I mentioned, highlighted. So this interactivity really helps you
	observe and visualize your data in multiple ways. We also provide a parallel plot to help you further identify the different patterns across groups.
	The next tab is embedding, which means t-SNE and UMAP plots, if you have the R packages installed. It will
	call out to R, run the analysis, bring back the data, and visualize it in JMP. Here is a t-SNE plot. We have nine clusters very nicely separated.
	On the bottom is exactly the same plot, but this time colored by the marker genes you provided; we have 14 marker genes. Using this feature switcher, you can click through
	to see where these genes are expressed. For example, there's the GNLY gene, highly expressed in this little cluster, and we wonder what these cells are. We select them and go back, and now we see that most of them are from cluster eight.
	GNLY is a marker gene for NK cells. So now we have an idea of what this group of cells is.
	We also have action buttons here that help you do more things. If you want to switch to UMAP, if you prefer that, you can do it. Now the plots switch to UMAP.
	It's exactly the same thing, but UMAP does give you a little bit better separation, and it can preserve more of the global structure in the visualization. We also provide ways to help you remove cells that might be
	contaminated or have some quality problems. For example, if we don't like a group of cells here, we can remove them from the visualization to make it cleaner, but you can always bring them back.
	Again, we can split the plots with the split graph button. This time, we can split by gender, female and male, and compare the gene expression levels across the gender groups, which is pretty useful sometimes.
	And we can unsplit.
	The next tab provides more visualization tools to look at gene expression levels in the nine
	groups of cells. The first one is called a violin plot. Again, we have a feature switcher to help you go through
	all those genes in the different clusters. Depending on how tall each violin is and what its density is,
	you can clearly see where those genes are highly expressed. For example, again using this gene, GNLY, you can see it's highly expressed in cluster eight.
	In the middle, the second plot we provide is a ridgeline plot. A ridgeline plot puts the
	clusters on the Y axis and the gene expression level on the X axis. It basically provides similar information; it depends on which you like.
	For example, for GNLY, again, we can see that cluster eight has GNLY highly expressed, but not the other clusters.
	At the bottom we have another plot, called a dot plot. This is a new plot we just added to this report. In addition to showing you the gene expression levels, the dot plot can also show you the percentage of cells expressing that gene. For example, take a look at this one, the
	PPBP gene. We can see that 100% of the cells in cluster seven express this PPBP gene. This gene is the marker gene for
	PPBP cells, actually. So now it's very clear that cluster seven is one type of blood cell, which is a PPBP cell. Then take a look at the others, for example, cluster two.
	Only 12% of the cells express this gene. So there might be some contamination, but this group of cells is definitely not PPBP cells. So this plot shows you
	the expression level and the expression percentage in each cluster, which offers additional information in the plot.
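The two quantities a dot plot encodes for one gene can be computed per cluster as follows. This Python sketch is illustrative only: the cluster labels and expression values are made up, and treating any value above zero as "expressing" is an assumption about the cutoff.

```python
def dot_plot_stats(cluster_labels, expression_values):
    """Per cluster: percent of cells with expression > 0 and the mean expression."""
    stats = {}
    for cluster in sorted(set(cluster_labels)):
        values = [v for c, v in zip(cluster_labels, expression_values) if c == cluster]
        expressing = sum(1 for v in values if v > 0)  # assumption: > 0 means expressed
        stats[cluster] = {
            "pct_expressing": 100.0 * expressing / len(values),
            "mean_expression": sum(values) / len(values),
        }
    return stats


# Toy data for a single gene: one cluster label and one expression value per cell.
cluster_labels = [7, 7, 7, 7, 2, 2, 2, 2, 2, 2, 2, 2]
gene_expression = [5, 8, 6, 9, 0, 0, 0, 2, 0, 0, 0, 0]

stats = dot_plot_stats(cluster_labels, gene_expression)
for cluster, row in stats.items():
    print(cluster, row)
```

In a dot plot, `pct_expressing` would typically drive the dot size and `mean_expression` the color, so a marker gene shows up as a large, strongly colored dot in exactly one cluster.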
	The next tab is also very useful. It's called feature screening. What I did was fit a bootstrap forest model, using the genes to predict the clusters. The most important genes, which contributed most to the separation of the cells, are ranked in this table.
	The right way to view these genes is to open this action box, select maybe the top 35 genes you want to visualize, and click OK.
	The next tab will then show you only the 35 genes you selected. Those are the most informative genes. They can explain
	why the different groups of cells are separated. Again, with the feature switcher you can click through and try to see the patterns. And you may notice a lot of genes, such as LYZ,
	CST3, and NKG7, that are already in the marker genes we provided, which means this feature screening
	method is really successful at picking up the most important genes in your data set. Another visualization you can do is through the GTEx database. GTEx is a tissue-specific database
	that tells you which genes are expressed in which tissues in the human body. We can send the gene list directly to the database. You just click OK, and we open the
	GTEx website, which provides a heat map of the top 35 genes. Now you can see where they are expressed in the human tissues and organs, which is a very convenient way to see additional information.
	Now that we've used those marker genes, you can probably identify what these groups of cells are. So there's a
	function here called recode. You open it, and you can recode those numbers into actual cell names. For example, eight we already know is NK cells. I already have names for every single one of them, so I just type them in.
	Those are
	monocytes.
	Two is actually DC cells; three is FCGR3A+ monocytes.
	These are naive CD4+ T cells.
	Group five is memory CD4+ T cells,
	and CD8+ T cells
	for group six; seven is PPBP, as we already saw, and the ninth is B cells.
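Conceptually, the recode step is a lookup from cluster number to cell-type name. The Python sketch below illustrates that idea using the names mentioned in the talk; it is not the JMP recode feature itself, and the list of per-cell cluster numbers is made up. Clusters without an entry in the mapping keep their number.

```python
# Cluster number -> cell-type name, using the names given in the talk.
cell_type_names = {
    2: "DC cells",
    3: "FCGR3A+ monocytes",
    5: "Memory CD4+ T cells",
    6: "CD8+ T cells",
    7: "PPBP",
    8: "NK cells",
    9: "B cells",
}


def recode(cluster_ids, mapping):
    """Replace each cluster number with its name; unmapped clusters keep their number."""
    return [mapping.get(cluster, str(cluster)) for cluster in cluster_ids]


# Hypothetical per-cell cluster assignments.
cluster_ids = [8, 7, 9, 8, 2, 1]
cell_types = recode(cluster_ids, cell_type_names)
print(cell_types)
```

Because every plot reads the same label column, changing the labels once is enough for all connected views to show the new names.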
	With those names entered, we click Recode. Since all the plots and tabs are connected, you can see that all the numbers have changed into actual cell names. This just helps you explore your data more
	easily. You know what those cells are, and you can again do some exploration in your plots, including these clustering plots, where you can see the cluster names have been changed into the actual cell names.
	That's it for today's topic. And if you have any questions, you can send me an email or leave a message on the JMP Community. Thank you so much for your time.