Hello. Today I'm glad to present to you scA2SIGN, our method to define the signatures for deconvolution starting from single-cell data. We will also give a short demonstration of it. This work was done in collaboration with Paul Fogel and Alexey Fedorov.
Before we do the deep dive into our method, let's give a short introduction in order to explain what cell deconvolution is.
Cell type deconvolution is an in silico method that allows us to identify the cell proportions from bulk RNA sequencing data. This method requires specific signature matrices containing the marker genes for each cell type of interest.
Our method proposes to identify this signature matrix starting from single-cell data because in single-cell data, as you can see here, we have the transcriptomic profiles for each single cell present in the tissue in the samples. Each dot here corresponds to a cell. Cells with close transcriptomic profiles are grouped together and form clusters.
We can go to the next slide, please.
Here you have a brief overview of our method. Our method is called scA2SIGN, which stands for single-cell agnostic algorithm for deconvolution. As was mentioned previously, it starts from single-cell data, ideally coming from several subjects, or individuals, in order to cover the biological variability that can be present between individuals.
The single-cell data need to be preprocessed and assembled into one data set. Here the cells with close transcriptomic profiles are assembled into clusters. This two-dimensional representation is called a UMAP.
From this single-cell data, we need to retrieve one matrix for each individual. These matrices are going to contain genes on the rows and clusters on the columns. It means we need to aggregate the data, and the several matrices corresponding to the different subjects together represent a tensor.
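As a rough illustration of this aggregation step (not the add-in's own code), here is a minimal Python sketch that builds one genes x clusters pseudobulk matrix per subject and stacks them into a tensor; the function name and label columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def pseudobulk_tensor(expr, genes, subjects, clusters):
    """expr: cells x genes array; subjects/clusters: one label per cell."""
    df = pd.DataFrame(expr, columns=genes)
    df["subject"], df["cluster"] = subjects, clusters
    all_clusters = sorted(set(clusters))
    mats = []
    for _, sub in df.groupby("subject"):
        # sum counts over the cells of each cluster -> genes x clusters matrix
        mat = (sub.drop(columns="subject")
                  .groupby("cluster").sum()
                  .reindex(all_clusters, fill_value=0).T)
        mats.append(mat.to_numpy())
    # one genes x clusters matrix per subject, stacked into a 3-way tensor
    return np.stack(mats, axis=-1)   # shape: genes x clusters x subjects
```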
If we try to identify the signatures right away at this step, it's going to be difficult because some clusters of the single-cell data are close from a transcriptomic point of view. This can raise a problem of multicollinearity.
In order to deal with this, we run several rounds of non-negative tensor factorization in order to get a reduced tensor and assemble close clusters together, so as to define cell types with distant transcriptomic profiles present in the sample.
Once again, the structure of the data is going to be the same. We are going to show it in the JMP application, with genes on rows, cell types on columns, and several matrices corresponding to the several subjects. From this, our method proposes two simple criteria to define the marker genes. This part is going to be covered by Alexey.
Next, the same genes can be used in order to label the cell types defined by our method based on groups of clusters. Last but not least, we can use non-negative regression in order to perform the deconvolution on bulk RNA sequencing data.
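For readers who want to see what that last step looks like in code, here is a minimal non-negative regression sketch using scipy's NNLS solver; the signature matrix S and bulk vector b are placeholders, and this is not the add-in's implementation.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(S, b):
    """S: genes x cell types signature matrix; b: bulk expression on the same genes.
    Returns estimated cell-type proportions (non-negative, summing to 1)."""
    coefs, _ = nnls(S, b)          # non-negative regression of the bulk sample on the signatures
    total = coefs.sum()
    return coefs / total if total > 0 else coefs

# toy usage, just to show the call:
# S = np.random.rand(173, 5); b = S @ np.array([0.3, 0.2, 0.1, 0.25, 0.15])
# print(deconvolve(S, b))
```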
With this, we are going to talk a little bit more about non-negative tensor factorization in order to explain what it is exactly. Non-negative tensor factorization is a dimensionality reduction technique. As was mentioned, the advantage of this method is that it allows us to take into account the several matrices coming from several subjects and to do it in one step, compared to other dimensionality reduction techniques like ICA, where you need to run several rounds of ICA, one for each matrix, and then find a consensus in the data.
As input of the NTF, we use the matrices previously described. As output of the NTF, we get the components: the fingerprints for the clusters, here represented by W, and the fingerprints for the subjects, represented by Q. These fingerprints carry the genes of the signature. A fingerprint means a set of genes with [inaudible 00:05:19] defined by H.
Each component corresponds to a strong biological signal present in the data. If we take the sum of these components, these strong biological signals present in the data, plus a small error, we can approximate the data, which is represented by this equation.
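To make this concrete, here is a hedged sketch of such a decomposition using the open-source tensorly library on a toy tensor; the shapes and rank are illustrative and this is not the code behind the add-in.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

# toy non-negative tensor with shape (genes, clusters, subjects)
X = np.abs(np.random.rand(200, 17, 11))

weights, factors = non_negative_parafac(tl.tensor(X), rank=17, n_iter_max=200)
H, W, Q = factors   # gene, cluster and subject loadings, one column per component

# The data are approximated by the sum of rank-1 components plus a small error:
# X[i, j, k] is approximately sum over r of weights[r] * H[i, r] * W[j, r] * Q[k, r]
```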
With this, we can go to the next step and see exactly how the data look in the application.
Right now, we are in the JMP application and you can see the input data required for our method. On the left, you can see the eleven matrices coming from different donors. The matrices have different names, and we used the IDs of the donors as their names. In the tables, we have 17 columns corresponding to the 17 clusters identified in the single-cell data. On the rows we have the gene names preceded by the ID of the individual in order to distinguish the transcriptomic data coming from different donors.
Once we have loaded the data into the system, we can launch our add-in and see what it looks like. It looks like a window with several parameters. These parameters are written in Python syntax because the code behind uses the Python language. But you do not need to be afraid, because we are going to change only one important parameter, which is the number of components, here.
We suggest starting with the number of components equal to the number of clusters present in the data, so it's going to be equal to 17. Thank you. With this, we can simply click on the submit button and the application will run the algorithm behind it. Next, I'm going to leave the floor to Paul in order to explain the output of the NTF and to guide you through the next steps of our method. Thank you for your attention.
Thank you, Galina. The output of the add-in consists of three tables with NTF components along each dimension of the tensor: cell clusters, subjects, and genes. First, let us look at the cell cluster components table.
I open the table, and each row in this table corresponds to a cell cluster and each column to a component. There are 17 rows and 17 components, as specified in the add-in input parameters by Galina. Each entry in the table corresponds to the loading of a cell cluster onto a component.
Now, scrolling through the columns of the table, we see that some of the components contain many zeros, like component 16, which is good because it indicates that these components approximate the expression data coming from a few cell clusters with nonzero loadings. Here we have exactly one cell cluster with a one, and all the others are zeros, or almost zero.
The fact that several cell clusters have nonzero loadings on the same component indicates a high degree of collinearity, so regrouping of these clusters is advisable. We have an example here with component 12, where actually three cell clusters have nonzero loadings. Now, let us look at the subject components table.
I open it and, scrolling through the columns of the table, we see now that unlike the cell cluster components table, the columns here tend to be dense, with only a few zeros here and there. This is good because it shows that most subjects contribute to each component. In other words, the expression data approximated by each component, or as we call it, fingerprint, applies to most subjects. In a sense, the fingerprint is universal.
Now, JMP has an excellent feature that allows you to view the histograms of the columns at the top of the table. Let's click on this icon here, and we now see all the histograms. We see that the columns are more or less dense, and it would be good to be able to quantify the degree of sparsity of each column. How do we do that? To quantify the degree of sparsity of the cell cluster and subject components, we use the index represented by this formula.
This formula was introduced by Herfindahl and Hirschman to evaluate the degree of concentration achieved by firms, for example to avoid monopoly situations that can result from mergers. Originally, the inverse formula was published, but for simplicity, we will continue to refer to this index as the HHI index.
Similar indexes have been proposed more recently. They are mathematically equivalent, but this index has the advantage of being very interpretable, as it gives the number of non-negligible entries in the vector, which are highlighted here. In the example, we have a number of values, but only the highlighted values are non-negligible. The HHI index of this vector is five.
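As an illustration, an inverse HHI with this behavior can be computed in a few lines of Python; this follows the standard formula and assumes a non-negative input vector, as NTF loadings are, and is not necessarily the exact formula shown on the slide.

```python
import numpy as np

def hhi(v):
    """Effective number of non-negligible entries of a non-negative vector."""
    v = np.asarray(v, dtype=float)
    s = v.sum()
    if s == 0:
        return 0.0
    p = v / s                      # normalize to shares, as in the original HHI
    return 1.0 / np.sum(p ** 2)    # inverse Herfindahl-Hirschman index

# hhi([1, 1, 1, 1, 1]) -> 5.0 (five equal entries), hhi([1, 0, 0, 0, 0]) -> 1.0
```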
Returning to our workflow, we see how this index allows us to identify components that are sparse along the cell cluster dimension and dense along the subject dimension. We call these components sparse and robust. They are the red crosses here, with a high subject HHI and a low cell cluster HHI, at the top of this plot.
Based on these four components, we can make the necessary groupings of cell clusters to eliminate collinearities, and the remaining cell clusters here... Sorry for this interruption. [inaudible 00:13:22]. I just lost [inaudible 00:13:22], I'll put it here.
The blue crosses here correspond to the remaining cell clusters that will be reintroduced in a second iteration of the analysis. Note well that this is recursive: the components with the strongest signal, after being identified by the NTF, are removed from the data set and the next NTF starts over with the reduced data set.
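To make the recursion explicit, here is a schematic Python sketch of that loop under our reading of the description; it uses fixed HHI cut-offs instead of the normal mixture classification done in JMP, an injected `ntf` function, and the `hhi` helper sketched above, so it is illustrative pseudologic rather than the add-in's implementation.

```python
import numpy as np

def recursive_ntf(tensor, rank, ntf, hhi, cluster_thr=3.0, subject_thr=8.0):
    """tensor: genes x clusters x subjects; ntf(t, rank) must return (H, W, Q)."""
    remaining = list(range(tensor.shape[1]))   # cluster indices not yet grouped
    groupings = []
    while remaining:
        H, W, Q = ntf(tensor[:, remaining, :], min(rank, len(remaining)))
        # keep components that are sparse over clusters and dense over subjects
        kept = [r for r in range(W.shape[1])
                if hhi(W[:, r]) <= cluster_thr and hhi(Q[:, r]) >= subject_thr]
        if not kept:
            break
        removed = set()
        for r in kept:
            members = [remaining[j] for j in np.flatnonzero(W[:, r] > 1e-6)]
            groupings.append(members)          # clusters regrouped by component r
            removed.update(members)
        # start over on the reduced data set with the remaining clusters
        remaining = [c for c in remaining if c not in removed]
    return groupings
```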
To illustrate this process... I'm sorry, but I have a problem here with the... It keeps reappearing.
To illustrate this process, let's now look at how these iterations of the recursive NTF play out. We see here the HHI table from the first iteration, which will help us decide which NTF components to keep and which to discard. As mentioned earlier, good components are those that are specific to only a few cell clusters, which translates into a low cell cluster HHI index. These are the labeled rows here at the bottom of the table.
We also want a high degree of robustness to subject diversity, and this translates into a high subject HHI index. How do we decide whether an HHI value is low or high? We can use the JMP [inaudible 00:15:47] platform, which is available in the specialized modeling platform. We run this platform, we apply a normal mixture model to distinguish between low and high HHI values, and we store the resulting classification.
When we do that, we see here the red crosses corresponding to the components that are selected because they have a high subject HHI index and a low cell cluster HHI index, and they are automatically identified as such by the JMP normal mixture platform.
Let us look now at the loadings of the cell clusters on these particular components. We see on component 15 and component 16 that cell cluster 15 and cell cluster 17 are exclusively associated with these components, so they can be well approximated by these components. In contrast to these cell clusters, on component 13, we see three cell clusters regrouped along this component. This means that they are somehow correlated and it will be hard to distinguish them, so it's better to regroup them. Similarly on component 12, with the cell clusters 5, 2, and 14.
We can now look at a second iteration of the recursive NTF. Similarly, we look at the HHI of the components on the subjects and on the cell clusters, and we see now that only one component, component four, satisfies the requirements for being retained during this iteration. If we look now at the component loadings, we see that cell cluster seven is exclusively associated with this component. Good.
Now that we have done that, we are not going to show you all of the iterations. Actually, there are four iterations and we end up now with these groupings that we see here. We have actually nine cell clusters after regrouping. Some of them are groupings of three or four original cell clusters. However, five cell clusters did not need to be regrouped. What is the biological meaning of these groupings?
By matching the most highly expressed genes in each cell cluster with bioinformatic databases, it is possible to associate the cell clusters with known biological cell types. For example, the NTF grouping 2-5-14 was identified as a natural killer CD8 cell type.
We found that there were several groupings that actually pointed to the same major cell type. Therefore, we decided to merge them and came up with five main cell types. However, it is still possible to extract signatures for [inaudible 00:20:12] cell types that can be used in a second stage of the deconvolution. But we will come back to this point at the end of this presentation.
Let us now look at the resulting reduced tensor. I'm just opening the table. You see here the table, and as expected, there are now five rows corresponding to the regrouped cell clusters.
Now that we have introduced cluster grouping to minimize collinearity, I will have Alexey explain how we proceed to create the signature genes that are most effective for deconvolution.
Once the cell clusters are grouped, we still need to create gene signatures for each of the groupings to ensure accurate deconvolution of the mixed samples. So far, we have considered the cell cluster and subject component vectors. Let us now consider the gene component vectors.
The table we see here is not really informative. We can only see zeros everywhere, due to the fact that only a small minority of genes actually have nonzero loadings on the NTF components. If you can go to the next iteration, please, Paul.
Let us now examine the loadings of the genes on the components which we've called sparse and robust.
These were selected during the first iteration of the recursive NTF. We see here in red that some of the distinctive genes are shared by all components. We know well that each component is specific to a particular cell cluster or a group of close cell clusters. This means that some of the genes are actually strongly expressed in several clusters, with some degree of collinearity between the cell clusters.
This is exactly how nature works. Biological parts are not necessarily orthogonal to each other. Therefore, we cannot rely solely on the NTF gene components to identify gene markers that are unique to a cell cluster or a group of clusters. For this, we must return to the actual expression data.
We are getting back to the expression data and we will introduce two criteria. The first one: for a gene to be considered a marker gene of a particular cell cluster, we say that it should always be more expressed in that cluster than in the other clusters, regardless of which subject is considered. This is shown in the figure at the bottom left, where we have depicted two genes.
With some [inaudible 00:23:20], we see that the points on the left correspond to a unique marker gene for B cells. We can see only B cells everywhere. On the right, a non-marker gene is highly expressed in B cells, non-classical monocytes and natural killer cells.
The second criterion will be that the associated cell cluster should not only be on top of the other clusters when we look at the expression of the marker gene over [inaudible 00:23:55] subjects and cell types, but it should also be well ahead of the others.
For this, we use this score, with the formula shown on the right, which was proposed by [inaudible 00:24:09] in their original paper. This score reflects the specificity of a gene marker. It is based solely on the distribution of gene expression across subjects and cell types. Thus, applying these two criteria, we obtained a list of 173 marker genes that can subsequently be used for the deconvolution.
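As a rough illustration of these two criteria, here is a small Python sketch on a long-format expression table; the column names are assumptions, and the "well ahead" margin in the second check is a simple stand-in for the published specificity score, which is not reproduced here.

```python
import pandas as pd

def marker_genes(df, margin=2.0):
    """df columns: gene, subject, cell_type, expression (one value per combination)."""
    markers = []
    for gene, g in df.groupby("gene"):
        wide = g.pivot(index="subject", columns="cell_type", values="expression")
        top = wide.mean().idxmax()                       # candidate cell type
        others = wide.drop(columns=top)
        # criterion 1: the candidate cell type wins in every single subject
        always_top = (wide[top] > others.max(axis=1)).all()
        # criterion 2 (stand-in): the candidate is well ahead of the runner-up on average
        well_ahead = wide[top].mean() >= margin * others.mean().max()
        if always_top and well_ahead:
            markers.append((gene, top))
    return pd.DataFrame(markers, columns=["gene", "cell_type"])
```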
In this table, they are listed together with the values that meet the two criteria I've talked about.
Let us now look at the table of signature genes for each of the subjects and major cell types. Here we see the list of genes, and the columns correspond to each of the subjects, with the same list of cell types for each subject. In contrast to the table of NTF loadings that we showed previously, this table is not sparse, as it is limited to the 173 marker genes that we selected according to our criteria with [inaudible 00:25:26] gene. As I said, the columns are for subject cell types.
We will now visualize how the signature genes are expressed across subjects. For that, we will use the second plug-in we developed which is called adrubix. This will allow us to easily assess the robustness of the gene signatures we got.
We open a small window with a list of parameters in Python syntax, as for the first plug-in. The default values have already been filled in, and for demonstration we will now only fill in the value for the parameter called scale_along.
Since the expression ranges differ significantly between genes, we will set this parameter to zero to have all the columns scaled, and we submit. Just a few seconds, and we see a picture with distinct signatures for each major cell type. However, we see that this heat map is glued together in between subjects, and with this plug-in we can actually very easily add separators in between the columns for each of the subjects.
For this, we change the value of the parameter metadata_cols_sep to subject and we resubmit. Now we get separate heat maps for each of the subjects. This is much clearer. As you've seen, this small window contains numerous parameters; you can refer to the documentation for the underlying Python package, which is available by clicking on this link.
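For readers without the plug-in, a rough equivalent of these two settings (column scaling and per-subject separators) can be put together with standard Python libraries; this is only an approximation of what adrubix does, with hypothetical input names.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import zscore

def signature_heatmap(data, subjects):
    """data: genes x (subject, cell type) DataFrame; subjects: one label per column."""
    scaled = data.apply(zscore, axis=0)              # scale each column, like scale_along
    ax = sns.heatmap(scaled, cmap="vlag", xticklabels=False, yticklabels=False)
    # draw a vertical separator wherever the subject label changes between columns
    breaks = np.flatnonzero(np.diff(pd.factorize(subjects)[0])) + 1
    for b in breaks:
        ax.axvline(b, color="black", linewidth=1)
    plt.show()
```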
To conclude, we see here that the signature genes are very robust. We see very little difference between the different subjects. In this way, we can very easily visually confirm the robustness of [inaudible 00:28:09].
Now that we are nearing the end of our presentation, I will let Galina present some deconvolution results that were obtained with the gene signature [inaudible 00:28:22]. Galina.
Yes, thank you. Last but not least, we would like to show you the performance of our method. To evaluate the performance of our method, we use the published data coming from the White paper, mentioned in this reference. This data set contains 21 different cell mixtures. These are mixtures of different T cell subtypes, NK cells, monocytes and B cells.
In these mixtures, we also have three different cell types not covered by our signatures, corresponding to colorectal or breast cancer cells and two cell lines, like fibroblasts and the tail cells.
We decided to use two criteria to evaluate the overall performance of our method: the Pearson correlation coefficient and the [inaudible 00:29:18] concordance coefficient, because we can get a very significant correlation between estimated and ground-truth proportions but still have a change in the scale identifying the quantity of cells present in the sample. That is why we use two criteria.
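Assuming the concordance coefficient mentioned here is Lin's concordance correlation coefficient (CCC), both criteria can be computed as in the short sketch below; this is a generic illustration, not the exact evaluation code used for the results.

```python
import numpy as np

def pearson_and_ccc(estimated, truth):
    """Pearson correlation and Lin's CCC between estimated and ground-truth proportions."""
    x, y = np.asarray(estimated, float), np.asarray(truth, float)
    r = np.corrcoef(x, y)[0, 1]
    # CCC penalizes shifts in scale and location, not only lack of correlation
    ccc = (2 * np.cov(x, y, bias=True)[0, 1]
           / (x.var() + y.var() + (x.mean() - y.mean()) ** 2))
    return r, ccc
```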
Overall, we obtain good results for cell types covered by our signatures. The results are slightly less good for T cells because we don't have an exact correspondence between the T cell subsets present in the mixtures and the T cell group present in our signatures. We also have effector cells in our signatures, and this cell type was not added to the mixtures.
Last but not least, we would also like to mention that with our method you can go further and identify the subtypes of T cells and monocytes, by running the recursive NTF again and defining the marker genes within each big group of immune cells. Also, the add-ins that have been presented and demonstrated today are available in the JMP application.
With this, thank you for your attention.