Superposition Magic - How can you identify several clusters that are at the same place at the same time?
If physicists can have their superposition magic, then so can statisticians. Suppose that you have lots of points across three variables in three groups. You need to count the number of points in each group. But you don't know which group each point comes from. And, by the way, each group has the same mean across all the variables. Try inventing an approach to do that before you read the rest of this blog entry.
Grouping points is usually the job of cluster analysis. The computer scientists have a more colorful name for this: unsupervised learning. It's easy. You just cluster the data so that points that are near each other form clusters, assign each point to the closest cluster, count the points in each cluster, and you are done.
But these clustering methods never allow the clusters to overlap, much less have the same centers. So ordinary clustering just won't do this job.
Before we solve the problem, we have to be a little more specific about the data and generate a problem data set. The data will be multivariate normal, and although every cluster will have the same mean, we will distinguish the clusters by giving each one a different covariance structure. To keep it simple, we will generate uncorrelated multivariate normal data, with each of the three clusters having a larger variance in one component direction that is unique to that cluster.
So we generate a table in which every variable has a variance of 1 in every cluster, except that each cluster has one component with a variance of 4 instead of 1. The result is that each group sticks out in a different direction, though they all have the same center. Here is the JSL code that I used to make some data:
JSL script to generate data
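For readers without JMP, the same recipe can be sketched in Python. This is not a translation of the JSL script; the sample size, the seed, the zero means, and every name in it are assumptions made for illustration. Only the variances (4 on one axis, 1 elsewhere) and the 35/35/30 split between the groups come from this post.

```python
import random

# Assumed settings (not taken from the original JSL script).
N_POINTS = 3000
PROPORTIONS = [0.35, 0.35, 0.30]  # groups with the large X, Y, and Z variance
BIG_SD = 2.0                      # standard deviation 2 gives a variance of 4
SMALL_SD = 1.0                    # variance of 1 on the other two axes


def generate_points(n, proportions, seed=1):
    """Return (group, [x, y, z]) pairs; every group is centered at zero."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Pick the hidden group label according to the mixing proportions.
        group = rng.choices(range(3), weights=proportions)[0]
        # Uncorrelated normals: wide on the group's own axis, narrow elsewhere.
        point = [
            rng.gauss(0.0, BIG_SD if axis == group else SMALL_SD)
            for axis in range(3)
        ]
        rows.append((group, point))
    return rows


rows = generate_points(N_POINTS, PROPORTIONS)
counts = [sum(1 for group, _ in rows if group == k) for k in range(3)]
print(counts)  # the true group counts that the analysis has to recover
```

The group label is kept here only so the answer can be checked later. The puzzle is to estimate those counts from the three coordinates alone.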
And here is a picture of the multivariate normal density contours that result from doing this:
Normal ellipsoid contours for 3 groups
Notice that the groups with the large X and Y variances each have 35% of the data, and the group with the large Z variance has only 30%. If the method estimates the proportions well, then the problem is solved. We don't need to identify each point. We just need to estimate the proportions, or equivalently, the numbers in each group.
Here is the secret: Instead of doing the usual hierarchical or k-means clustering, we fit normal mixtures. That is, we fit the means, variances, covariances, and relative proportions of each group so as to maximize the likelihood of the data. This was implemented in JMP by Chris Gotwalt. Here are the results.
Normal Mixtures Results
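To make the idea of fitting normal mixtures concrete, here is a minimal sketch of one common way to maximize a mixture likelihood, the EM algorithm, in plain Python. It is not the JMP implementation, and the post does not say which algorithm JMP uses. As a simplification it fits variances only (no covariances), it generates its own data with an assumed sample size and seed, and it starts each group slightly wider on a different axis to break the symmetry. All function names are made up for this example.

```python
import math
import random


def make_data(n=2000, weights=(0.35, 0.35, 0.30), seed=7):
    """Zero-mean groups; group k has variance 4 on axis k and 1 elsewhere."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        k = rng.choices(range(3), weights=weights)[0]
        data.append([rng.gauss(0.0, 2.0 if d == k else 1.0) for d in range(3)])
    return data


def log_density(x, mean, var):
    """Log density of an uncorrelated (diagonal covariance) normal."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def fit_mixture(data, n_groups=3, n_iter=60):
    """Run a fixed number of EM iterations; returns proportions, means, variances."""
    dims = len(data[0])
    props = [1.0 / n_groups] * n_groups
    means = [[0.0] * dims for _ in range(n_groups)]
    # Assumed symmetry-breaking start: group k begins a bit wider on axis k.
    variances = [[2.0 if d == k else 1.0 for d in range(dims)]
                 for k in range(n_groups)]
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each group.
        resp = []
        for x in data:
            logs = [math.log(props[k]) + log_density(x, means[k], variances[k])
                    for k in range(n_groups)]
            top = max(logs)  # subtract the max before exp for stability
            w = [math.exp(v - top) for v in logs]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: membership-weighted proportions, means, and variances.
        for k in range(n_groups):
            nk = sum(r[k] for r in resp)
            props[k] = nk / len(data)
            for d in range(dims):
                means[k][d] = sum(r[k] * x[d] for r, x in zip(resp, data)) / nk
                variances[k][d] = sum(
                    r[k] * (x[d] - means[k][d]) ** 2 for r, x in zip(resp, data)
                ) / nk
    return props, means, variances


props, means, variances = fit_mixture(make_data())
print([round(p, 3) for p in props])
```

No point is ever assigned to a single group during the fit. Each point contributes fractionally to every group, which is what lets the groups sit on top of each other. The numbers this sketch prints come from its own simulated data and will not match the JMP results reported in this post. A serious fit would also try several starting values and check convergence rather than stop after a fixed number of iterations.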
Notice how well we did. The proportions are .343, .351, and .308 for the three groups, very close to the proportions used to generate the data, .35, .35, .30. And the means and standard deviations and correlations are close, too. Problem solved. Here is a picture of the data with the points colored by the most-probable group membership.
Points colored for most-probable group
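The coloring by most-probable group can be sketched the same way. Given the mixing proportions and each group's variances, Bayes' rule gives the probability that a point came from each group, and the color is the group with the largest probability. The sketch below plugs in the generating values (zero means, variance 4 on one axis, proportions .35, .35, .30) instead of fitted estimates, and its function names are hypothetical.

```python
import math

# Generating values from this post; the zero means are an assumption.
PROPORTIONS = [0.35, 0.35, 0.30]
VARIANCES = [
    [4.0, 1.0, 1.0],  # group 0: large X variance
    [1.0, 4.0, 1.0],  # group 1: large Y variance
    [1.0, 1.0, 4.0],  # group 2: large Z variance
]


def membership_probabilities(point):
    """Posterior probability of each group for one (x, y, z) point."""
    scores = []
    for prop, var in zip(PROPORTIONS, VARIANCES):
        # Log density of an uncorrelated, zero-mean normal at this point.
        log_dens = sum(
            -0.5 * (math.log(2 * math.pi * v) + x * x / v)
            for x, v in zip(point, var)
        )
        scores.append(prop * math.exp(log_dens))
    total = sum(scores)
    return [s / total for s in scores]


def most_probable_group(point):
    """Index of the group with the highest membership probability."""
    probs = membership_probabilities(point)
    return probs.index(max(probs))


for point in [(3.0, 0.0, 0.0), (0.0, 0.0, 3.0), (0.0, 0.0, 0.0)]:
    probs = [round(p, 3) for p in membership_probabilities(point)]
    print(point, most_probable_group(point), probs)
```

A point far out along one axis is claimed almost entirely by the group that is wide in that direction. At the shared center all three densities are equal, so the membership probabilities there are just the mixing proportions. That is why the coloring is only a best guess for each point, while the estimated proportions can still be accurate.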
Is this magic useful? It turns out that there is a type of counting that is very important in measuring the infection density in HIV cases. There is a special kind of white blood cell, a helper T cell, that expresses a protein called CD4. HIV infection is measured, in part, by how many of these cells are present in a blood sample relative to other leucocytes.
It turns out that you can make different types of white blood cells identify themselves by tagging their binding sites with different fluorescent dyes. Then you send the sample through an instrument called a flow cytometer, which makes a tiny jet of fluid droplets containing the cells. Several lasers of different wavelengths are shot through the droplets, and the fluorescence of each droplet is measured. The result is a huge data set, maybe half a million rows on 12 or so intensity measurements at different wavelengths, with one row for each cell.
The different cells form clusters that overlap, and you need to count the cells in each cluster. The current practice is to drag polygons over the clusters by hand, arbitrarily dividing the groups by eye. Using normal mixtures to do the counting would give a much more objective method and improve the reproducibility of the results. But there are lots of stray points that don't belong to any cluster. No problem. Chris Gotwalt's routines include a (Huber) robust method of handling outliers. Doing this in 12 dimensions for half a million points is currently pretty expensive, so we are looking for ideas on how to speed this up.