Big real data is different from big simulated data: Benchmarking
Nov 5, 2013 10:48 AM
To benchmark computer performance on statistical methods with big data, we can just generate random data and measure performance on that, right? Well, it could be that simulated data may not act the same as real data. Let’s find out.
Suppose that we are benchmarking logistic regression. So we create some regressors as random uniform and create a categorical response that is affected by the regressors. We use a million rows. Here is an example of a script to do that:
On my machine, this runs in 6.53 seconds. Not bad for a million rows.
I wanted to transfer this data to another system to compare it, but it was a nuisance to have all those decimal places in the random uniform numbers. The benchmark should work just as well if I rounded them to one decimal place, so I changed my data generator script by adding a Round function.
Now I run the same logistic regression with everything the same except for the rounding, and it takes 1 second. Yes, it runs six times as fast by using rounded data!
Does the machine have to work harder doing arithmetic with more decimal places? Of course not. It doesn’t even use decimal arithmetic. It is converted to binary, and the bits extend just as far as with the unrounded data.
So what accounts for the mystery?
Well, it turns out that with only 11 possible values for each X, since there are three X’s, there are only 1,331 unique combinations for the three variables. The Logistic platform actually figures out all unique combinations in the data. And when there are lots of matches, it is able to combine observations and just iterate through the unique combinations. So to logistic, most of the computing was done on 1,331 combinations instead of all 1 million rows. Thus, only a second instead of 6.53 seconds.
So what is the lesson for using Logistic? Even if you have a lot of data, it is the number of unique combinations of values that counts more than the number of rows.
And what is the lesson for using simulated data? If you expect a small number of levels, even if the variable is continuous, you should simulate with that small number of unique values, instead of a continuous random distribution.
Now with a small number of regressors, it is easy to take advantage of the unique combinations. But as the number of regressors increases, it is less and less likely that you can collapse the data into a small number of unique combinations.
Other Fitting Platforms
Do the other fitting platforms do this grouping? The Least Squares personality looks for unique combinations too, but not to speed up the calculations. It needs to find them for the lack of fit test. But it turns out that if you have tens of millions of rows, it can spend seconds getting this lack of fit test, and the test is not usually interesting for large cases anyway. So then, to speed things up, when the number of rows is 20,000 or more, it changes the default for lack of fit to be off, so that it doesn’t spend all that time doing something that isn’t usually wanted.
What about data in other cases, such as decision trees? Does it make a difference if the simulated data is unique or rounded? Yes, it does. When it is sorting by those X values, it takes advantage of ties by putting them off to the side so that they don’t need to be compared again -- so the fewer values you have, the faster the sort is.
What about clustering? Does it matter how the data is randomly generated for measuring speed?
I decided to generate four sets of simulated data to see if the details make a difference. I started with pure, independent, normal, random data: 10,000 rows by 50 columns. The time to cluster was 28.83 seconds.
Now, in the real world, data is not independent. It is correlated. So the next run was the same kind of data, but transformed to be highly correlated; each column had a correlation of .95 with the others. The time to cluster that was only 4.18 seconds -- more than six times faster!
Why would that be the case? It turns out that with those correlations, if you take principal components, 95% of the variation is in the first principal component. So if you were looking at distances between points -- and short-cut the distance loop when you got more than the distance you were comparing against -- then most of the time, it only has to look at the first element in order to know that some distance is not the smallest. In hierarchical cluster, we always do a singular-value decomposition (SVD) of the data before we cluster it so that the algorithm runs faster. The SVD is generally a lot cheaper than the hierarchical clustering, so this is a price usually worth paying.
But the real world has data in clumps, otherwise we would not be doing clustering. Does it go faster if the data is in natural clusters rather than spread more evenly throughout space? We generated a set of uncorrelated but clumpy data, where there were 100 well-separated clumps with 100 points in each clump. The time to cluster that was only 2.83 seconds. When the data is naturally clustered, the search for near neighbors is faster, because it organizes the space better.
What if you combine the clumpiness and high correlations? In that case, the time gets down to 0.98 seconds, 29 times faster than when we started, even though the data was of the same size and marginal distribution.
Now what if you performed these same benchmarks in a different clustering program? The classic hierarchical clustering algorithms are slow, always taking something proportional to n^3*p time. But with JMP’s use of SVD and a very fast near-neighbor version of Ward’s method, it is much faster than the classic method -- though it depends on the data how fast, while the classic method is much more consistent and, of course, much slower.
// Correlated Clumped Random 100 clumps, 100 per clump
data = DirectProduct(J(100,50,20*RandomUniform()-10),J(100,1,1))+.1*J(10000,50,RandomNormal());
dtClumpedCorr = AsTable(data*Cholesky(theCorr)`); data = 0;
startTime = TickSeconds();
timeClumpedCorr = TickSeconds()-startTime;
What about other places in JMP? Does it matter what the characteristics of the data are?
Yes, it does. Generally, things go faster with categories if there are not too many of them. If there are fewer than 5,000 levels of a category, it uses associative array methods to create categories and also takes advantage of multithreading. However, if there are more than 5,000 levels, and especially if there are millions of levels, the memory allocation burden gets expensive, and it uses sorting methods instead.
If you need to benchmark the performance of a method using simulated data, it matters a lot what properties you simulate with -- it is best to try to duplicate the properties of the data that you want to apply it to eventually.
In short, it depends; the details matter.
JMP tries pretty hard to take advantage of typical properties of real data.
Note: This is part of a Big Statistics series of blog posts by John Sall. Read all of his Big Statistics posts.