Note: JMP 11 launched last week. Today, we begin a series on Tuesdays of Big Statistics blog posts by John Sall about what has to change when you have Big Data, with an emphasis on screening.
Data preparation is a big part of an analyst’s job, and when you have big data, it often becomes a bigger part of the job.
I downloaded three years of birth records data from the CDC website, and despite having 12.8 million rows, it was fairly fast to make a graph of the birth weight of the baby by the age of the mother.
Among 12.8 million, you are going to encounter a few pretty heavy babies, say 8 kilograms (17.6 pounds). But notice that after a gap, there is line of 10-kilogram babies at the top. Babies really don’t come that big. If you click on one of these points and use a tooltip to see the details, you will see a value of 9999. That is obviously not a real weight, but rather a missing value code.
Before fitting anything, you will have to either recode these values to missing, or add a Missing Value Code property to the column properties so that the platforms will recognize it as a missing value code.
If you have more than just a few codes to fix, then there are a few more steps. The usual way is to select the points graphically, go back to the data table with the right column and rows selected, and then do the Delete command in the Edit menu, turning them into missing values. That works great when you have only a few columns to fix. But when you have to fix many columns, that is too tedious to do that way.
The semiconductor industry has data with hundreds or thousands of columns. The values are all collected by testing machines. Machines won’t give a bad value, will they? Look at the distribution of this machine-obtained column. Most of the values are below 0.5, but there are missing value codes of 9999. If you take care of the missing value codes, you learn the real distribution.
Below is another machine-collected measurement. Almost all of the values are less than 0.5, but there are values that are much larger – they are some kind of error codes from the measurement system. Since these codes are 11 orders of magnitude more than the real data, they are going to play havoc with any analysis even if you use robust methods, so set them to missing.
If you remove the large codes, you then see negative codes that were not apparent before. These are only three orders of magnitude different from the real data, so they were hidden until you removed the other outliers. Get rid of these, and you start to see the distribution. There are still outliers, but they look like real values, not just error codes. So let’s leave them alone and remember to consider robust methods later.
If you have this kind of data to analyze routinely, then you will want the cleaning to happen fairly automatically. The main characteristic here is that the error codes are very far away from the real values; you could use 20 times the interquartile range (IQR) and still find them all.
So I wrote a script that collected all the unique values that were 20 IQRs from the upper quartile, and similarly from the lower quartile. That captured most of the rogue values, but there emerged another problem. In a number of cases, most of the data was the same value, and the IQR was zero, as shown below. You can’t declare rogues when they are barely different from more than half the data.
So you have to go outward until you find some quantile range that is non-empty. Then you have a yardstick to measure by. I now use the smallest value greater than the upper quartile and the largest value smaller than the lower quartile, and this handles the cases where the data is bunched into constants in the middle.
The need here is for automation: You need to have scripts that automatically look for rogue values, report them, and if asked, repair them. Without automation, you are confined to repeating all the steps for all the columns in each table you get.
So here is my script, in parts.
First, I write the code that will be called from the dialog to loop through the variables selected from the dialog. It locates the column, gets all the values into a vector. Then it finds an extended quantile range. Then it locates all the values that are 20 quantile ranges outside the quantiles and are also integers. I know that all the error codes are integers. Then I accumulate the information about the “funny” values. If the Fix option is on, I replace the column values with missing.
cleanupDlg = NewWindow("Outlier Cleanup",
TextBox("Look for outliers to clean up.", Set Font Size(15),Set Wrap(400))
PanelBox("Cast Selected Columns into Roles",VListBox(
chkbox1=CheckBox("Replace with missing"),
chkbox2=CheckBox("Look for non-integer outliers too"),
HListBox(TextBox("Quantile tail: "),edtbox1=TextEditBox(Char(tail))),
HListBox(TextBox("Number of quantiles ranges beyond quantile:"),
ButtonBox("Columns to Check", colListCheck<
colListCheck = ColListBox(width(lbWidth),nLines(30))
checkCols = colListCheck<
nCheckCol = nItems(checkCols);
isFix = chkbox1<
isNonInt = chkbox2<
tail = Num(edtbox1<
reach = Num(edtbox2<
if (nCheckCol>=0, okAction; cleanupDlg << CloseWindow;);
colListData << Set Data Type( "Numeric" );
After testing the script, I make it into an add-in to use whenever I get a new batch of data. I get an output window showing for each variable what the extended quantile values were and which values were 20 ranges away from them.
Big data is usually dirty data, and when there is a lot to look at, you need to look for tools to make it much easier to clean up the data, getting it ready to analyze. The script I use in this blog post -- the Clean Up Error Codes add-in -- is now in the JMP File Exchange, ready to download if you need to make your data cleanup easier.