Subscribe Bookmark



May 27, 2014

Big data = Dirty data

Note: JMP 11 launched last week. Today, we begin a series on Tuesdays of Big Statistics blog posts by John Sall about what has to change when you have Big Data, with an emphasis on screening.


Data preparation is a big part of an analyst’s job, and when you have big data, it often becomes a bigger part of the job.


I downloaded three years of birth records data from the CDC website, and despite having 12.8 million rows, it was fairly fast to make a graph of the birth weight of the baby by the age of the mother.



Among 12.8 million, you are going to encounter a few pretty heavy babies, say 8 kilograms (17.6 pounds). But notice that after a gap, there is line of 10-kilogram babies at the top. Babies really don’t come that big. If you click on one of these points and use a tooltip to see the details, you will see a value of 9999. That is obviously not a real weight, but rather a missing value code.


Before fitting anything, you will have to either recode these values to missing, or add a Missing Value Code property to the column properties so that the platforms will recognize it as a missing value code.


If you have more than just a few codes to fix, then there are a few more steps. The usual way is to select the points graphically, go back to the data table with the right column and rows selected, and then do the Delete command in the Edit menu, turning them into missing values. That works great when you have only a few columns to fix. But when you have to fix many columns, that is too tedious to do that way.


The semiconductor industry has data with hundreds or thousands of columns. The values are all collected by testing machines. Machines won’t give a bad value, will they? Look at the distribution of this machine-obtained column. Most of the values are below 0.5, but there are missing value codes of 9999. If you take care of the missing value codes, you learn the real distribution.











































Below is another machine-collected measurement. Almost all of the values are less than 0.5, but there are values that are much larger – they are some kind of error codes from the measurement system. Since these codes are 11 orders of magnitude more than the real data, they are going to play havoc with any analysis even if you use robust methods, so set them to missing.























If you remove the large codes, you then see negative codes that were not apparent before. These are only three orders of magnitude different from the real data, so they were hidden until you removed the other outliers. Get rid of these, and you start to see the distribution. There are still outliers, but they look like real values, not just error codes. So let’s leave them alone and remember to consider robust methods later.























If you have this kind of data to analyze routinely, then you will want the cleaning to happen fairly automatically. The main characteristic here is that the error codes are very far away from the real values; you could use 20 times the interquartile range (IQR) and still find them all.


So I wrote a script that collected all the unique values that were 20 IQRs from the upper quartile, and similarly from the lower quartile. That captured most of the rogue values, but there emerged another problem. In a number of cases, most of the data was the same value, and the IQR was zero, as shown below. You can’t declare rogues when they are barely different from more than half the data.



So you have to go outward until you find some quantile range that is non-empty. Then you have a yardstick to measure by. I now use the smallest value greater than the upper quartile and the largest value smaller than the lower quartile, and this handles the cases where the data is bunched into constants in the middle.


The need here is for automation: You need to have scripts that automatically look for rogue values, report them, and if asked, repair them. Without automation, you are confined to repeating all the steps for all the columns in each table you get.


So here is my script, in parts.


First, I write the code that will be called from the dialog to loop through the variables selected from the dialog. It locates the column, gets all the values into a vector. Then it finds an extended quantile range. Then it locates all the values that are 20 quantile ranges outside the quantiles and are also integers. I know that all the error codes are integers. Then I accumulate the information about the “funny” values. If the Fix option is on, I replace the column values with missing.



okAction = expr(
text1 = "Quantile Tail=" || Char(tail,5) || " Reach=" || Char(reach,5);
For( icol = 1, icol <= nCheckCol, icol++,
col = Column( checkCols[icol] );
data = col<
// Guarantee a non-degenerate IQR by going just outside it
QUpper = Minimum(data[Loc(data >Quantile(1-tail,vec))]);
QLower = Maximum(data[Loc(data
LimitUpper = QUpper+reach*(QUpper-QLower);
LimitLower = QLower-reach*(QUpper-QLower);
funnyRows = Loc(((vec>LimitUpper)|(vec
nFix = nrow(funnyRows);
if (nFix==0,continue());
text1 ||= "\!n\!nColumn: "||(col<
||"\!n Lower Quantile="||char(QLower,7) || " Lower Threshold=" || char(LimitLower)
||"\!n Upper Quantile="||char(QUpper,7) || " Upper Threshold=" || char(LimitUpper)
||"\!n Outlier Values= ";
text1 ||= char(data[funnyRows]);
if (isFix,for(i=1,i<=nfix,i++,col[funnyRows]=.;));
NewWindow("Cleanup Items",ScriptBox(text1,"TEXT",600,600));


The rest of the code is setup and dialog code:



cleanupDlg = NewWindow("Outlier Cleanup",
TextBox("Look for outliers to clean up.", Set Font Size(15),Set Wrap(400))
PanelBox("Select Columns",
PanelBox("Cast Selected Columns into Roles",VListBox(
chkbox1=CheckBox("Replace with missing"),
chkbox2=CheckBox("Look for non-integer outliers too"),
HListBox(TextBox("Quantile tail: "),edtbox1=TextEditBox(Char(tail))),
HListBox(TextBox("Number of quantiles ranges beyond quantile:"),
ButtonBox("Columns to Check", colListCheck<
colListCheck = ColListBox(width(lbWidth),nLines(30))
checkCols = colListCheck<
nCheckCol = nItems(checkCols);
isFix = chkbox1<
isNonInt = chkbox2<
tail = Num(edtbox1<
reach = Num(edtbox2<
if (nCheckCol>=0, okAction; cleanupDlg << CloseWindow;);
ButtonBox("Remove", colListCheck<
colListData << Set Data Type( "Numeric" );

After testing the script, I make it into an add-in to use whenever I get a new batch of data. I get an output window showing for each variable what the extended quantile values were and which values were 20 ranges away from them.



Lower Quantile=-0.0321 Lower Threshold=-1.98795621097088
Upper Quantile=0.06569 Upper Threshold=2.02154020965099
Outlier Values= [9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999]
Column: PS_RPNBR
Lower Quantile=5462.67 Lower Threshold=-10461.529296875
Upper Quantile=6258.88 Upper Threshold=22183.0791015625
Outlier Values= [421034, 422334, 423613, 423085, 425256, 420978, 422327, 423525, 423021, 425233]
Column: RCON_PEM
Lower Quantile=0.162 Lower Threshold=-1.39799988269806
Upper Quantile=0.24 Upper Threshold=1.79999987781048
Outlier Values= [22539399168, 22751399936, 27257200640, 26248300544, 28104099840, 565313024]


Big data is usually dirty data, and when there is a lot to look at, you need to look for tools to make it much easier to clean up the data, getting it ready to analyze. The script I use in this blog post -- the Clean Up Error Codes add-in -- is now in the JMP File Exchange, ready to download if you need to make your data cleanup easier.

Community Member

Mike Clayton wrote:

Agree that error codes are often integers, so for continuous data and good gage capability, those integers can be filtered out, and explained by test engineering or other teams. Bad gage capability can give phony "integers" in certain ranges of values.

However, sometimes all integers are given false extra digits by the test file generation process, or the file conversion process, So detectable error codes have to be created to survive all those binary to analog conversions, and file parsing methods, to remain detectable as far integers.

Next, of course, there are marginal devices that are also many IQR multiples outside the rest of part distribution, and if tolerance limits are too loose [due to limits set by designers rather than set statistically after pilot production and customer feedback], those "maverick devices" must be identified and contained and that can be part of the data "cleanup" or the next step in analysis.

Good idea to keep all the outliers in data sheet, "excluded" temporarily, for team validation, by product, for those batches that have large number of such outliers.

Thanks for the script to catch far outlier integers.

Community Member

Five Links: Data Stuff | HR Examiner wrote:

[...] Big Data is Dirty Data This week in the HRExaminer, weâ ve started looking at the work associated with cleaning data streams so that they can be used to assemble an aggregate picture. This short piece talks about another aspect of the problem. Each data source tends to generate inline error messages. In order to speot them in the flow, you have to know what they are. Otherwise, you get really strange results. [...]

Article Tags