For anyone doing data analysis, it’s common to devote a large proportion of time to data cleaning. Real data is dirty data. Even if you are using automated collection tools, errors, anomalies and inconsistencies will still make their way into your data.
Over the next few blog posts, I’d like to show you some simple techniques for improving data integrity. I’ll talk about what to look for, and how to do it using JMP. In many cases, JMP has all the capabilities you need to perform the task. In a few cases, I’ve put together scripts to help with the cleaning. I’ll make these scripts available in the JMP User Community.
To put data cleaning into context, consider three general places where problems might occur. The first place you may run into issues is while building your data set. Missing delimiters, mismatched column names, and out-of-sync time stamps will make accessing, concatenating and joining data difficult. Once the data is in a JMP data set, you will run into a different set of problems: Incorrect data or modeling types, columns with no values, error codes and misspelled values are common. Finally, in the analysis process, you may find that you need to remove outliers, transform columns or aggregate variable levels.
Let’s start in the middle, the second area mentioned above. You’ve created your data set, but you need to clean it up before analysis. I suspect that this is where you’ll encounter most of your problems. To begin, we’ll focus on removing unneeded columns and getting the data and modeling types correct. Much of this can be accomplished with just the Columns Viewer.
To illustrate, I’ll use data I downloaded from the US Environmental Protection Agency’s website for automobile fuel economy. Seven separate Excel worksheets (one for each year in the data set) were downloaded and concatenated. You can find the data in the JMP User Community.
The first thing we can do is remove variables with little to no information. Columns where all values are the same (or all missing) are ideal candidates. Using the Columns Viewer (under the Cols menu):
Any columns you select in the summary table will be simultaneously selected in the parent data table. This makes many column operations easier, by allowing you to preselect columns with particular properties before performing the operations.
Notice that the top three columns have all missing values and can be removed. Once you’ve selected them, Cols > Delete Columns will remove them from the data table. Note that the deleted columns will not be removed from the summary table, so it may make sense to rerun Show Summary (remember to reselect all the columns). You can tidy up the report by removing the old summary table (look under the hot spot next to the Summary Statistics title).
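In JMP, the Columns Viewer and Cols > Delete Columns handle this interactively. For readers who prefer to script the same check outside JMP, here is a minimal pandas sketch of the idea; the column names are made up for illustration, not taken from the EPA data:

```python
import pandas as pd

# Toy stand-in for the concatenated fuel-economy table.
# "Empty Col" plays the role of a column whose values are all missing.
df = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "Empty Col": [None, None, None],   # entirely missing
    "City FE": [22.0, 30.0, None],
})

# Drop columns where every value is missing.
cleaned = df.dropna(axis=1, how="all")
print(list(cleaned.columns))  # ['Model', 'City FE']
```

Like deleting columns in JMP, this removes only columns that are missing in every row; partially missing columns are kept for the closer look described next.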
The next 10 columns have only one value. The decision to keep them depends on the proportion of missing values they have and whether you think their identification provides additional information. Two of the columns, Suppressed? and Police/Emerg?, have no missing values (i.e., all the rows are identical), so they can be safely removed. Seven of the remaining eight columns are mostly missing (>99.997%) and may be good candidates for elimination. The remaining column, Guzzler?, has about 5% non-missing values and could be worth keeping. (I'll assume none of the columns are part of a variable that has been split across multiple columns. I'll talk about finding and fixing this situation in a later post.)
In the single-value examples above, the majority of the data was missing. The same reasoning works in reverse: a column made up of a single value with few or no missing cells carries no information and can be removed. But if a single-valued column has a large enough proportion of missing values, you may consider keeping it. Where those missing values are might be telling you something.
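The rule of thumb above (constant and complete: drop; constant but substantially missing: review) can be sketched in pandas for readers who script. The column names echo the EPA examples, but the values and missing proportions here are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Police/Emerg?": ["N"] * 20,                   # one value, nothing missing
    "Guzzler?": (["G", None, None, None] * 5),     # one value, mostly missing
})

for col in df.columns:
    n_unique = df[col].nunique(dropna=True)   # distinct non-missing values
    frac_missing = df[col].isna().mean()      # proportion of missing cells
    if n_unique <= 1 and frac_missing == 0:
        print(f"{col}: constant, safe to drop")
    elif n_unique <= 1:
        print(f"{col}: constant but {frac_missing:.0%} missing; review first")
```

A constant, fully populated column is flagged for deletion, while a constant column with heavy missingness is flagged for review, since the pattern of missingness itself may be informative.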
We can also remove multi-category columns with a high degree of missingness. To do this, right-click and sort the summary table by N Missing. Don’t check the Ascending box this time, since you only want to consider columns whose values are largely missing. Because these columns have more than one category, you may also want to examine them in more detail. To do this, select them in the summary table and click the Distribution button. As before, you’ll have to decide what proportion of nonmissing values justifies retaining the column.
When you are done with the categorical data, you can perform the same operations on the continuous data. In the filter pop-up dialog, select Continuous and deselect Ordinal and Nominal (remember that holding the Alt key before clicking the hot spot lets you make multiple selections). After displaying the summary, sort ascending by Std Dev (it's on the far right). Columns where Std Dev equals zero have only a single value. There should be only one (Comb CO2 Rounded Adjusted). Then sort by N Missing to find those columns whose values are mostly missing.
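The zero-standard-deviation trick translates directly to script. As a hedged pandas sketch (only the Comb CO2 Rounded Adjusted name comes from the data set; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Comb CO2 Rounded Adjusted": [411.0] * 5,          # constant column
    "City FE": [18.0, 22.0, 30.0, 25.0, 19.0],         # varying column
})

# A numeric column with zero standard deviation holds a single value.
numeric = df.select_dtypes("number")
constant_cols = numeric.columns[numeric.std() == 0]
print(list(constant_cols))  # ['Comb CO2 Rounded Adjusted']
```

Sorting the Columns Viewer summary by Std Dev accomplishes the same thing visually: the single-valued columns float to the top of an ascending sort.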
Once you’re done removing unneeded columns, you can move on to checking the data and modeling types. When JMP initially reads in data from other sources, columns containing only numbers are given the Numeric data type and Continuous modeling type. Otherwise, they are assigned the Character data type and Nominal modeling type. Occasionally, JMP will set a column to Numeric even when it has a few character values. If this happens, you will get a warning message like the one in Figure 3.
While the defaults in JMP work well in most cases, there are some situations you should know about. First, if a numeric column contains extraneous text ("NA"s, error codes, symbols flagging unusual observations, etc.), the column may be read as Character/Nominal. Second, numeric values may be codes rather than measurements; in these situations, you'll want to set the modeling type to Nominal. Finally, some measured categorical values should be treated as Ordinal. Fortunately, there are relatively few instances where modeling with Nominal data differs from modeling with Ordinal data.
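The second situation, numeric codes masquerading as measurements, has a one-line analogue outside JMP. In pandas, the counterpart of switching a column's modeling type to Nominal is converting its dtype to categorical (the "Drive Code" column here is a hypothetical example, not a column from the EPA data):

```python
import pandas as pd

# 1/2/4 here are codes for drive configurations, not quantities.
df = pd.DataFrame({"Drive Code": [1, 2, 1, 4, 2]})

# Treat the codes as categories so downstream analysis doesn't
# average or regress on them as if they were measurements.
df["Drive Code"] = df["Drive Code"].astype("category")
print(df["Drive Code"].dtype)  # category
```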
We can keep working in the Columns Viewer: our next task will be finding numeric columns that are really categorical. (Our other tasks will have to wait for the next post.)
To locate these columns:
As with the summary table in the report window, the resulting table is linked to the data table from which it was created. Because you’re using a Data View, you may need to deselect any columns that may have been previously selected in the parent data table. To find categorical variables, look for those columns that have a small number of categories and/or have a Min value close to 0.
While this approach to finding categorical numeric data is far from perfect, it is a quick and easy start. You can perform a simple check by selecting rows in the summary data table, returning to the parent table, and then running Distribution on the selected columns. If you do this with columns having N Categories of six or fewer, you'll notice that a few of the columns contain values with decimals. While it would have been nice to exclude these columns from the start, Columns Viewer doesn't currently provide for this, so we'll do it with a bit of scripting.
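The decimal check we just did by eye is easy to script. A hedged pandas sketch of the combined heuristic, few distinct levels and integer-only values (column names and values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Air Aspiration Cd": [1, 2, 1, 3, 2, 1],          # coded: a few integers
    "Comb FE": [24.1, 27.5, 21.9, 30.2, 25.0, 23.3],  # measured: has decimals
})

for col in df.select_dtypes("number"):
    vals = df[col].dropna()
    few_levels = vals.nunique() <= 6        # small number of categories
    all_integers = (vals % 1 == 0).all()    # no decimal parts
    if few_levels and all_integers:
        print(f"{col} looks categorical")   # prints only 'Air Aspiration Cd'
```

Columns with six or fewer distinct values that also contain decimals, like Comb FE above, are screened out automatically rather than surfacing as false positives.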
We’ll talk about that and more next time.