Jun 23, 2011

The Sum of the Parts Is More Than the Whole . . .

In an earlier post, I mentioned a new JMP add-in that can be used to split data prior to predictive modeling. This post explains how the add-in works and so necessarily touches on the JMP Scripting Language, or JSL for short.

This is a big topic, so here I want to just show the anatomy of a JSL script that does something somewhat useful. This post assumes that you have downloaded and installed the add-in and are familiar with what it does. (Note: Downloading the add-in requires a free SAS login.)

If you select the menu item ‘View > Add-Ins…’ in JMP 9, you should see ‘Stratified Split’ listed in the combo box. Choosing this item gives you a link to the ‘Home Folder’ of the add-in; selecting this link and double-clicking on ‘stratifiedSplit.jsl’ opens the source code in JMP’s editor. To follow along, be sure you have opted to display line numbers in the editor (‘File > Preferences’, and then select ‘Script Editor’ as the ‘Preference Group’).

The JSL can be deconstructed as follows:

(1) Lines 11-13: The default proportions into which the data is to be split. Note that a suffix of ‘0’, ‘1’ or ‘2’ in a name denotes something to do with ‘Training’, ‘Validation’ or ‘Test’ data (mirroring the values generated in the random indicator column that JMP makes).

(2) Lines 15-29: A function that is called if you update one of the default proportions, which checks that your new value does indeed lie between 0 and 1.

(3) Lines 31-55: The user interface -- This presents all the columns in the current data table and allows you to select the response column. You can also override the default split proportions.

(4) Lines 58-85: Shows help in a new window if the ‘Help’ button in (3) is pressed.

(5) Lines 88-104: Runs when the ‘OK’ button in (3) is pressed and checks if the three proportions actually specified still sum to unity. If not, (3) is displayed again with the default split proportions.

(6) Lines 107-135: Checks that a response column was actually selected and stops if not. If a response column was selected, finds the number of distinct levels in that column, and if this is greater than 100, gives the option to stop.

(7) Lines 138-182: Using a mapping of which rows of the current table correspond to which level of the response column (6) and the specified split proportions (3), builds the new validation column. Then uses Distribution to show how the rows in the table were randomly split.
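
As a rough illustration of the two checks described in (2) and (5), here is a minimal sketch in Python rather than the add-in's JSL. The function names, the default proportions of 0.6/0.2/0.2, the inclusive bounds and the tolerance are all assumptions made for illustration. The add-in redisplays its dialog with the defaults, whereas this sketch simply returns them.

```python
import math

# Hypothetical defaults for (Training, Validation, Test); the add-in's
# actual default values are not stated in this post.
DEFAULT_PROPORTIONS = (0.6, 0.2, 0.2)


def is_valid_proportion(p):
    """Check (2): a single proportion must lie between 0 and 1.

    Both end points are accepted here, which is an assumption.
    """
    return 0.0 <= p <= 1.0


def proportions_or_defaults(props, defaults=DEFAULT_PROPORTIONS, tol=1e-9):
    """Check (5): keep the three proportions only if they sum to one.

    Otherwise fall back to the defaults, standing in for the add-in
    showing its dialog again.
    """
    in_range = all(is_valid_proportion(p) for p in props)
    if in_range and math.isclose(sum(props), 1.0, abs_tol=tol):
        return tuple(props)
    return defaults


print(proportions_or_defaults((0.5, 0.25, 0.25)))  # accepted as given
print(proportions_or_defaults((0.5, 0.5, 0.5)))    # rejected, defaults returned
```

Keeping the per-value check separate from the sum check mirrors the two-part error-checking in the script: each function stays simple, and the sum is tested only once, when the user confirms.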

For those who are already familiar with JSL, the code here is unremarkable. At a higher level, though, there are always design decisions to be made, and compromise is usually required when one bears in mind that the final result has to be "just good enough" for its intended use.

In this case, for example, a specific choice was made to separate the error-checking into two parts -- (2) and (5) above -- to simplify the logic behind the user interface. Even if you are new to JSL, with a little patience and the help of the JSL Guide (‘Help > Books > Scripting Guide’), you should be able to figure out the details of how each section of code actually works. After all, one of the best ways to start to learn a new language is to start from something that works. For reasons of space, we will only look here at (6) and (7).

As mentioned already, one of the key parts in this problem is to find out about the levels in the response column (6). Although there are many ways to do this, the associative array is the most versatile.

The code fragment shown below makes a table to play with. Line 5 gets the 20 values of ‘Response’ into a vector, and line 6 builds an associative array and sends it a ‘Get Keys’ message to reveal the distinct values in lexicographic ordering. One advantage of the associative array is that it works with any data type (not just numbers, as in this case), but note that ‘Get Keys’ always returns a list. So in line 7, we need to use ‘N Items()’ to find out how many items this list contains.
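
For readers who do not have JMP to hand, the same idea can be sketched in Python. This is an analog of the approach and not the JSL fragment referred to above: a Python ‘set’ stands in for the associative array, ‘sorted()’ for the ordered result of ‘Get Keys’, and ‘len()’ for ‘N Items()’. The function name and the sample values are made up for illustration; the names ‘y_lev’ and ‘n_levs’ echo the script's ‘yLev’ and ‘nLevs’.

```python
def distinct_levels(values):
    """Return the distinct values in sorted order, and how many there are."""
    levels = sorted(set(values))
    return levels, len(levels)


# Hypothetical response values, not the table used in the post.
y_vals = [3, 1, 2, 3, 1, 1, 2, 3, 2, 1]
y_lev, n_levs = distinct_levels(y_vals)
print(y_lev, n_levs)  # prints [1, 2, 3] 3

# The same function works unchanged for character data.
print(distinct_levels(["low", "high", "low"]))  # prints (['high', 'low'], 2)
```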

Once we have values for ‘yLev’ and ‘nLevs’ from (6), the steps in (7) are as follows:

1. Build a vector of missing values, ‘vVals’, that will later be used to populate the new validation column (line 141).

2. Loop over the distinct levels held in ‘yLev’ (lines 143 to 161).

3. Locate the row numbers in ‘yVals’ (and therefore the data table) that correspond to the current level and store these in the variable ‘stratum’ (line 145). Note here that the syntax to do this depends on whether ‘yVals’ is a vector or a list (and therefore on whether the original response column had a data type of ‘Numeric’ or ‘Character’).

4. The variable ‘stratumN’ holds the number of rows that correspond to the current level (line 147). Convert the specified split proportions to integers, and make sure these sum to ‘stratumN’ (lines 149 to 154).

5. Build vectors ‘v0’, ‘v1’ and ‘v2’ of the correct lengths and containing the values ‘0’, ‘1’ and ‘2’ respectively (lines 157 to 159). Note that to ensure that the subsequent step always works, we have to allow for the possibility that one of the proportions is 0.0 and set the corresponding vector to ‘[]’.

6. Stack these vectors end to end (using ‘VConCat()’), and shuffle the result so that the ‘0’, ‘1’ and ‘2’ values are randomly distributed. Assign these values to the correct rows in ‘vVals’ so that the missing values therein are replaced (line 160).

7. After the loop is finished, use ‘vVals’ to build a new validation column or replace an existing validation column (lines 163 to 170). Note that we use a ‘Value Label’ column property so that the integer values of ‘0’, ‘1’ and ‘2’ appear as ‘Training’, ‘Validation’ and ‘Test’ respectively.

8. Use ‘Distribution’ with a ‘By’ variable to show the values of the validation column for each level in the response.
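
The steps above can be sketched in Python as follows. This is an illustration of the logic and not a translation of the add-in's JSL: the function name, the ‘seed’ argument and the sample data are hypothetical. In particular, the way proportions are turned into whole-number counts (round down, then hand any leftover rows to the groups with the largest fractional parts) is an assumption, since the post only says that the counts must sum to ‘stratumN’.

```python
import random
from collections import Counter

# Labels taken from step 7: 0, 1 and 2 display as these names.
VALUE_LABELS = {0: "Training", 1: "Validation", 2: "Test"}


def stratified_split(y_vals, props, seed=None):
    """Assign 0/1/2 to each row so every response level is split by props."""
    rng = random.Random(seed)
    v_vals = [None] * len(y_vals)                       # step 1: all missing
    levels = sorted({y for y in y_vals if y is not None})
    for level in levels:                                # step 2
        # Step 3: rows belonging to this level. A Python list handles
        # numeric and character data alike.
        stratum = [i for i, y in enumerate(y_vals) if y == level]
        stratum_n = len(stratum)                        # step 4
        # Assumed rounding rule: floor, then distribute the remainder.
        raw = [p * stratum_n for p in props]
        counts = [int(r) for r in raw]
        leftover = stratum_n - sum(counts)
        by_fraction = sorted(range(3), key=lambda k: raw[k] - counts[k],
                             reverse=True)
        for k in by_fraction[:leftover]:
            counts[k] += 1
        # Steps 5 and 6: a zero count simply contributes an empty list.
        labels = [0] * counts[0] + [1] * counts[1] + [2] * counts[2]
        rng.shuffle(labels)
        for row, label in zip(stratum, labels):
            v_vals[row] = label
    return v_vals                # rows with a missing response stay missing


# Hypothetical data: two levels plus one missing response.
y = ["a"] * 10 + ["b"] * 5 + [None]
v = stratified_split(y, (0.6, 0.2, 0.2), seed=1)

# Rough stand-in for step 8: tally the split within each response level.
for level in ("a", "b"):
    tally = Counter(VALUE_LABELS[x] for x, yy in zip(v, y) if yy == level)
    print(level, dict(tally))
```

Because the shuffle happens within each stratum, every level of the response keeps approximately the requested proportions, which is the point of a stratified rather than a simple random split.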

Although routine, there are perhaps a few non-obvious aspects to the code:

• We did not make any unnecessary assumptions about the data type or modeling type of the response column you select (the code would work perfectly well, for example, with a ‘Numeric’, ‘Continuous’ column that has values that happen to be ‘clumped’ into levels).

• Missing values in the response column propagate as missing values to the validation column that is built, which is desirable. But note that, in the case of a Numeric column, the ‘.’ value will cause a benign JMP Alert when the Distribution platform attempts to process the ‘By’ group with the ‘.’ values. For a Character column, the null string “” is processed just like the other levels in the ‘By’ group with no alert.

• The ‘ColListBox()’ expression in line 42 does not present Row State columns for selection even if they exist in the table, which is desirable in this case.

If you use JMP and like to tinker with code, then there are many ways to learn more. You could start by looking at this File Exchange entry.