Re: How to make a validation column in regular JMP using JSL?

shasheminassab · Jun 9, 2023 12:50 PM

Hi

I think there is a "Make Validation Column" function in JMP Pro, but I am wondering how to make a validation column (with 25% training and 75% validation) using JSL in regular JMP. Any help is appreciated.

Dan_Obermiller · Jun 18, 2021 7:05 AM

It is pretty simple to do interactively. The JSL steps should be straightforward using this approach, too.

Create a column called Validation.

Fill the entire column with 0s. A zero will indicate a training set observation.

Go to Tables > Subset.

Enter a Random-Sampling Rate of 0.75 (for your 75% validation set).

Check the box for Link to Original Data Table.

Click OK.

In the subset table, change a validation field from 0 to 1.

Right-click the 1 and Fill to end of Table.

Close the subset table.

I was pretty loose with my JSL (didn't bother with proper scoping to avoid potential issues), but it should give you a pretty good start.

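dt = Current Data Table();

// Add the Validation column and mark every row as training (0).
dt << New Column( "Validation",
	Numeric,
	"Nominal",
	Format( "Best", 12 )
);
For Each Row( :Validation = 0 );

// Draw a random 75% subset that is linked to the original table.
dt << Subset(
	Output Table( "ValData" ),
	Linked,
	Suppress formula evaluation( 0 ),
	Sampling Rate( 0.75 ),
	Selected columns only( 0 )
);
// The subset is now the current table; mark its rows as validation (1).
For Each Row( :Validation = 1 );

// Close the subset without a save prompt, then clear the selection in the original.
Close( Data Table( "ValData" ), No Save );

dt << Clear Select;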

Note that this is going strictly with a random assignment. Many times you really should stratify the validation by the target variable.
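
To illustrate the stratification point, here is a minimal sketch rather than a tested solution. It assumes the target is a character column named Target (a made-up name; substitute your own) and builds a shuffled vector of 0s and 1s separately within each target level, so each level ends up with roughly 25% of its rows flagged as validation.

```jsl
// Sketch only: stratified split, roughly 75% training / 25% validation per level.
// Assumption: the table has a character column named Target (hypothetical name).
Names Default To Here( 1 );
dt = Current Data Table();
valFraction = 0.25; // share of each target level assigned to validation

valCol = dt << New Column( "Validation", Numeric, "Nominal" );

// One entry per distinct level of Target (Summarize returns them as strings).
Summarize( dt, levels = By( :Target ) );

For( i = 1, i <= N Items( levels ), i++,
	// Row numbers that belong to this level.
	rows = dt << Get Rows Where( :Target == levels[i] );
	n = N Rows( rows );
	// Shuffled 0/1 flags for this level: mostly 0s (training), the rest 1s (validation).
	flags = Random Shuffle( (1 :: n)` > (1 - valFraction) * n );
	For( k = 1, k <= n, k++,
		valCol[rows[k]] = flags[k]
	);
);
```

For a numeric target, the level comparison would need adjusting, because Summarize returns the By levels as strings.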

My approach was something I had kept hidden away for several years. Brady has two better approaches down below.

Dan Obermiller

brady_brady · Jun 17, 2021 2:54 PM

Hi,

Subject to Dan's caveats regarding random assignment, this will do it (in this case, for 75% training data).

dt << new column ("Validation", nominal, <<set values(randomshuffle( (1::nrow(dt))` > 0.75*nrow(dt))));

Why this works:

1) The (1::nrow(dt))` piece creates a row vector [1 2 3 ... nrow(dt)], and transposes it (using the ` operator) into a column vector [1, 2, 3, ... nrow(dt)], which is the shape a data table column needs.

2) Then, this column vector is compared to 0.75*nrow(dt). If greater, assign 1, if not, assign 0. So, suppose we have nrow(dt) = 100. Then the original vector is:

[1, 2, 3, ... , 74, 75, 76, 77, ... 100]. After the comparison with 75, the result vector is:

[0, 0, 0, ... , 0, 0, 1, 1, ... 1]. That is, 75 0s followed by 25 1s.

3) Randomshuffle ( ) puts the contents of a vector into random order... so the 75 0s and 25 1s (still using a 100-row table as an example) will be encountered in random fashion.

4) Finally, the << set values message fills the column with this random assortment of 75 0s and 25 1s.
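
If it helps, the four steps can also be written out one per line. This is just an unrolled sketch of the same one-liner; the variable names (trainFraction, idx, flags) are only for illustration, and the Show( ) at the end is an optional check of the counts.

```jsl
// Sketch: the one-liner above, split into its four steps.
Names Default To Here( 1 );
dt = Current Data Table();
trainFraction = 0.75; // illustrative name: share of rows kept for training
n = N Rows( dt );

idx = (1 :: n)`;                  // step 1: column vector [1, 2, ..., n]
flags = idx > trainFraction * n;  // step 2: 0 for the first 75%, 1 for the rest
flags = Random Shuffle( flags );  // step 3: put the 0s and 1s in random order

// Step 4: fill the new column with the shuffled flags.
dt << New Column( "Validation", Numeric, "Nominal", Set Values( flags ) );

// Optional check: number of validation rows (1s) and training rows (0s).
nValidation = Sum( flags );
nTraining = n - nValidation;
Show( nValidation, nTraining );
```

Changing trainFraction is then the only edit needed for a different split.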

FWIW, another way to do this interactively is to select Cols > New Columns... from the main menu, then fill out the dialog as below:

Cheers,

Brady