altug_bayram
Level IV

Random subset selection for a given distribution

Hello ... 

 

I am trying to randomly select rows from a column of data so that the selection follows a given normal distribution. The given distribution is different from that of the data.

Is there an easy way of accomplishing this with Table Subset's current options? I could not see anything that would help.

 

I also tried a formula, basically NormalDensity(X, mu, sigma), to plot the densities of the "data" against the given distribution.

But I still don't know how to specify a uniform, random selection based on this formula so that my outcome is as close as possible to the given distribution.

Thx. Appreciate the support.

22 REPLIES
altug_bayram
Level IV

Re: Random subset selection for a given distribution

Ian, 

This is brilliant. I had thought of a similar approach but was unable to implement it because of my basic scripting knowledge.

I was able to adapt it to my data and am getting what I want. I may perhaps expand this to 2-parameter distribution selection which I should be able to do from your script. 

 

thanks for your support.

altug_bayram
Level IV

Re: Random subset selection for a given distribution

Ian, 

Forgot to ask one question. Does your script select each row of the original table at most once, or is there a chance that the same row ends up in the sub-table more than once?

thx

ian_jmp
Staff

Re: Random subset selection for a given distribution

Yes, I wondered about this. Right now the same row can be selected more than once, but if you don’t want duplicates, just add an extra line that sets the element of the vector ‘bv’ that has been found to missing (‘.’).
altug_bayram
Level IV

Re: Random subset selection for a given distribution

Ian

I was aiming to get ~46K rows of sub-sample data, of which 44K turned out to be duplicates.

Could you please write that line for "bv" that you suggested?

 

thx again. 

ian_jmp
Staff

Re: Random subset selection for a given distribution

I’m sorry - I won’t have access to a computer for a few days due to a vacation. But I’m sure someone else can do it . . .
Jeff_Perkinson
Community Manager

Re: Random subset selection for a given distribution

The approach that @ian_jmp found was the first thing I tried, but apparently I abandoned it too early, as Ian found a way to overcome some shortcomings that I ran into. Nice job.

 

As for eliminating resampling of rows, I believe that Ian was suggesting adding the line:

 

	bv[Loc Min( dist )] = .;

to the for() loop. Thusly:

 

Names Default To Here( 1 );

// Make a big table
dt1 = New Table( "Big",
	New Column( "Values", Numeric, Continuous, Formula( Random Uniform( -3, 3 ) ) ),
	New Column("Row", Formula(Row()))
);
dt1 << addRows( 40000 );
dt1 << runFormulas;
Column( dt1, "Values" ) << deleteFormula;

// Make a small table
dt2 = New Table( "Small",
	New Column( "Values", Numeric, Continuous, Formula( Random Normal( 0, 1 ) ) )
);
dt2 << addRows( 40 );
dt2 << runFormulas;
Column( dt2, "Values" ) << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column( dt1, "Values" ) << getValues;
sv = Column( dt2, "Values" ) << getValues;
selectedRows = J( N Row( sv ), 1, . );
For( i = 1, i <= N Row( sv ), i++,
	dist = (bv - sv[i]) ^ 2;
	selectedRows[i] = Loc Min( dist );
	//the line below prevents rows from the "Big" table from being re-sampled
	bv[Loc Min( dist )] = .;
);

// Subset dt1
dt3 = dt1 << subset( rows( selectedRows ), LinkToOriginalDataTable( 1 ) );

// Look at the distributions
New Window( "Compare Distributions",
	H List Box(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
	)
);
-Jeff
altug_bayram
Level IV

Re: Random subset selection for a given distribution

Jeff 

Your modification on Ian's original solution worked perfectly. I can now extract random data w/ a prescribed distribution. It would be nice to see this as an option under Table Subsetting ...

My problem is 100% solved. 

 

thanks for your support.

altug_bayram
Level IV

Re: Random subset selection for a given distribution

Jeff, 

Thanks for your effort. 

Your suggestion had a slight chance. I applied it to my data and, unfortunately, since the selection is random, it did not work. I ran it for about 3000 iterations and stopped it after 10 hours. The movement in the mean of the data was insignificant.

 

This isn't a simple technique (unless your name is Ian). I would imagine some sort of optimizer being involved, trying to pull randomly from the direction that makes more sense. GAs may be overkill, but they can certainly do this kind of optimization.

 

From a statistical perspective, this is perfectly OK for what I do. The 400+K represents a large data set of assets. I am trying to find assets that are similar to a sub-group under investigation.

 

thanks again.

Re: Random subset selection for a given distribution

You have got to be kidding me.  Getting a random sample from another column, and showing it to you in a separate column, is a menu selection in Minitab.  The answer to this took two pages of discussion and dozens of lines of code.  Why are simple things so difficult in JMP?

Jeff_Perkinson
Community Manager

Re: Random subset selection for a given distribution

The original request here was not for simple random sampling of another column.

 

If that's all you need, you can use the Col Shuffle() function to generate a new column containing a random selection from another.

 

[Attached screenshot: 2020-01-17_09-59-27.336.png]
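
A minimal sketch of the Col Shuffle() approach mentioned above, assuming a hypothetical table "Shuffle Example" with a numeric column "Values" (both names are illustrative and not from the original post). Used as a subscript, Col Shuffle() returns each row number exactly once in random order, so the new column should be a random permutation of the original values. This is only an illustration and not necessarily the formula shown in the screenshot.

```jsl
Names Default To Here( 1 );

// Hypothetical example table; the table and column names are illustrative only
dt = New Table( "Shuffle Example",
	New Column( "Values", Numeric, Continuous, Formula( Random Normal( 0, 1 ) ) )
);
dt << addRows( 10 );
dt << runFormulas;
Column( dt, "Values" ) << deleteFormula;

// Each row of "Shuffled" takes the value of "Values" from a randomly chosen row;
// Col Shuffle() uses every row number exactly once, so no value is duplicated
dt << New Column( "Shuffled", Numeric, Continuous, Formula( :Values[Col Shuffle()] ) );
```

If the shuffled order should stay fixed, the formula can be deleted from "Shuffled" once it has been evaluated, in the same way as is done for "Values" here.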

-Jeff