topic Re: Random subset selection for a given distribution in Discussions

Random subset selection for a given distribution

altug_bayram — Mon, 15 Oct 2018 19:33:28 GMT

Hello ...

I am trying to randomly select from a column of data for a given normal distribution . The given distribution is different than that of the data.

Is there an easy way of accomplishing this with Table Subset's current options ? I could not see anything that could help me.

I also tried a formula, basically NormalDensity (X, mu, sigma) to plot the densities of the "data" against the given distribution.

But I still don't know how to say uniformly & randomly select from this formula so my outcome is as close as possible to the given distribution.

Thx. Appreciate the support.

Re: Random subset selection for a given distribution

Peter_Bartell — Mon, 15 Oct 2018 19:44:42 GMT

Once you have your population column from which you want to sample, try out, this workflow:

1. Select the column from which you want to sample.

2. From the JMP main menu bar, select Rows -> Row Selection -> Select Randomly -> and then your sample size or sampling rate.

This will return selected columns in your original data table, and from there Row Subset will give you a data table.

Re: Random subset selection for a given distribution

altug_bayram — Mon, 15 Oct 2018 19:57:22 GMT

Peter .. thanks for your suggestion.
but that does not seem to work.

My original data's mean/stdev are: 12816 and 1726 from a very large sample (400K +).
I am trying to downselect a sub-sample that is closer to 14300 and 2050

I tried your suggestion on the formula (i.e. densities of the data wrt 14300, 2050) and unless I can force this random dist to be uniformly selected among the densities, I don't think I can get what I need. In fact, sub-sampling came out as expected w/ the original distribution.

Re: Random subset selection for a given distribution

Peter_Bartell — Mon, 15 Oct 2018 20:22:35 GMT

I see...I did not fully understand your original post. Your right my suggestion will just return a distribution with very similar parameters to the original distribution. If you want observations of a random variable with the desired parameter estimates, why can't you just create a column with those parameters and move on. Sure, the specific observations won't necessarily reside in the original column, but that may or may not influence what you're trying to acheive analytically or practically? They'll at least be plausible observations, though not exact matches.

Re: Random subset selection for a given distribution

altug_bayram — Mon, 15 Oct 2018 20:44:03 GMT

My real goal is to compare N (other) parameters from two data sets already available, I am not trying to create a fictitious distribution. (Maybe I am misunderstanding your proposal).

The given distribution is a constraint on an important parameter that may influences others. I am trying to select assets w/ similar distribution from "data" to a given distribution so to minimize the impact of this influential parameter. In this context, "distribution" only applies for a single parameter, and not multiple.

So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution.

Re: Random subset selection for a given distribution

txnelson — Mon, 15 Oct 2018 21:04:43 GMT

I believe that you can get what you want by doing a simple 2 step process.

1. Open the Distribution Platform for the column you have in question. Then

Save==>Level Numbers.

This will create a new column with a distribution taken directly from the actual data

2. Create a subset of the data and specify

Random - sampling rate:

and also

Stratify, based upon the new Level column created in the step above

Re: Random subset selection for a given distribution

altug_bayram — Mon, 15 Oct 2018 22:15:48 GMT

Jim, thanks for your suggestion.

But your proposal does not have a way to force select according for the given distribution. So it wouldn't produce what I am looking for. To validate, I did follow the steps and it produced a dist very similar to the original one.

Maybe I am not explaning in a clear way.

I have 400k+ rows X N columns of data (huge matrix) . If I call one of N params, as "X" for argument sake.

X's distribution from this data - call it X.dist1.

I have a second dataset which has Xagain, and this time w/ X.dist2

X.dist1 is not too far but is different than X.dist2.

Given the fact that I have plenty of data in the first dataset (400k+), I am hence trying to sub-select data set from 1 so that selected data X distribution is the closest I can get to X.dist2.

Hence I was looking for some capabiltiy under Table - Subset where I can specify normal distribution characteristic OR using a formula and then using Table- subset.

for the formula, I put densities of dataset1 against my desired X.dist2.

But now I need a uniform random way of sampling from the formulae outputs so to force the outcome distribution to be ~ X.dist2

OR any any other way ...

If I am to use a Table-Subset - the solution needs to specify a normal distribution from a parameter to pull from.

Or otherwise, I need to find a uniform random selection from a parameter (formula approach).

Re: Random subset selection for a given distribution

txnelson — Mon, 15 Oct 2018 23:56:13 GMT

I am just free thinking here, but maybe what you could try, is to determine the distribution shape, using the Distribution Platform, then transform the data to a normal distribution. Then do the outputing of the Levels and then the subsetting with stratification. Just a thought.

Re: Random subset selection for a given distribution

Jeff_Perkinson — Tue, 16 Oct 2018 21:54:32 GMT

Depending on your purpose getting a subset like this may be statistically questionable but one way to do it is to brute force it by using Resample Freq() to generate random samples and then checking that sample to see if it meets your desired characteristics. Here's a prototype that generates a table to simulate your original data and then resamples it, uses the Distribution platform to test the sample.

dt = New Table( "original",
	Add Rows( 1000 ), 
	//:parameter will have a random normal distribution with mean=0
	//and a stddev=1
	New Column( "parameter",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Random Normal() )
	),
	New Column( "Row", Numeric, "Continuous", Format( "Best", 12 ), Formula( Row() ) ), 
	//The Resample Freq() function generates a frequence count for sample with replacement
	//the .05 argument is the sample rate, i.e. we want a sample size of 5% of the original table
	New Column( "Resample Freq",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Resample Freq( 0.05 ) ), 

	)
);

//Generate a distribution of :parameter using the resample frequency as a Freq column to get
//Automatic Recalc will cause the Distribution to rerun when the Resample Freq column is re-evaluated
//to get a new sample
New Window( "Dist comparison",
	H List Box(
		dt << Distribution( Continuous Distribution( Column( :parameter ) ) ), 	

		dist = dt << Distribution(
			Automatic Recalc( 1 ),
			Freq( :Resample Freq ),
			Continuous Distribution( Column( :parameter ) )
		)
	)
);


//set the minimum mean value we will accept for the subsample
target_mean_min = .249;

//set the minimum mean value we will accept for the subsample
target_mean_max = .251;

//how many iterations are we willing to try before giving up?
max_iter = 1000;

iter = 1;

While( 
	//Report( dist )[Number Col Box( 2 )][1] is the mean value from the Distribution report
	!(target_mean_min <= Report( dist )[Number Col Box( 2 )][1] <= target_mean_max) &
	iter <= max_iter, 
	//get a new sample 
	Column( "Resample Freq" ) << eval formula;
	iter++;
);

If( 
	//check to see if we exited because we hit max_iter or got a good sample 
	iter > max_iter & !(target_mean_min <= Report( dist )[Number Col Box( 2 )][1] <=
	target_mean_max)
,
	Print( "No result in " || Char( max_iter ) || " random samples." ), 
	//got a good sample so get the subset
	dt << select where( :resample freq != 0 );
	dt << Subset( Selected Rows( 1 ), Selected columns only( 0 ) );
);

Re: Random subset selection for a given distribution

ian_jmp — Wed, 17 Oct 2018 10:08:51 GMT

If "So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution" is what you want, maybe you could use this kind of approach, or something close to it?

NamesDefaultToHere(1);

// Make a big table
dt1 = NewTable("Big", NewColumn("Values", Numeric, Continuous, Formula(RandomUniform(-3,3))));
dt1 << addRows(40000);
dt1 << runFormulas;
Column(dt1, "Values") << deleteFormula;

// Make a small table
dt2 = NewTable("Small", NewColumn("Values", Numeric, Continuous, Formula(RandomNormal(0,1))));
dt2 << addRows(40);
dt2 << runFormulas;
Column(dt2, "Values") << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column(dt1, "Values") << getValues;
sv = Column(dt2, "Values") << getValues;
selectedRows = J(NRow(sv), 1, .);
for(i=1, i<=NRow(sv), i++,
	dist = (bv - sv[i])^2;
	selectedRows[i] = LocMin(dist);
);

// Subset dt1
dt3 = dt1 << subset(rows(selectedRows), LinkToOriginalDataTable(1));

// Look at the distributions
NewWindow("Compare Distributions",
	HListBox(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
		)
	);

Re: Random subset selection for a given distribution

altug_bayram — Wed, 17 Oct 2018 17:12:13 GMT

Jeff,

Thanks for your effort.

Your suggestion had a slight chance. I applied it to my data and unfortunately since the selection is random, it did not work. I ran it for about 3000 iterations and stopped it after 10 hours. The movement on the mean of data was insignifcant.

This isn't a simple technique (unless your name is Ian) .. I would imagine some sort of optimizer being involved trying to random pull from the direction that it makes more sense. GA's may be an overkill but certainly can do this kind of optimization.

From statistical perspective, this is perfectly ok for what I do. The 400+K represents a big data of assets.. I am trying to find assets that are similar to a sub-group under investigation.

thanks again.

Re: Random subset selection for a given distribution

altug_bayram — Wed, 17 Oct 2018 17:14:56 GMT

Ian,

This is brilliant. I sort of thought of a similar approach but was unable to do so due to my basic script knowledge.

I was able to adapt it to my data and am getting what I want. I may perhaps expand this to 2-parameter distribution selection which I should be able to do from your script.

thanks for your support.

Re: Random subset selection for a given distribution

altug_bayram — Wed, 17 Oct 2018 17:19:12 GMT

Ian,

Forgot to ask one question. Is your selection done uniquely from original table or does this have a chance to have duplicate selections into the sub-table ?

thx

Re: Random subset selection for a given distribution

ian_jmp — Wed, 17 Oct 2018 17:38:58 GMT

Yes, I wondered about this. Right now it does, but if you don’t want duplicates, just add an extra line that sets the element of the vector ‘bv’ that has been found to missing ‘.’.

Re: Random subset selection for a given distribution

altug_bayram — Wed, 17 Oct 2018 18:25:49 GMT

Ian

I was aiming to get ~46K of sub-sample data, of which 44K have been duplicates.

Could you pls write me that line for "bv" for your suggestion ?

thx again.

Re: Random subset selection for a given distribution

ian_jmp — Wed, 17 Oct 2018 18:33:58 GMT

I’m sorry - I won’t have access to a computer for a few days due to a vacation. But I’m sure someone else can do it . . .

Re: Random subset selection for a given distribution

Jeff_Perkinson — Wed, 17 Oct 2018 21:58:31 GMT

The approach that @ian_jmp found was the first thing I tried, but I abandoned it too early apparently as Ian found a way to overcome some shortcomings that I ran into. Nice job.

As for eliminating resampling of rows, I believe that Ian was suggesting adding the line:

	bv[Loc Min( dist )] = .;

to the for() loop. Thusly:

Names Default To Here( 1 );

// Make a big table
dt1 = New Table( "Big",
	New Column( "Values", Numeric, Continuous, Formula( Random Uniform( -3, 3 ) ) ),
	New Column("Row", Formula(Row()))
);
dt1 << addRows( 40000 );
dt1 << runFormulas;
Column( dt1, "Values" ) << deleteFormula;

// Make a small table
dt2 = New Table( "Small",
	New Column( "Values", Numeric, Continuous, Formula( Random Normal( 0, 1 ) ) )
);
dt2 << addRows( 40 );
dt2 << runFormulas;
Column( dt2, "Values" ) << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column( dt1, "Values" ) << getValues;
sv = Column( dt2, "Values" ) << getValues;
selectedRows = J( N Row( sv ), 1, . );
For( i = 1, i <= N Row( sv ), i++,
	dist = (bv - sv[i]) ^ 2;
	selectedRows[i] = Loc Min( dist );
	//the line below prevents rows from the "Big" table from being re-sampled
	bv[Loc Min( dist )] = .;
);

// Subset dt1
dt3 = dt1 << subset( rows( selectedRows ), LinkToOriginalDataTable( 1 ) );

// Look at the distributions
New Window( "Compare Distributions",
	H List Box(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
	)
);

Re: Random subset selection for a given distribution

altug_bayram — Thu, 18 Oct 2018 01:51:27 GMT

Jeff

Your modification on Ian's original solution worked perfectly. I can now extract random data w/ a prescribed distribution. It would be nice to see this as an option under Table Subsetting ...

My problem is 100% solved.

thanks for your support.

Re: Random subset selection for a given distribution

StevenJLandry — Fri, 17 Jan 2020 13:59:28 GMT

You have got to be kidding me. Getting a random sample from another column, and showing it to you in a separate column, is a menu selection in Minitab. The answer to this took two pages of discussion and dozens of lines of code. Why are simple things so difficult in JMP?

Re: Random subset selection for a given distribution

Jeff_Perkinson — Fri, 17 Jan 2020 15:00:10 GMT

The original request here was not for simple random sampling of another column.

If that's all you need you can use the Col Shuffle() function to generate a new column with random selection from another.