Solved: Re: Random subset selection for a given distribution

altug_bayram · Oct 15, 2018 03:33 PM

Hello ...

I am trying to randomly select from a column of data for a given normal distribution . The given distribution is different than that of the data.

Is there an easy way of accomplishing this with Table Subset's current options ? I could not see anything that could help me.

I also tried a formula, basically NormalDensity (X, mu, sigma) to plot the densities of the "data" against the given distribution.

But I still don't know how to say uniformly & randomly select from this formula so my outcome is as close as possible to the given distribution.

Thx. Appreciate the support.

ian_jmp · Oct 17, 2018 06:08 AM

If "So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution" is what you want, maybe you could use this kind of approach, or something close to it?

NamesDefaultToHere(1);

// Make a big table
dt1 = NewTable("Big", NewColumn("Values", Numeric, Continuous, Formula(RandomUniform(-3,3))));
dt1 << addRows(40000);
dt1 << runFormulas;
Column(dt1, "Values") << deleteFormula;

// Make a small table
dt2 = NewTable("Small", NewColumn("Values", Numeric, Continuous, Formula(RandomNormal(0,1))));
dt2 << addRows(40);
dt2 << runFormulas;
Column(dt2, "Values") << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column(dt1, "Values") << getValues;
sv = Column(dt2, "Values") << getValues;
selectedRows = J(NRow(sv), 1, .);
for(i=1, i<=NRow(sv), i++,
	dist = (bv - sv[i])^2;
	selectedRows[i] = LocMin(dist);
);

// Subset dt1
dt3 = dt1 << subset(rows(selectedRows), LinkToOriginalDataTable(1));

// Look at the distributions
NewWindow("Compare Distributions",
	HListBox(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
		)
	);

View solution in original post

Jeff_Perkinson · Oct 17, 2018 05:58 PM

The approach that @ian_jmp found was the first thing I tried, but I abandoned it too early apparently as Ian found a way to overcome some shortcomings that I ran into. Nice job.

As for eliminating resampling of rows, I believe that Ian was suggesting adding the line:

	bv[Loc Min( dist )] = .;

to the for() loop. Thusly:

Names Default To Here( 1 );

// Make a big table
dt1 = New Table( "Big",
	New Column( "Values", Numeric, Continuous, Formula( Random Uniform( -3, 3 ) ) ),
	New Column("Row", Formula(Row()))
);
dt1 << addRows( 40000 );
dt1 << runFormulas;
Column( dt1, "Values" ) << deleteFormula;

// Make a small table
dt2 = New Table( "Small",
	New Column( "Values", Numeric, Continuous, Formula( Random Normal( 0, 1 ) ) )
);
dt2 << addRows( 40 );
dt2 << runFormulas;
Column( dt2, "Values" ) << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column( dt1, "Values" ) << getValues;
sv = Column( dt2, "Values" ) << getValues;
selectedRows = J( N Row( sv ), 1, . );
For( i = 1, i <= N Row( sv ), i++,
	dist = (bv - sv[i]) ^ 2;
	selectedRows[i] = Loc Min( dist );
	//the line below prevents rows from the "Big" table from being re-sampled
	bv[Loc Min( dist )] = .;
);

// Subset dt1
dt3 = dt1 << subset( rows( selectedRows ), LinkToOriginalDataTable( 1 ) );

// Look at the distributions
New Window( "Compare Distributions",
	H List Box(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
	)
);

-Jeff

View solution in original post

Peter_Bartell · Oct 15, 2018 03:44 PM

Once you have your population column from which you want to sample, try out, this workflow:

1. Select the column from which you want to sample.

2. From the JMP main menu bar, select Rows -> Row Selection -> Select Randomly -> and then your sample size or sampling rate.

This will return selected columns in your original data table, and from there Row Subset will give you a data table.

altug_bayram · Oct 15, 2018 03:57 PM

Peter .. thanks for your suggestion.
but that does not seem to work.

My original data's mean/stdev are: 12816 and 1726 from a very large sample (400K +).
I am trying to downselect a sub-sample that is closer to 14300 and 2050

I tried your suggestion on the formula (i.e. densities of the data wrt 14300, 2050) and unless I can force this random dist to be uniformly selected among the densities, I don't think I can get what I need. In fact, sub-sampling came out as expected w/ the original distribution.

Peter_Bartell · Oct 15, 2018 04:22 PM

I see...I did not fully understand your original post. Your right my suggestion will just return a distribution with very similar parameters to the original distribution. If you want observations of a random variable with the desired parameter estimates, why can't you just create a column with those parameters and move on. Sure, the specific observations won't necessarily reside in the original column, but that may or may not influence what you're trying to acheive analytically or practically? They'll at least be plausible observations, though not exact matches.

altug_bayram · Oct 15, 2018 04:44 PM

My real goal is to compare N (other) parameters from two data sets already available, I am not trying to create a fictitious distribution. (Maybe I am misunderstanding your proposal).

The given distribution is a constraint on an important parameter that may influences others. I am trying to select assets w/ similar distribution from "data" to a given distribution so to minimize the impact of this influential parameter. In this context, "distribution" only applies for a single parameter, and not multiple.

So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution.

txnelson · Oct 15, 2018 05:04 PM

I believe that you can get what you want by doing a simple 2 step process.

1. Open the Distribution Platform for the column you have in question. Then

Save==>Level Numbers.

This will create a new column with a distribution taken directly from the actual data

2. Create a subset of the data and specify

Random - sampling rate:

and also

Stratify, based upon the new Level column created in the step above

Jim

altug_bayram · Oct 15, 2018 06:15 PM

Jim, thanks for your suggestion.

But your proposal does not have a way to force select according for the given distribution. So it wouldn't produce what I am looking for. To validate, I did follow the steps and it produced a dist very similar to the original one.

Maybe I am not explaning in a clear way.

I have 400k+ rows X N columns of data (huge matrix) . If I call one of N params, as "X" for argument sake.

X's distribution from this data - call it X.dist1.

I have a second dataset which has Xagain, and this time w/ X.dist2

X.dist1 is not too far but is different than X.dist2.

Given the fact that I have plenty of data in the first dataset (400k+), I am hence trying to sub-select data set from 1 so that selected data X distribution is the closest I can get to X.dist2.

Hence I was looking for some capabiltiy under Table - Subset where I can specify normal distribution characteristic OR using a formula and then using Table- subset.

for the formula, I put densities of dataset1 against my desired X.dist2.

But now I need a uniform random way of sampling from the formulae outputs so to force the outcome distribution to be ~ X.dist2

OR any any other way ...

If I am to use a Table-Subset - the solution needs to specify a normal distribution from a parameter to pull from.

Or otherwise, I need to find a uniform random selection from a parameter (formula approach).

txnelson · Oct 15, 2018 07:56 PM

I am just free thinking here, but maybe what you could try, is to determine the distribution shape, using the Distribution Platform, then transform the data to a normal distribution. Then do the outputing of the Levels and then the subsetting with stratification. Just a thought.

Jim

Jeff_Perkinson · Oct 16, 2018 05:54 PM

Depending on your purpose getting a subset like this may be statistically questionable but one way to do it is to brute force it by using Resample Freq() to generate random samples and then checking that sample to see if it meets your desired characteristics. Here's a prototype that generates a table to simulate your original data and then resamples it, uses the Distribution platform to test the sample.

dt = New Table( "original",
	Add Rows( 1000 ), 
	//:parameter will have a random normal distribution with mean=0
	//and a stddev=1
	New Column( "parameter",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Random Normal() )
	),
	New Column( "Row", Numeric, "Continuous", Format( "Best", 12 ), Formula( Row() ) ), 
	//The Resample Freq() function generates a frequence count for sample with replacement
	//the .05 argument is the sample rate, i.e. we want a sample size of 5% of the original table
	New Column( "Resample Freq",
		Numeric,
		"Continuous",
		Format( "Best", 12 ),
		Formula( Resample Freq( 0.05 ) ), 

	)
);

//Generate a distribution of :parameter using the resample frequency as a Freq column to get
//Automatic Recalc will cause the Distribution to rerun when the Resample Freq column is re-evaluated
//to get a new sample
New Window( "Dist comparison",
	H List Box(
		dt << Distribution( Continuous Distribution( Column( :parameter ) ) ), 	

		dist = dt << Distribution(
			Automatic Recalc( 1 ),
			Freq( :Resample Freq ),
			Continuous Distribution( Column( :parameter ) )
		)
	)
);


//set the minimum mean value we will accept for the subsample
target_mean_min = .249;

//set the minimum mean value we will accept for the subsample
target_mean_max = .251;

//how many iterations are we willing to try before giving up?
max_iter = 1000;

iter = 1;

While( 
	//Report( dist )[Number Col Box( 2 )][1] is the mean value from the Distribution report
	!(target_mean_min <= Report( dist )[Number Col Box( 2 )][1] <= target_mean_max) &
	iter <= max_iter, 
	//get a new sample 
	Column( "Resample Freq" ) << eval formula;
	iter++;
);

If( 
	//check to see if we exited because we hit max_iter or got a good sample 
	iter > max_iter & !(target_mean_min <= Report( dist )[Number Col Box( 2 )][1] <=
	target_mean_max)
,
	Print( "No result in " || Char( max_iter ) || " random samples." ), 
	//got a good sample so get the subset
	dt << select where( :resample freq != 0 );
	dt << Subset( Selected Rows( 1 ), Selected columns only( 0 ) );
);

-Jeff

ian_jmp · Oct 17, 2018 06:08 AM

If "So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution" is what you want, maybe you could use this kind of approach, or something close to it?

NamesDefaultToHere(1);

// Make a big table
dt1 = NewTable("Big", NewColumn("Values", Numeric, Continuous, Formula(RandomUniform(-3,3))));
dt1 << addRows(40000);
dt1 << runFormulas;
Column(dt1, "Values") << deleteFormula;

// Make a small table
dt2 = NewTable("Small", NewColumn("Values", Numeric, Continuous, Formula(RandomNormal(0,1))));
dt2 << addRows(40);
dt2 << runFormulas;
Column(dt2, "Values") << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column(dt1, "Values") << getValues;
sv = Column(dt2, "Values") << getValues;
selectedRows = J(NRow(sv), 1, .);
for(i=1, i<=NRow(sv), i++,
	dist = (bv - sv[i])^2;
	selectedRows[i] = LocMin(dist);
);

// Subset dt1
dt3 = dt1 << subset(rows(selectedRows), LinkToOriginalDataTable(1));

// Look at the distributions
NewWindow("Compare Distributions",
	HListBox(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
		)
	);