Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

- JMP User Community
- :
- Discussions
- :
- Random subset selection for a given distribution

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Oct 15, 2018 12:33 PM
(6651 views)

Hello ...

I am trying to randomly select from a column of data for a given normal distribution . The given distribution is different than that of the data.

Is there an easy way of accomplishing this with Table Subset's current options ? I could not see anything that could help me.

I also tried a formula, basically NormalDensity (X, mu, sigma) to plot the densities of the "data" against the given distribution.

But I still don't know how to say uniformly & randomly select from this formula so my outcome is as close as possible to the given distribution.

Thx. Appreciate the support.

2 ACCEPTED SOLUTIONS

Accepted Solutions

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

If "So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution" is what you want, maybe you could use this kind of approach, or something close to it?

```
NamesDefaultToHere(1);
// Make a big table
dt1 = NewTable("Big", NewColumn("Values", Numeric, Continuous, Formula(RandomUniform(-3,3))));
dt1 << addRows(40000);
dt1 << runFormulas;
Column(dt1, "Values") << deleteFormula;
// Make a small table
dt2 = NewTable("Small", NewColumn("Values", Numeric, Continuous, Formula(RandomNormal(0,1))));
dt2 << addRows(40);
dt2 << runFormulas;
Column(dt2, "Values") << deleteFormula;
// For each value in dt2, find a value in dt1 that's closest
bv = Column(dt1, "Values") << getValues;
sv = Column(dt2, "Values") << getValues;
selectedRows = J(NRow(sv), 1, .);
for(i=1, i<=NRow(sv), i++,
dist = (bv - sv[i])^2;
selectedRows[i] = LocMin(dist);
);
// Subset dt1
dt3 = dt1 << subset(rows(selectedRows), LinkToOriginalDataTable(1));
// Look at the distributions
NewWindow("Compare Distributions",
HListBox(
dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
)
);
```

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

The approach that @ian_jmp found was the first thing I tried, but I abandoned it too early apparently as Ian found a way to overcome some shortcomings that I ran into. Nice job.

As for eliminating resampling of rows, I believe that Ian was suggesting adding the line:

` bv[Loc Min( dist )] = .;`

to the for() loop. Thusly:

```
Names Default To Here( 1 );
// Make a big table
dt1 = New Table( "Big",
New Column( "Values", Numeric, Continuous, Formula( Random Uniform( -3, 3 ) ) ),
New Column("Row", Formula(Row()))
);
dt1 << addRows( 40000 );
dt1 << runFormulas;
Column( dt1, "Values" ) << deleteFormula;
// Make a small table
dt2 = New Table( "Small",
New Column( "Values", Numeric, Continuous, Formula( Random Normal( 0, 1 ) ) )
);
dt2 << addRows( 40 );
dt2 << runFormulas;
Column( dt2, "Values" ) << deleteFormula;
// For each value in dt2, find a value in dt1 that's closest
bv = Column( dt1, "Values" ) << getValues;
sv = Column( dt2, "Values" ) << getValues;
selectedRows = J( N Row( sv ), 1, . );
For( i = 1, i <= N Row( sv ), i++,
dist = (bv - sv[i]) ^ 2;
selectedRows[i] = Loc Min( dist );
//the line below prevents rows from the "Big" table from being re-sampled
bv[Loc Min( dist )] = .;
);
// Subset dt1
dt3 = dt1 << subset( rows( selectedRows ), LinkToOriginalDataTable( 1 ) );
// Look at the distributions
New Window( "Compare Distributions",
H List Box(
dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
)
);
```

-Jeff

22 REPLIES 22

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

Once you have your population column from which you want to sample, try out, this workflow:

1. Select the column from which you want to sample.

2. From the JMP main menu bar, select Rows -> Row Selection -> Select Randomly -> and then your sample size or sampling rate.

This will return selected columns in your original data table, and from there Row Subset will give you a data table.

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

Peter .. thanks for your suggestion.

but that does not seem to work.

My original data's mean/stdev are: 12816 and 1726 from a very large sample (400K +).

I am trying to downselect a sub-sample that is closer to 14300 and 2050

I tried your suggestion on the formula (i.e. densities of the data wrt 14300, 2050) and unless I can force this random dist to be uniformly selected among the densities, I don't think I can get what I need. In fact, sub-sampling came out as expected w/ the original distribution.

but that does not seem to work.

My original data's mean/stdev are: 12816 and 1726 from a very large sample (400K +).

I am trying to downselect a sub-sample that is closer to 14300 and 2050

I tried your suggestion on the formula (i.e. densities of the data wrt 14300, 2050) and unless I can force this random dist to be uniformly selected among the densities, I don't think I can get what I need. In fact, sub-sampling came out as expected w/ the original distribution.

Highlighted
##

I see...I did not fully understand your original post. Your right my suggestion will just return a distribution with very similar parameters to the original distribution. If you want observations of a random variable with the desired parameter estimates, why can't you just create a column with those parameters and move on. Sure, the specific observations won't necessarily reside in the original column, but that may or may not influence what you're trying to acheive analytically or practically? They'll at least be plausible observations, though not exact matches.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

My real goal is to compare N (other) parameters from two data sets already available, I am not trying to create a fictitious distribution. (Maybe I am misunderstanding your proposal).

The given distribution is a constraint on an important parameter that may influences others. I am trying to select assets w/ similar distribution from "data" to a given distribution so to minimize the impact of this influential parameter. In this context, "distribution" only applies for a single parameter, and not multiple.

So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution.

The given distribution is a constraint on an important parameter that may influences others. I am trying to select assets w/ similar distribution from "data" to a given distribution so to minimize the impact of this influential parameter. In this context, "distribution" only applies for a single parameter, and not multiple.

So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution.

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

I believe that you can get what you want by doing a simple 2 step process.

1. Open the Distribution Platform for the column you have in question. Then

Save==>Level Numbers.

This will create a new column with a distribution taken directly from the actual data

2. Create a subset of the data and specify

Random - sampling rate:

and also

Stratify, based upon the new Level column created in the step above

Jim

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

Jim, thanks for your suggestion.

But your proposal does not have a way to force select according for the given distribution. So it wouldn't produce what I am looking for. To validate, I did follow the steps and it produced a dist very similar to the original one.

Maybe I am not explaning in a clear way.

I have 400k+ rows X N columns of data (huge matrix) . If I call one of N params, as "X" for argument sake.

X's distribution from this data - call it X.dist1.

I have a second dataset which has Xagain, and this time w/ X.dist2

X.dist1 is not too far but is different than X.dist2.

Given the fact that I have plenty of data in the first dataset (400k+), I am hence trying to sub-select data set from 1 so that selected data X distribution is the closest I can get to X.dist2.

Hence I was looking for some capabiltiy under Table - Subset where I can specify normal distribution characteristic OR using a formula and then using Table- subset.

for the formula, I put densities of dataset1 against my desired X.dist2.

But now I need a uniform random way of sampling from the formulae outputs so to force the outcome distribution to be ~ X.dist2

OR any any other way ...

If I am to use a Table-Subset - the solution needs to specify a normal distribution from a parameter to pull from.

Or otherwise, I need to find a uniform random selection from a parameter (formula approach).

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

I am just free thinking here, but maybe what you could try, is to determine the distribution shape, using the Distribution Platform, then transform the data to a normal distribution. Then do the outputing of the Levels and then the subsetting with stratification. Just a thought.

Jim

Highlighted
##

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

Re: Random subset selection for a given distribution

Depending on your purpose getting a subset like this may be statistically questionable but one way to do it is to brute force it by using Resample Freq() to generate random samples and then checking that sample to see if it meets your desired characteristics. Here's a prototype that generates a table to simulate your original data and then resamples it, uses the Distribution platform to test the sample.

```
dt = New Table( "original",
Add Rows( 1000 ),
//:parameter will have a random normal distribution with mean=0
//and a stddev=1
New Column( "parameter",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula( Random Normal() )
),
New Column( "Row", Numeric, "Continuous", Format( "Best", 12 ), Formula( Row() ) ),
//The Resample Freq() function generates a frequence count for sample with replacement
//the .05 argument is the sample rate, i.e. we want a sample size of 5% of the original table
New Column( "Resample Freq",
Numeric,
"Continuous",
Format( "Best", 12 ),
Formula( Resample Freq( 0.05 ) ),
)
);
//Generate a distribution of :parameter using the resample frequency as a Freq column to get
//Automatic Recalc will cause the Distribution to rerun when the Resample Freq column is re-evaluated
//to get a new sample
New Window( "Dist comparison",
H List Box(
dt << Distribution( Continuous Distribution( Column( :parameter ) ) ),
dist = dt << Distribution(
Automatic Recalc( 1 ),
Freq( :Resample Freq ),
Continuous Distribution( Column( :parameter ) )
)
)
);
//set the minimum mean value we will accept for the subsample
target_mean_min = .249;
//set the minimum mean value we will accept for the subsample
target_mean_max = .251;
//how many iterations are we willing to try before giving up?
max_iter = 1000;
iter = 1;
While(
//Report( dist )[Number Col Box( 2 )][1] is the mean value from the Distribution report
!(target_mean_min <= Report( dist )[Number Col Box( 2 )][1] <= target_mean_max) &
iter <= max_iter,
//get a new sample
Column( "Resample Freq" ) << eval formula;
iter++;
);
If(
//check to see if we exited because we hit max_iter or got a good sample
iter > max_iter & !(target_mean_min <= Report( dist )[Number Col Box( 2 )][1] <=
target_mean_max)
,
Print( "No result in " || Char( max_iter ) || " random samples." ),
//got a good sample so get the subset
dt << select where( :resample freq != 0 );
dt << Subset( Selected Rows( 1 ), Selected columns only( 0 ) );
);
```

-Jeff

Highlighted

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Get Direct Link
- Email to a Friend
- Report Inappropriate Content

If "So bottom line, I need to sub-select from 400K+ rows a number of sub-samples so that the distribution for the selected sub-sample is as close as possible to the given distribution" is what you want, maybe you could use this kind of approach, or something close to it?

```
NamesDefaultToHere(1);
// Make a big table
dt1 = NewTable("Big", NewColumn("Values", Numeric, Continuous, Formula(RandomUniform(-3,3))));
dt1 << addRows(40000);
dt1 << runFormulas;
Column(dt1, "Values") << deleteFormula;
// Make a small table
dt2 = NewTable("Small", NewColumn("Values", Numeric, Continuous, Formula(RandomNormal(0,1))));
dt2 << addRows(40);
dt2 << runFormulas;
Column(dt2, "Values") << deleteFormula;
// For each value in dt2, find a value in dt1 that's closest
bv = Column(dt1, "Values") << getValues;
sv = Column(dt2, "Values") << getValues;
selectedRows = J(NRow(sv), 1, .);
for(i=1, i<=NRow(sv), i++,
dist = (bv - sv[i])^2;
selectedRows[i] = LocMin(dist);
);
// Subset dt1
dt3 = dt1 << subset(rows(selectedRows), LinkToOriginalDataTable(1));
// Look at the distributions
NewWindow("Compare Distributions",
HListBox(
dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
)
);
```

Article Labels

There are no labels assigned to this post.