Solved: Random subset selection for a given distribution - Page 2

altug_bayram · Oct 15, 2018 03:33 PM

Hello ...

I am trying to randomly select from a column of data so that the selection follows a given normal distribution. The given distribution is different from that of the data.

Is there an easy way of accomplishing this with Table Subset's current options? I could not see anything that would help me.

I also tried a formula, basically NormalDensity (X, mu, sigma) to plot the densities of the "data" against the given distribution.

But I still don't know how to tell JMP to select uniformly and randomly using this formula so that my outcome is as close as possible to the given distribution.

Thx. Appreciate the support.

altug_bayram · Oct 17, 2018 01:14 PM

Ian,

This is brilliant. I sort of thought of a similar approach but was unable to do so due to my basic script knowledge.

I was able to adapt it to my data and am getting what I want. I may perhaps expand this to 2-parameter distribution selection which I should be able to do from your script.

thanks for your support.

altug_bayram · Oct 17, 2018 01:19 PM

Ian,

Forgot to ask one question. Is your selection done uniquely from the original table, or is there a chance of duplicate selections ending up in the sub-table?

thx

ian_jmp · Oct 17, 2018 01:38 PM

Yes, I wondered about this. Right now it does, but if you don’t want duplicates, just add an extra line that sets the element of the vector ‘bv’ that has been found to missing ‘.’.

altug_bayram · Oct 17, 2018 02:25 PM

Ian

I was aiming for a sub-sample of ~46K rows, and 44K of those turned out to be duplicates.

Could you pls write me that line for "bv" per your suggestion?

thx again.

ian_jmp · Oct 17, 2018 02:33 PM

I’m sorry - I won’t have access to a computer for a few days due to a vacation. But I’m sure someone else can do it . . .

Jeff_Perkinson · Oct 17, 2018 05:58 PM

The approach that @ian_jmp found was the first thing I tried, but I abandoned it too early apparently as Ian found a way to overcome some shortcomings that I ran into. Nice job.

As for eliminating resampling of rows, I believe that Ian was suggesting adding the line:

	bv[Loc Min( dist )] = .;

to the for() loop. Thusly:

Names Default To Here( 1 );

// Make a big table
dt1 = New Table( "Big",
	New Column( "Values", Numeric, Continuous, Formula( Random Uniform( -3, 3 ) ) ),
	New Column("Row", Formula(Row()))
);
dt1 << addRows( 40000 );
dt1 << runFormulas;
Column( dt1, "Values" ) << deleteFormula;

// Make a small table
dt2 = New Table( "Small",
	New Column( "Values", Numeric, Continuous, Formula( Random Normal( 0, 1 ) ) )
);
dt2 << addRows( 40 );
dt2 << runFormulas;
Column( dt2, "Values" ) << deleteFormula;

// For each value in dt2, find a value in dt1 that's closest
bv = Column( dt1, "Values" ) << getValues;
sv = Column( dt2, "Values" ) << getValues;
selectedRows = J( N Row( sv ), 1, . );
For( i = 1, i <= N Row( sv ), i++,
	dist = (bv - sv[i]) ^ 2;
	selectedRows[i] = Loc Min( dist );
	//the line below prevents rows from the "Big" table from being re-sampled
	bv[Loc Min( dist )] = .;
);

// Subset dt1
dt3 = dt1 << subset( rows( selectedRows ), LinkToOriginalDataTable( 1 ) );

// Look at the distributions
New Window( "Compare Distributions",
	H List Box(
		dt1 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt2 << Distribution( Continuous Distribution( Column( :Values ) ) ),
		dt3 << Distribution( Continuous Distribution( Column( :Values ) ) )
	)
);

-Jeff
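To confirm that the extra line really removes the repeats, one could count duplicate row numbers in selectedRows after the loop. This is only a minimal sketch: the vector below is made up for illustration and stands in for the selectedRows produced by the script above.

```jsl
Names Default To Here( 1 );

// Hypothetical stand-in for the selectedRows vector built in the loop above
selectedRows = [5, 12, 5, 7, 12, 3];

// After sorting, any repeated row number sits next to its copy
sorted = Sort Ascending( selectedRows );
n = N Row( sorted );

// Compare each element with its neighbor and count the matches;
// 0 would mean every selected row is unique
nDup = Sum( sorted[1 :: (n - 1)] == sorted[2 :: n] );
Show( nDup );
```

With the real selectedRows in place of the made-up vector, a nonzero count would mean some rows of the "Big" table were picked more than once.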

altug_bayram · Oct 17, 2018 09:51 PM

Jeff

Your modification on Ian's original solution worked perfectly. I can now extract random data w/ a prescribed distribution. It would be nice to see this as an option under Table Subsetting ...

My problem is 100% solved.

thanks for your support.

altug_bayram · Oct 17, 2018 01:12 PM

Jeff,

Thanks for your effort.

Your suggestion had a slight chance of working. I applied it to my data and, unfortunately, since the selection is random, it did not work. I ran it for about 3000 iterations and stopped it after 10 hours. The movement in the mean of the data was insignificant.

This isn't a simple technique (unless your name is Ian) .. I would imagine some sort of optimizer being involved, trying to pull randomly from the direction that makes more sense. GAs may be overkill but can certainly do this kind of optimization.

From a statistical perspective, this is perfectly OK for what I do. The 400+K represents a large data set of assets. I am trying to find assets that are similar to a sub-group under investigation.

thanks again.

StevenJLandry · Jan 17, 2020 08:59 AM

You have got to be kidding me. Getting a random sample from another column, and showing it to you in a separate column, is a menu selection in Minitab. The answer to this took two pages of discussion and dozens of lines of code. Why are simple things so difficult in JMP?

Jeff_Perkinson · Jan 17, 2020 10:00 AM

The original request here was not for simple random sampling of another column.

If that's all you need you can use the Col Shuffle() function to generate a new column with random selection from another.

-Jeff
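For the simple case, a minimal sketch of the Col Shuffle() idea might look like the script below. The table name, the column names, and the values are made up for illustration.

```jsl
Names Default To Here( 1 );

// Hypothetical example table; names and values are illustrative only
dt = New Table( "Shuffle Example",
	New Column( "Values", Numeric, Continuous,
		Set Values( [10, 20, 30, 40, 50, 60, 70, 80, 90, 100] )
	)
);

// Col Shuffle() gives each row a randomly permuted row number,
// so indexing :Values by it returns the same values in random order
dt << New Column( "Shuffled", Numeric, Continuous,
	Formula( :Values[Col Shuffle()] )
);

// Freeze the result so it is not reshuffled when formulas re-evaluate
dt << runFormulas;
Column( dt, "Shuffled" ) << deleteFormula;
```

Because the new column is a permutation of the original, its first n rows can serve as a simple random sample of size n without replacement.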
