
fastest way to get distinct items from a matrix/list

vince_faller
Super User (Alumni)

I'm trying to get the distinct items in a fairly large vector and I was wondering if anyone knew of a faster way than making an associative array.  I tried the following.  

 

Names Default to here(1);
dt = open("$SAMPLE_DATA/Probe.jmp");
rows = dt << Get rows Where(Num(:Wafer Number) <= 10);
times = Column(dt, "Start Time")[rows];

//option 1
st = HPTime();
distinct1 = associative array(as list(times));
distinct1 = distinct1 << Get Keys;
opt1 = HPTime()-st;
show(opt1);

//option 2
st = HPTime();
dt_sub = dt  << Subset(
	Selected Rows( 0 ),
	Rows( rows ),
	Selected columns only( 0 )
);
Summarize(dt_sub, distinct2 = by(:Start Time));
close(dt_sub, no save);
opt2 = HPTime()-st;
show(opt2);

//option 3
st = HPTime();
distinct3 = [];
for(i=1, i<=nrows(times), i++, 
	if(!any(distinct3 == times[i]), 
		distinct3 ||= times[i]
	)
);
opt3 = HPTime()-st;
show(opt3);

//option 4
st = HPTime();
distinct4 = {};
for(i=1, i<=nrows(times), i++, 
	if(!Contains(distinct4, times[i]), 
		insert into(distinct4, times[i])
	)
);
opt4 = HPTime()-st;
show(opt4);

show(nitems(distinct1), nitems(distinct2), nrows(distinct3`), nitems(distinct4));

Which gave the following output (HPTime() reports microseconds):

opt1 = 1082;
opt2 = 40914;
opt3 = 5853;
opt4 = 8001;
N Items(distinct1) = 211;
N Items(distinct2) = 211;
N Rows(distinct3`) = 211;
N Items(distinct4) = 211;

 

 

*Edit* Okay, definitely DON'T use an associative array: it doesn't allow floating point keys and rounds everything.  The for loop seems pretty slow for this operation, so does anyone have anything better?

Vince Faller - Predictum
1 ACCEPTED SOLUTION

Craige_Hales
Super User


Re: fastest way to get distinct items from a matrix/list

Nice use of Summary.

 

The best solution may be different for different size problems; a large setup overhead might pay off on a large enough problem. 

 

I was investigating how to turn floating point numbers into keys for associative arrays and only came up with slower answers involving strings made from the numbers.

 

 

You might scale the floating point numbers into integers between +/- 2^52 and use the integers to index associative arrays.  Yes, 2^52, not 2^32 and not 2^64. The 8-byte double precision floating point numbers JMP uses have a 52-bit fraction (see Wikipedia).  This would be a lossy conversion but could keep most of the information.

 

Craige
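
A minimal sketch of the scaling idea described above, not code from the original post. The toy vector and the scale factor of 1000 are assumptions chosen for illustration; a real scale would be picked so that every scaled value stays inside +/- 2^52.

```jsl
Names Default To Here( 1 );

// Toy data standing in for the real vector (hypothetical values)
times = [0.5, 1.25, 0.5, 3.75, 1.25];

// Assumed scale factor: keeps three decimal places; anything finer is lost
scale = 1000;

// Turn each value into an integer key (this is the lossy step)
keys = Round( times * scale );

// Using the integers as associative array keys collapses the duplicates
aa = Associative Array( As List( keys ) );

// Scale the surviving keys back to the original units
distinct = Matrix( aa << Get Keys ) / scale;
Show( distinct );
```

Two values that differ only beyond the retained precision map to the same key, which is the loss mentioned above.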


5 REPLIES


Re: fastest way to get distinct items from a matrix/list

I did the same using an associative array. As you note, to handle numbers less than zero or with many decimal places, you first need to take a sample of the data, work out a good scaling factor, and then scale the elements of the vector.

That still seems to work faster than using a for loop, since it is just a bit of matrix manipulation followed by creating the associative array.
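
One possible way to pick such a scaling factor, sketched under assumptions rather than taken from the post: use the largest power of ten that keeps the biggest magnitude in the sample below 2^52. The sample values and the names `sample`, `limit`, and `scale` are hypothetical.

```jsl
Names Default To Here( 1 );

// Hypothetical sample standing in for a subset of the real vector
sample = [0.00125, 12.5, -3.75, 0.00125];

// Integers up to this magnitude fit in the 52-bit fraction of a double
limit = 2 ^ 52;

// Largest power of ten that keeps every scaled sample value inside +/- limit
scale = 10 ^ Floor( Log10( limit / Max( Abs( sample ) ) ) );

// Integer keys ready to be used in an associative array
keys = Round( sample * scale );
Show( scale, keys );
```

This assumes the sample contains at least one nonzero value and that the rest of the vector is no larger in magnitude than the sample; otherwise the scale would need a safety margin.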
txnelson
Super User


Re: fastest way to get distinct items from a matrix/list

Here is another alternative that you might want to try

Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Probe.jmp" );
dt << Select Where( Num( :Wafer Number ) > 10 );
dt << exclude;

dtSumm = dt << Summary(
	private,
	Group( :Start Time ),
	Freq( "None" ),
	Weight( "None" ),
	statistics column name format( "column" ),
	Link to original data table( 0 )
);
dtSumm << delete rows;
Distinct = dtSumm:Start Time << get values;

Close( dtSumm, nosave );
dt << clear row states;
Jim
vince_faller
Super User (Alumni)


Re: fastest way to get distinct items from a matrix/list

For most of the sets I've tested, scaling by 2^52 does seem to be the fastest (with acceptable loss).  Thanks for all the feedback.  

Vince Faller - Predictum
ms
Super User (Alumni)


Re: fastest way to get distinct items from a matrix/list

Not sure how it compares in speed, but it's faster to type...

 

distinct = Design(times, <<levels)[2];