Still playing with twitter trends. I have enough data collected that I'd like to speed up processing. I want to assign an integer index to each trending word by using an associative array indexed with the trend word. Below there are three techniques for adding a new word to the table or getting the index for an old word.
The first one uses TRY to catch the exception when a word is not in the table. It is the fastest if the word is present and slowest (by far) if the exception happens.
The second uses CONTAINS to test if the word is present, then adds it or looks it up. It is the fastest for adding and slowest for looking up (because it looks up twice.)
The last uses an associative array default value to see if the retrieved value already exists. It provides a nice compromise speed.
All the examples have a "t" loop that runs twice. The first time the xN array is empty and must be filled in. The second time all the values will be already present.
// Three ways to build a lookup table. No clear winner!
// Goal: assign an integer, starting at 1, to each new
// entry. For timing purposes, use the integer as the
// key and the value. The key will be a string in real
// code. The value will be used as an index elsewhere.
explain = {"create new entry","use existing entry"};
N = 1e5; // several seconds at 1e5
/*
Use TRY to determine when an entry needs to be created.
On the "catch" create the entry. If there is no catch,
the value is already fetched. The "t" loop runs 20X
slower the first time because the "catch" is expensive.
But the second time the "t" loop is the fastest solution.
*/
x1 = [=> ];
For( t = 1, t <= 2, t += 1,
start = HP Time();
For( ix = 1, ix < N, ix += 1,
Try( y = x1[ix], y = (x1[ix] = ix) )
);
stop = HP Time();
Write( Eval Insert( "\!nTry method: ^char((stop - start)/1e6,5,3)^ seconds ^explain[t]^" ) );
);
/*
Use CONTAINS to determine if the entry needs to be created.
Use the if's "then" clause to fetch the existing index or
"else" clause to create a new entry. The "t" loop has the
fastest speed on the first run, by far, and the slowest
speed on the second run.
*/
x2 = [=> ];
For( t = 1, t <= 2, t += 1,
start = HP Time();
For( ix = 1, ix < N, ix += 1,
If( Contains( x2, ix ),
y = x2[ix],
y = (x2[ix] = ix)
)
);
stop = HP Time();
Write( Eval Insert( "\!nContains method: ^char((stop - start)/1e6,5,3)^ seconds ^explain[t]^" ) );
);
/*
Use a DEFAULT value to determine if the entry needs to be
created. Fetch the value for the entry, check for the default
and create if needed. This gives an between value for both
"t" loops. It might be a good compromise choice if no extra
testing will be done and 0 can be used as a default. !y,
below, is faster than y==0.
*/
x3 = [=> 0];
For( t = 1, t <= 2, t += 1,
start = HP Time();
For( ix = 1, ix < N, ix += 1,
y = x3[ix];
If( !y, y = (x3[ix] = ix) );
);
stop = HP Time();
Write( Eval Insert( "\!ndefault 0 method: ^char((stop - start)/1e6,5,3)^ seconds ^explain[t]^" ) );
);
// quick test... the lack of a default value in x1 and x2
// makes them compare not equal, so add it before testing.
x1 << setdefaultvalue( 0 );
x2 << setdefaultvalue( 0 );
Show( x1 == x2, x1 == x3 );
Try method: 2.851 seconds create new entry
Try method: 0.099 seconds use existing entry
Contains method: 0.209 seconds create new entry
Contains method: 0.151 seconds use existing entry
default 0 method: 0.259 seconds create new entry
default 0 method: 0.117 seconds use existing entry
x1 == x2 = 1;
x1 == x3 = 1;
In red, TRY and CONTAINS provide the fastest times for existing and new entries. In green, DEFAULT 0 might be a good compromise. It really depends on how many entries will be created, and how many times they will be looked up.