Subscribe Bookmark RSS Feed

Need help optimizing non-standard join script

miguello

Community Trekker

Joined:

Jan 27, 2016

Folks,

 

I've been working quite some time now on non-standard join scripts, and it looks like I will need to do that a lot.

At this point I need to join two tables.

First one has series of measurements on different tools. Measurements are done in sessions. But the table only has the following in it: tool, eventtime, result.

The second table has all the information on sessions:

tool, sessionid, starttime, endtime

And I need to do analysis of measurements by sessionid

Basically, I need to join unique sessionid to each line in the first table.

The way I currently do that is I iterate through each line in the first table and search for corresponding line in the second table where tool is matched and eventtime is in between starttime and endtime using this:

currentStartEndInfoindex=dt2 << Get Rows Where(:tool==thisTool & :measstart < thisEventtime & :measend > thisEventtime);

where thisTool and thisEventtime are tool and time of the first table I'm iterating through. I get indeces of the list with unique sessionid from the second table that match this condition and then assemble a list from them which at the end of loop is used to populate new column in the first table. This "Get Rows Where" line takes 85% of the script running time.

 

Now, with few million lines in the first table and hundreds of thousands in the second table it takes somewhat between 8 to 16 hours.

I cannot afford it to run this long.

 

Therefore, question:

 

Is there any way to optimize that? Afterall, for each line in the first table it looks through the second table over and over again - not very efficient.

One idea that I have is to break second table into sublists by tool and then do "Get Rows Where" for each line on sublist that corresponds to that specific tool. Have no idea how to implement it the best, though.

Any help would be greatly appreciated.

 

Thanks!

2 REPLIES
Craige_Hales

Staff

Joined:

Mar 21, 2013

Here's a first approximation

 

aaa = New Table( "aaa", // the "event" table
    Add Rows( 6 ),
    New Column( "tool", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [100, 100, 100, 200, 200, 200] ) ),
    New Column( "eventtime", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [11, 21, 12, 31, 41, 42] ) ),
    New Column( "result", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [0, 1, 1, 1, 0, 0] ) ),
    New Column( "sessionID_added", Numeric, "Continuous", Format( "Best", 12 ), Set Selected, Set Values( [., ., ., ., ., .] ) )
);

bbb = New Table( "bbb", // the start-end table
    Add Rows( 4 ),
    New Column( "tool", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [100, 100, 200, 200] ) ),
    New Column( "sessionID", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [111, 222, 333, 444] ) ),
    New Column( "startTime", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [10, 20, 30, 40] ) ),
    New Column( "endTime", Numeric, "Continuous", Format( "Best", 12 ), Set Values( [19, 29, 39, 49] ) )
);

// build two lookup tables that find all the rows associated with a tool

aaaRowsFromTool = Associative Array();
For( irow = 1, irow <= N Rows( aaa ), irow++,
    If( !(aaaRowsFromTool << Contains( aaa:tool[irow] )), //
        aaaRowsFromTool[aaa:tool[irow]] = Eval List( {irow} )//
    , //
        Insert Into( aaaRowsFromTool[aaa:tool[irow]], irow ) //
    )
);

bbbRowsFromTool = Associative Array();
For( irow = 1, irow <= N Rows( bbb ), irow++,
    If( !(bbbRowsFromTool << Contains( bbb:tool[irow] )), //
        bbbRowsFromTool[bbb:tool[irow]] = Eval List( {irow} )//
    , //
        Insert Into( bbbRowsFromTool[bbb:tool[irow]], irow ) //
    )
);


toolList = bbbRowsFromTool << getkeys;
For( itool = 1, itool <= N Items( toolList ), itool += 1,
    toolno = toolList[itool];
    // build a sorted list of event times+row numbers for this tool
    aaaRowList = aaaRowsFromTool[toolno];
    aaa << selectrows( aaaRowList );
    
    aaasub = aaa << Subset( Selected Rows( 1 ), Selected columns only( 0 ), link to original data table( 1 ) );
    //aaasub << sort( by( eventTime ), replace table );
    timeMatrix = aaasub:eventtime << getAsMatrix;
    bbbRowList = bbbRowsFromTool[toolno];
    For( irow = 1, irow <= N Items( bbbRowList ), irow += 1,
        bbbRow = bbbRowList[irow];
        start = bbb:startTime[bbbRow];
        end = bbb:endTime[bbbRow];
        id = bbb:sessionID[bbbRow];
    	// find the eventTimes between start and end, for this tool
        //aaasub<<selectwhere(start < eventTime < end);
        locs = Loc( start <= timeMatrix <= end );
        aaasub[locs, {sessionID_added}] = id;
    );
    Close( aaasub, "nosave" );
);

It it works, but is too slow still, you might want to take advantage of sorting the "event" table by eventTime and replacing the loc( ) with a binary lookup.  

Craige
miguello

Community Trekker

Joined:

Jan 27, 2016

Craige,

Thanks a lot for the whole script. With other projects at hand I'll be evaluating and implementing your idea some time next week. I'll test it and let you know if it works as expected and if it gives any benefit in speed.
Thanks a lot again!