Looking for tips for rapid conversion of string to matrix

tsl · Jun 8, 2023 9:36 AM

I have data that I retrieve from a database which is in a pipe-delimited string format. For example "70|71|69|78". These strings represent numeric measurements.

A large number of measurements are made and the resulting strings are really long. On top of that I have a large dataset of these strings.

What I need to do is convert the strings to a matrix ( vector really ) and run some aggregating statistics on the resulting vector ( Max, 95th percentile etc ). I haven't found a way to do it in SQL on the database side which would be better still.

Here's how I'm doing it today, using a typical lengthy string example.

ExampleStr = "70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78";
		t1 = HPTime();
		V = parse("["||substitute(ExampleStr,"|",",")||"]") ;
		t2 = HPTime();
		show((t2-t1)/1000); // takes about 0.9 milliSec

So I replace the pipes with commas, then make the string look like a matrix by prefixing and suffixing matrix brackets. That part is quick.

Where the bulk of the time goes is in the Parse() call.

For this example with 775 values in the pipe-delimited string, it takes about 0.9 mSec to do the above code on my computer.

That might seem like very little time, but I need it to be at least an order of magnitude less. Because my data table, in my current example, has 762,000 rows of data!

Not every string is 775 values long, some are shorter, and not every iteration takes 0.9 mS, but in total it took almost 10 mins to go through the data table and do what I need to do which is something like this :

dt << new column("P95_Value");
foreachrow(
	V = parse("["||substitute(:StrColumn,"|",",")||"]") ;
	:P95_Value = VQuantile(V,0.95);	
);

The VQuantile() call is not a problem, it is very fast. The bulk of the time goes on the parse() line.

I just wondered is there any smarter way to process this string and turn it into a vector, ideally in more like 0.09 milliSec or better !

Perhaps I may need to "farm out" this processing to a C# or Python script which may handle it faster ?

txnelson · Feb 14, 2023 02:33 PM

Here is an interesting way to solve your issue. It uses the function Words() to convert the string into a JMP List with separate elements. Then, by using it to create a new column in a data table, it converts the values to numeric. Moving those values back into a matrix get to the result needed.

theList=words(ExampleStr,"|");
dtConvert=new table("temp", new column("theList", set values(theList)));
theMatrix = dtConvert:theList<<get values;
close( dtConvert, nosave );

Jim

tsl · Feb 15, 2023 04:13 PM

@txnelson , while that suggestion does indeed work, it is no faster than my current solution. In fact it is slower, taking about 4ms on my system vs 0.9 for the solution I'm using.

I can improve your solution by making the data table private, taking the total time down from ~ 4ms to ~ 1ms which is comparable although a little slower than what I'm doing today

dtConvert=new table("temp", new column("theList", set values(theList)), private);

I'm hoping I can find something much faster but I may need a C# solution to achieve that

UersK · Feb 15, 2023 09:28 PM

Jim's

0 ms

UersK · Feb 15, 2023 10:19 PM

How can write such data into a matrix?

null direct removal

Thank Jim!

txt="[[1668,3140325,7.25,7.27],[1660,4677712,7.17,null]]";

lala · Feb 16, 2023 3:59 PM

0.217 ms

tsl · Feb 16, 2023 06:12 PM

@lala , note you were using Tick Seconds() which gives you times in seconds, whereas I was using HPTime() / 1000 which gives me times in Milliseconds. Your example is actually 216 msec which is 2 orders of magnitude worse than my solution which does it in < 1 ms, which is still much too slow for me !

Letting d1 be a private data table in your example does halve the time down to ~ 100msec, but still... much too slow for what I needed to do

lala · Feb 16, 2023 07:07 PM

I see

Craige_Hales · Feb 17, 2023 07:31 AM

CSV import can handle the pipe character directly. And MFI can run the imports in parallel. This example opens 1000 files in 2.2 seconds and stacks them in a single data table. If your files are already on disk, this might work well. Be careful with the delete directory command!

ExampleStr =
"70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78";
dir = "$temp/test";
Delete Directory( dir );
Create Directory( dir );
For( i = 1, i <= 1000, i += 1,
    Save Text File( Eval Insert( "^dir^/ExampleStr^right(char(1000+i),4)^.txt" ), Char( 1000 + i ) || "|" || ExampleStr )
);

t1 = HP Time();
Multiple File Import(
    <<Set Folder( dir ),
    <<Set Name Filter( "*.txt; " ),
    <<Set Name Enable( 1 ),
    <<Set CSV Has Headers( 0 ),
    <<Set CSV First Data Line( 1 ),
    <<Set CSV EOL Other( "|" )
) << Import Data;
t2 = HP Time();
Show( (t2 - t1) / 1000 ); // 1000 files in 2.2 seconds

Even the CSV one-at-a-time import @lala showed can be improved by letting the CSV import handle the pipe character, and then does not require a disk file like MFI does.

The MFI script above was created by the MFI dialog, then the default options trimmed out for appearance.

MFI can also add a column that has the file name which an analysis platform can use as a by variable.

Craige

pmroz · Feb 17, 2023 02:18 PM

Here are some Oracle SQL solutions:

If you're using Oracle and you have APEX installed (Application Express), then apex_string.split will work. Took .89 seconds on my DB

SELECT COLUMN_VALUE FROM TABLE(apex_string.split(
        '70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78',
        '|'));

If you don't have APEX installed then this will work, but it's slower (3.82 seconds):

WITH rws AS
 (SELECT '70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78'
         str
    FROM dual)
SELECT regexp_substr(str, '[^|]+', 1, LEVEL) VALUE
  FROM rws
CONNECT BY LEVEL <= LENGTH(str) - LENGTH(REPLACE(str, '|')) + 1;

Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix

Re: Looking for tips for rapid conversion of string to matrix