cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Choose Language Hide Translation Bar
tsl
tsl
Level III

Looking for tips for rapid conversion of string to matrix

have data that I retrieve from a database which is in a pipe-delimited string format. For example "70|71|69|78". These strings represent numeric measurements.

A large number of measurements are made and the resulting strings are really long. On top of that I have a large dataset of these strings.

What I need to do is convert the strings to a matrix ( vector really ) and run some aggregating statistics on the resulting vector ( Max, 95th percentile etc ). I haven't found a way to do it in SQL on the database side which would be better still.

Here's how I'm doing it today, using a typical lengthy string example.

ExampleStr = "70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78";
		t1 = HPTime();
		V = parse("["||substitute(ExampleStr,"|",",")||"]") ;
		t2 = HPTime();
		show((t2-t1)/1000); // takes about 0.9 milliSec

So I replace the pipes with commas, then make the string look like a matrix by prefixing and suffixing matrix brackets. That part is quick.

Where the bulk of the time goes is in the Parse() call.

For this example with 775 values in the pipe-delimited string, it takes about 0.9 mSec to do the above code on my computer.

That might seem like very little time, but I need it to be at least an order of magnitude less. Because my data table, in my current example, has 762,000 rows of data!

Not every string is 775 values long, some are shorter, and not every iteration takes 0.9 mS, but in total it took almost 10 mins to go through the data table and do what I need to do  which is something like this :

 

dt << new column("P95_Value");
foreachrow(
	V = parse("["||substitute(:StrColumn,"|",",")||"]") ;
	:P95_Value = VQuantile(V,0.95);	
);

The VQuantile() call is not a problem, it is very fast. The bulk of the time goes on the parse() line.

I just wondered is there any smarter way to process this string and turn it into a vector, ideally in more like 0.09 milliSec or better !

Perhaps I may need to "farm out" this processing to a C# or Python script which may handle it faster ?

10 REPLIES 10
txnelson
Super User

Re: Looking for tips for rapid conversion of string to matrix

Here is an interesting way to solve your issue.  It uses the function Words() to convert the string into a JMP List with separate elements.  Then, by using it to create a new column in a data table, it converts the values to numeric.  Moving those values back into a matrix get to the result needed.

theList=words(ExampleStr,"|");
dtConvert=new table("temp", new column("theList", set values(theList)));
theMatrix = dtConvert:theList<<get values;
close( dtConvert, nosave );
Jim
tsl
tsl
Level III

Re: Looking for tips for rapid conversion of string to matrix

@txnelson , while that suggestion does indeed work, it is no faster than my current solution. In fact it is slower, taking about 4ms on my system vs 0.9 for the solution I'm using.

I can improve your solution by making the data table private, taking the total time down from ~ 4ms to ~  1ms which is comparable although a little slower than what I'm doing today

dtConvert=new table("temp", new column("theList", set values(theList)), private);

I'm hoping I can find something much faster but I may need a C# solution to achieve that

UersK
Level III

Re: Looking for tips for rapid conversion of string to matrix

Jim's 

0 ms

UersK
Level III

Re: Looking for tips for rapid conversion of string to matrix

How can write such data into a matrix?

null direct removal

 

Thank Jim!

txt="[[1668,3140325,7.25,7.27],[1660,4677712,7.17,null]]";

2023-02-16_11-18-48.png

lala
Level VII

Re: Looking for tips for rapid conversion of string to matrix

0.217 ms

 

 

2023-02-16_10-02-59.png

tsl
tsl
Level III

Re: Looking for tips for rapid conversion of string to matrix

@lala , note you were using Tick Seconds() which gives you times in seconds, whereas I was using HPTime() / 1000 which gives me times in Milliseconds. Your example is actually 216 msec which is 2 orders of magnitude worse than my solution which does it in < 1 ms, which is still much too slow for me !

Letting d1 be a private data table in your example does halve the time down to ~ 100msec, but still... much too slow for what I needed to do

lala
Level VII

Re: Looking for tips for rapid conversion of string to matrix

I see

2023-02-17_08-06-13.png

Craige_Hales
Super User

Re: Looking for tips for rapid conversion of string to matrix

CSV import can handle the pipe character directly. And MFI can run the imports in parallel. This example opens 1000 files in 2.2 seconds and stacks them in a single data table. If your files are already on disk, this might work well. Be careful with the delete directory command!

ExampleStr =
"70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78";
dir = "$temp/test";
Delete Directory( dir );
Create Directory( dir );
For( i = 1, i <= 1000, i += 1,
    Save Text File( Eval Insert( "^dir^/ExampleStr^right(char(1000+i),4)^.txt" ), Char( 1000 + i ) || "|" || ExampleStr )
);

t1 = HP Time();
Multiple File Import(
    <<Set Folder( dir ),
    <<Set Name Filter( "*.txt; " ),
    <<Set Name Enable( 1 ),
    <<Set CSV Has Headers( 0 ),
    <<Set CSV First Data Line( 1 ),
    <<Set CSV EOL Other( "|" )
) << Import Data;
t2 = HP Time();
Show( (t2 - t1) / 1000 ); // 1000 files in 2.2 seconds

Even the CSV one-at-a-time import @lala  showed can be improved by letting the CSV import handle the pipe character, and then does not require a disk file like MFI does.

The MFI script above was created by the MFI dialog, then the default options trimmed out for appearance.

MFI can also add a column that has the file name which an analysis platform can use as a by variable.

 

Craige
pmroz
Super User

Re: Looking for tips for rapid conversion of string to matrix

Here are some Oracle SQL solutions:

 

If you're using Oracle and you have APEX installed (Application Express), then apex_string.split will work.  Took .89 seconds on my DB

SELECT COLUMN_VALUE FROM TABLE(apex_string.split(
        '70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78',
        '|'));

 

If you don't have APEX installed then this will work, but it's slower (3.82 seconds):

WITH rws AS
 (SELECT '70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78'
         str
    FROM dual)
SELECT regexp_substr(str, '[^|]+', 1, LEVEL) VALUE
  FROM rws
CONNECT BY LEVEL <= LENGTH(str) - LENGTH(REPLACE(str, '|')) + 1;