I have data that I retrieve from a database which is in a pipe-delimited string format. For example "70|71|69|78". These strings represent numeric measurements.
A large number of measurements are made and the resulting strings are really long. On top of that I have a large dataset of these strings.
What I need to do is convert the strings to a matrix ( vector really ) and run some aggregating statistics on the resulting vector ( Max, 95th percentile etc ). I haven't found a way to do it in SQL on the database side which would be better still.
Here's how I'm doing it today, using a typical lengthy string example.
ExampleStr = "70|71|71|72|72|72|72|72|74|74|74|74|74|74|74|74|74|74|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|75|76|75|75|76|76|76|75|75|76|76|75|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|77|76|77|76|77|76|76|77|76|76|76|76|76|76|76|76|77|76|76|76|76|77|77|77|77|76|76|77|77|76|76|76|76|77|76|77|76|76|76|76|76|77|76|76|76|77|77|77|76|76|77|76|76|76|77|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|76|76|77|77|77|77|77|77|77|77|76|77|77|76|76|77|76|76|77|76|77|76|77|76|77|77|77|76|76|76|77|77|76|76|77|76|76|76|76|76|76|76|77|76|76|76|76|76|76|76|77|77|77|76|77|77|76|76|77|76|77|76|77|77|77|77|77|76|76|77|77|77|77|77|77|76|76|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|76|76|76|77|77|77|77|77|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|76|76|77|77|77|77|77|77|76|77|77|77|76|77|77|77|77|76|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|76|77|77|77|77|76|77|76|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|77|76|77|77|77|77|77|77|77|77|77|77|77|77|77|78|78|78|78|78|78|78|78|78|78|77|78|78|77|78|77|77|77|78|78|78|78|78|78|77|78|77|78|78|78|78|78|78|77|78|78|78|77|78|77|78|78|78|77|78|77|78|78|78|78|78|77|78|78|78|78|78|78|78|78|78|77|78|78|78|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|78|77|77|78|77|77|78|77|78|78|78|78|78|78|78|78|77|77|78|78|78|77|77|78|78|78|78|78|78|78|78|78|78|78|78|77|78|78|78|78|78|78|77|78|78|78|78|77|78|78|78|78|77|77|78|78|78|78|78|78";
t1 = HPTime();
V = parse("["||substitute(ExampleStr,"|",",")||"]") ;
t2 = HPTime();
show((t2-t1)/1000); // takes about 0.9 milliSec
So I replace the pipes with commas, then make the string look like a matrix by prefixing and suffixing matrix brackets. That part is quick.
Where the bulk of the time goes is in the Parse() call.
For this example with 775 values in the pipe-delimited string, it takes about 0.9 mSec to do the above code on my computer.
That might seem like very little time, but I need it to be at least an order of magnitude less. Because my data table, in my current example, has 762,000 rows of data!
Not every string is 775 values long, some are shorter, and not every iteration takes 0.9 mS, but in total it took almost 10 mins to go through the data table and do what I need to do which is something like this :
dt << new column("P95_Value");
foreachrow(
V = parse("["||substitute(:StrColumn,"|",",")||"]") ;
:P95_Value = VQuantile(V,0.95);
);
The VQuantile() call is not a problem, it is very fast. The bulk of the time goes on the parse() line.
I just wondered is there any smarter way to process this string and turn it into a vector, ideally in more like 0.09 milliSec or better !
Perhaps I may need to "farm out" this processing to a C# or Python script which may handle it faster ?