All,
Wondering whether anyone has run similar trials to time matrices vs. data tables.
Clear Log(); Clear Globals(); Close All(DataTables,"No Save");
// Inputs
a = 7;
// Approach 1
TimerStart_1 = Tick Seconds();
dt1 = New Table("Approach-1","Invisible");
dt1 << New Column("Random-1",Numeric,Continuous,<< Set Values(Random Index(10^8,10^a)))
<< New Column("Random-2",Numeric,Continuous,<< Set Values(Random Index(10^8,10^a)))
<< New Column("Add",Numeric,Continuous,Formula(:Name("Random-1")+:Name("Random-2")))
<< New Column("Subtract",Numeric,Continuous,Formula(:Name("Random-1")-:Name("Random-2")))
<< New Column("Multiply",Numeric,Continuous,Formula(:Name("Random-1")*:Name("Random-2")))
<< New Column("Mod",Numeric,Continuous,Formula(Mod(:Name("Random-1"),:Name("Random-2"))));
TimerEnd_1 = Tick Seconds();
Show(TimerEnd_1 - TimerStart_1);
Close All(DataTables,"No Save");
// Approach 2
TimerStart_2 = Tick Seconds();
Mat_1 = Random Index(10^8,10^a);
Mat_2 = Random Index(10^8,10^a);
Add = Mat_1 + Mat_2 ;
Difference = Mat_1 - Mat_2 ;
Prod = E Mult(Mat_1,Mat_2);
Mod = Mod(Mat_1,Mat_2);
TimerEnd_2 = Tick Seconds();
Show(TimerEnd_2 - TimerStart_2);
Making the data table private might shave off some more time, but in general, is this a fair comparison, or does anybody favor one data container over another solely for speed?
You have 2 questions: is this fair? and do you favor one over another?
Regarding fairness, memory management is entirely up to the language provider. Keep in mind that tables have more methods than matrices. I would have phrased your first question as, "Is this change in performance between 1 million and 10 million rows expected?" Also, I am just guessing that beyond 1 million rows, JMP might be doing some storage compression, in other words saving memory at the cost of time.
What I favor depends upon the task. If I am doing a simulation, where a method is applied numerous times and I am collecting summary performance, then I'll use a matrix. Prior to JMP 13, which has better subtable referencing, if I had large tables I would work with matrices and then set the values into the table, specifically for performance, and I would never use formulas for large tables.
I have attached a script that appends to your script a third method that uses the JMP 13 subtable referencing syntax. The snippet below is the syntax for the Add column. [This site would not allow me to post the attachment, so go to the end to see the full script.]
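As a rough illustration of that matrix-first workflow (a minimal sketch only; the table name, variable names, and size here are hypothetical and not taken from the attached script), the arithmetic is done on matrices and the results are written to the table once, with no column formulas:

```jsl
// Illustrative sketch: names and sizes are assumptions, not from the attached script.
n = 10 ^ 5;
m1 = Random Index( 10 ^ 8, n ); // n x 1 matrix of random integers
m2 = Random Index( 10 ^ 8, n );
// Compute on the matrices, then store static values in the table.
dt = New Table( "Matrix-first", Invisible,
	New Column( "Random-1", Numeric, Continuous, Set Values( m1 ) ),
	New Column( "Random-2", Numeric, Continuous, Set Values( m2 ) ),
	New Column( "Add", Numeric, Continuous, Set Values( m1 + m2 ) )
);
Close( dt, No Save );
```

The columns hold plain values rather than formulas, so nothing is re-evaluated when the table changes.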
dt1[0, "Add"] = dt1[0,"Random-1"]+ dt1[0,"Random-2"] ;
Since you are interested in performance, also try this:
a=5;
tb1 = TickSeconds();
Mat_1 = Random Index(10^8,10^a);
te1 = TickSeconds();
tb2 = TickSeconds();
Mat_2 = Round (J(10^a, 1, Random Uniform(10^8) )*10^8,0);
te2 = TickSeconds();
show(te1-tb1, te2-tb2);
For a<7, the second method is far superior to method 1.
a = 6: te1 - tb1 = 2.43333333334886; te2 - tb2 = 0.25; //second method superior
a = 7: te1 - tb1 = 2.48333333339542; te2 - tb2 = 2.56666666665114; //both methods the same
a = 8: te1 - tb1 = 3.03333333344199; te2 - tb2 = 25.8166666666511; //second method much worse
And like you, when there is a big difference, I send a note to JMP as an FYI.
For people running the script: beware that it closes all tables, etc. Run it in a new session of JMP.
Clear Log(); Clear Globals(); Close All(DataTables,"No Save");
// Inputs
a = 7;
// Approach 1
TimerStart_1 = Tick Seconds();
dt1 = New Table("Approach-1","Invisible");
dt1 << New Column("Random-1",Numeric,Continuous,<< Set Values(Random Index(10^8,10^a)))
<< New Column("Random-2",Numeric,Continuous,<< Set Values(Random Index(10^8,10^a)))
<< New Column("Add",Numeric,Continuous,Formula(:Name("Random-1")+:Name("Random-2")))
<< New Column("Subtract",Numeric,Continuous,Formula(:Name("Random-1")-:Name("Random-2")))
<< New Column("Multiply",Numeric,Continuous,Formula(:Name("Random-1")*:Name("Random-2")))
<< New Column("Mod",Numeric,Continuous,Formula(Mod(:Name("Random-1"),:Name("Random-2"))));
TimerEnd_1 = Tick Seconds();
Show(TimerEnd_1 - TimerStart_1);
Close All(DataTables,"No Save");
// Approach 2
TimerStart_2 = Tick Seconds();
Mat_1 = Random Index(10^8,10^a);
Mat_2 = Random Index(10^8,10^a);
Add = Mat_1 + Mat_2 ;
Difference = Mat_1 - Mat_2 ;
Prod = E Mult(Mat_1,Mat_2);
Mod = Mod(Mat_1,Mat_2);
TimerEnd_2 = Tick Seconds();
Show(TimerEnd_2 - TimerStart_2);
// Approach 3
TimerStart_3 = Tick Seconds();
dt1 = New Table("Approach-3","Invisible", add rows(10^a),
New Column("Random-1",Numeric,Continuous ),
New Column("Random-2",Numeric,Continuous ),
New Column("Add",Numeric,Continuous ),
New Column("Subtract",Numeric,Continuous ),
New Column("Multiply",Numeric,Continuous ),
New Column("Mod",Numeric,Continuous)
);
// Column(dt1, "Random-1") << Set Values( Random Index(10^8,10^a) );
// Column(dt1, "Random-2") << Set Values( Random Index(10^8,10^a) );
// Column(dt1, "Add") << Set Values( dt1[0,"Random-1"]+ dt1[0,"Random-2"] );
// Column(dt1, "Subtract") << Set Values( dt1[0,"Random-1"]- dt1[0,"Random-2"] );
// Column(dt1, "Multiply") << Set Values( dt1[0,"Random-1"]:* dt1[0,"Random-2"] );
// Column(dt1, "Mod") << Set Values( Mod(dt1[0,"Random-1"], dt1[0,"Random-2"]) );
dt1[0,"Random-1"] = Random Index(10^8,10^a) ;
dt1[0,"Random-2"] = Random Index(10^8,10^a) ;
dt1[0, "Add"] = dt1[0,"Random-1"]+ dt1[0,"Random-2"] ;
dt1[0, "Subtract"] = dt1[0,"Random-1"]- dt1[0,"Random-2"] ;
dt1[0, "Multiply"] = dt1[0,"Random-1"]:* dt1[0,"Random-2"] ;
dt1[0, "Mod"] = Mod(dt1[0,"Random-1"], dt1[0,"Random-2"] );
TimerEnd_3 = Tick Seconds();
Show(TimerEnd_3 - TimerStart_3);
Close All(DataTables,"No Save");
@gzmorgan0,
Thank you for your detailed response. Your interpretation is mostly accurate, and I wish I had provided more clarity to begin with. I agree with and share your preferences between data tables and matrices; I would use data tables if I needed more built-in methods than matrices offer. However, the question I wanted to pick the community's brain on was the speed of handling large data, and whether and why the behavior changes as the data size increases. I would like to believe it is because of the storage compression that you are referring to.
One interesting aspect is the amount of time saved via data table subscripting. At a = 7, it shaved a good 7 seconds off the traditional column formula approach, with matrices still leading in terms of performance.
"Making the data table private, might shave some more time"
- actually, you might find that making the table private has a substantial impact on performance
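One way to check this on your own machine is to time the same fill on an invisible table and on a private one. This is only a sketch under assumptions: the table names, the column name "x", and the size n are made up for illustration, and it relies on the JMP 13 subscripting syntax used earlier in the thread.

```jsl
// Illustrative sketch: table names, column name and n are hypothetical.
n = 10 ^ 6;

// Invisible table
t0 = Tick Seconds();
dtInv = New Table( "Invisible table", Invisible, Add Rows( n ),
	New Column( "x", Numeric, Continuous )
);
dtInv[0, "x"] = J( n, 1, 1 ); // fill the whole column from a matrix
tInvisible = Tick Seconds() - t0;
Close( dtInv, No Save );

// Private table
t0 = Tick Seconds();
dtPrv = New Table( "Private table", Private, Add Rows( n ),
	New Column( "x", Numeric, Continuous )
);
dtPrv[0, "x"] = J( n, 1, 1 );
tPrivate = Tick Seconds() - t0;
Close( dtPrv, No Save );

Show( tInvisible, tPrivate );
```

Timings will vary with JMP version and hardware, so it is worth repeating the run a few times before drawing conclusions.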
@David_Burnham,
While I agree, I am generating multiple tables iteratively, so the reference gets overwritten, and making the data table private might therefore result in losing the table. But in general, I agree and follow the approach of making the data table private wherever possible.
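For what it is worth, one possible way around the overwritten reference is to collect each table reference in a list as it is created. This is a minimal sketch, not something tested on the actual workflow; the names dtList and "Iteration i" and the loop count are hypothetical.

```jsl
// Illustrative sketch: dtList, the table names and the loop count are assumptions.
dtList = {};
For( i = 1, i <= 3, i++,
	dt = New Table( "Iteration " || Char( i ), Private, Add Rows( 10 ),
		New Column( "x", Numeric, Continuous )
	);
	Insert Into( dtList, dt ); // keep a reference before dt is reassigned
);
Show( N Items( dtList ) );

// Close the private tables explicitly once they are no longer needed.
For( i = 1, i <= N Items( dtList ), i++,
	Close( dtList[i], No Save )
);
```

Closing the tables explicitly at the end matters here, since private tables stay in memory for as long as something still refers to them.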