Solved: Re: Overlay Graphs

sophiaw · Aug 18, 2015 05:56 PM

Hi,

I've run into a problem when stacking columns. My data can run to hundreds of columns. I often stack columns together so I can use a distribution graph on the single stacked column. But sometimes my data table has millions of rows, and JMP starts to crash when I stack columns. Is there a way to overlay a bunch of distribution graphs into one graph without having to stack all the columns together? That is, graph a ton of columns and overlay all those graphs into one?

Thank you for any help!

ian_jmp · Aug 21, 2015 06:30 AM

Within the hardware constraints of your machine, generally JMP is good at this kind of thing. I assume that (for some reason) it's not possible to import the data in a single column.

You can experiment with the code below. On my machine, this takes eight seconds, but more than half of that time is spent building the table to start from, and evaluating the formulae therein. Setting nr = 1,000,000 gives a run time of about one minute.

NamesDefaultToHere(1);
// Number of columns and rows
nc = 130;
nr = 100000;
// Make a test table, dt1, to work with
dt1 = NewTable("Test", NewColumn("Column 1", Numeric, Continuous, Formula(RandomNormal())),AddRows(nr));
For(i=2, i<=nc, i++, dt1 << NewColumn("Column "||Char(i), Numeric, Continuous, Formula(RandomNormal())));
// Get the data from the table into a matrix
m = dt1 << getAsMatrix;
// Close dt1 to recover some memory
Close(dt1, NoSave);
// Reshape into a column vector
m = Shape(m, NRow(m)*NCol(m), 1);
// Make a new table
dt2 = NewTable("Test in a Single Column", NewColumn("All", Numeric, Continuous, Values(m)));
// Recover memory
m = [];
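
If only some of the columns are wanted, the same matrix approach can be limited to a column list and followed by an ordinary Distribution call. This is a minimal sketch rather than a tested recipe: it uses the Big Class sample table purely for illustration, and it assumes that Get As Matrix accepts a list of column names (worth confirming in the Scripting Index for your JMP version). The names v and dtAll are made up for this example.

```jsl
Names Default To Here( 1 );
// Big Class is used only for illustration
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
// Assumption: Get As Matrix accepts a list of column names
m = dt << Get As Matrix( {"height", "weight"} );
// Flatten to one column vector; value order does not matter for a histogram
v = Shape( m, N Row( m ) * N Col( m ), 1 );
// Release the matrix before building the new table
m = [];
// One-column table holding every value, then a normal Distribution report
dtAll = New Table( "All Values", New Column( "All", Numeric, Continuous, Values( v ) ) );
dtAll << Distribution( Continuous Distribution( Column( :All ) ) );
```

The resulting report is the same single histogram you would get from stacking, with the values on the axis and counts as bar heights, but no label column is ever created.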


Craige_Hales · Oct 18, 2016 5:25 PM

See if Graph Builder will do what you want. Select all the columns and drop them on the Y axis. You'll see a point cloud. Here's Big Class using age, height, weight:

then right-click->add->histogram:

then right-click->points->remove, click DONE

click the titles and press delete (Or add a better title)

then you can use the red triangle to copy the script:


Graph Builder(
  Size( 534, 418 ),
  Show Control Panel( 0 ),
  Variables( Y( :age ), Y( :height, Position( 1 ) ), Y( :weight, Position( 1 ) ) ),
  Elements( Histogram( Y( 1 ), Y( 2 ), Y( 3 ), Legend( 3 ) ) ),
  SendToReport(
    Dispatch( {}, "graph title", TextEditBox, {Set Text( "Students" )} ),
    Dispatch( {}, "X title", TextEditBox, {Set Wrap( 2 )} ),
    Dispatch( {}, "Y title", TextEditBox, {Set Text( "" )} ),
    Dispatch(
      {},
      "Graph Builder",
      FrameBox,
      {DispatchSeg(
        Hist Seg( "Histogram (age)" ),
        Histogram Color( -4222943 )
      ), DispatchSeg(
        Hist Seg( "Histogram (height)" ),
        Histogram Color( -13977687 )
      ), DispatchSeg(
        Hist Seg( "Histogram (weight)" ),
        Histogram Color( -3780931 )
      )}
    )
  )
)

Craige

sophiaw · Aug 19, 2015 12:54 PM

That is helpful, but it's not quite what I'm looking for. I want it to look exactly like the distribution graph (where you select a column and graph it), except that I want to select multiple columns and get one distribution. I just want one graph with the values along the bottom and the amount of data on the y-axis. Right now I have to stack all the columns I want to graph and then graph the new column (which gives exactly what I want). The only problem is that I have about 130 columns to graph together, and stacking them is unreasonable: my data table usually has hundreds of thousands of rows, and stacking two columns doubles the number of rows, so stacking 130 would crash JMP or take an enormous amount of time. Graph Builder tells me essentially nothing because there is no x-axis in this case, so I don't know how much each bar represents.

Is there a way to overlay the distribution graphs to make one graph of all the columns instead of separating out every column?

Thanks!

Craige_Hales · Oct 18, 2016 5:26 PM

This should stack your data without duplicating too much. It deletes the label column since you don't care which original column the values came from. DropAllOtherColumns may be the part you are looking for.


newdt = Data Table( "Big Class" ) << Stack(
  columns( :height, :weight, :age ),
  Source Label Column( "Label" ),
  Stacked Data Column( "Data" ),
  Drop All Other Columns( 1 ),
  Output Table( "hw" )
);
newdt << deletecolumns( "label" );
newdt << Distribution( Continuous Distribution( Column( :Data ) ) );
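
With 130 columns, typing the column list by hand is impractical. Below is a sketch of building the list programmatically, assuming every numeric column in the table should be stacked. The call to Get Column Names( Numeric ) and the Eval() wrapper are additions for illustration and are not part of the script above; numCols and stacked are made-up names.

```jsl
Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
// Collect all numeric columns instead of listing them by hand
numCols = dt << Get Column Names( Numeric );
// Stack only those columns and drop everything else from the output
stacked = dt << Stack(
    columns( Eval( numCols ) ),
    Source Label Column( "Label" ),
    Stacked Data Column( "Data" ),
    Drop All Other Columns( 1 )
);
// The label column is not needed for a pooled histogram
stacked << Delete Columns( "Label" );
stacked << Distribution( Continuous Distribution( Column( :Data ) ) );
```

If only some of the numeric columns belong in the histogram, the list would need filtering before the Stack call.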

Craige

pmroz · Aug 19, 2015 03:36 PM

The rule of thumb that I've heard about JMP is to have twice as much memory available as your dataset size. So maybe more memory is needed?
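
As a rough worked example of that rule of thumb, the snippet below estimates the raw size of a table like the one in the question. It assumes each numeric value is stored as an 8-byte double and ignores all overhead, and the row count is a hypothetical "millions of rows" case rather than a figure from the thread.

```jsl
nc = 130;        // columns, as in the question
nr = 1000000;    // rows: hypothetical value for illustration
// Assumption: 8 bytes per numeric value, no overhead counted
bytes = nc * nr * 8;
Show( bytes / 10 ^ 9 );        // approximate table size in GB
Show( 2 * bytes / 10 ^ 9 );    // rule-of-thumb memory to have available, in GB
```

Under these assumptions the table is on the order of 1 GB, so about 2 GB should be free. A stacked copy holds the same number of values again, plus a label column unless it is dropped.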

ms · Aug 19, 2015 05:21 PM

I don't think that kind of overlay is possible in the Distribution platform. In my experience, stacking is quite fast. The example code below runs in a few seconds on my laptop. If I increase the number of data points to 10^8, it slows down, but the Distribution platform is the bottleneck there, not the stacking.

On another note, a histogram of millions of data points looks much the same as one of a smaller random subset. So if performance is a problem, you could try downsizing the data set before visualization (or you may simply need more RAM).

dt = New Table("big");
nr = 100000;
nc = 100;
dt << add rows(nr);
For(i = 1, i <= nc, i++,
    dt << New Column("col" || Char(i), set each value(Random Normal()))
);
dt_stacked = dt << Stack(
    columns(dt << get column names),
    Source Label Column("Label"),
    Stacked Data Column("Data"),
    Drop All Other Columns(1)
);
dt_stacked << Distribution(Continuous Distribution(Column(:Data), Outlier Box Plot(0)));
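
The idea of downsizing before visualization could be sketched as follows: take a random subset of rows first, then stack and plot only that subset. This assumes the Subset message accepts a Sampling Rate option, as the Tables > Subset dialog suggests. The 0.5 rate and the names dtSample and dtStacked are illustrative only.

```jsl
Names Default To Here( 1 );
dt = Open( "$SAMPLE_DATA/Big Class.jmp" );
// Keep a random half of the rows; a table with millions of rows would use a much smaller rate
dtSample = dt << Subset( Sampling Rate( 0.5 ), Selected columns only( 0 ) );
// Stack and plot the sample exactly as one would the full table
dtStacked = dtSample << Stack(
    columns( :height, :weight ),
    Source Label Column( "Label" ),
    Stacked Data Column( "Data" ),
    Drop All Other Columns( 1 )
);
dtStacked << Distribution( Continuous Distribution( Column( :Data ) ) );
```

Because the rows are chosen at random, the histogram shape should be close to that of the full data, though sparse tails will be represented less reliably.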

Peter_Bartell · Oct 18, 2016 5:25 PM

sophiaw: In JMP version 12 you can create the visualization below (BigClass) Height by Sex (overlay).

I think this is the visualization you are after? Unfortunately, you have to stack the overlay variable, and you say that is problematic from a data table management point of view. How much RAM do you have? Maybe adding more RAM would help with the crashing. All that aside, with as many levels of the overlay variable as you suggest, I wonder how visually appealing your graph will be, let alone whether you could tell which level is which. The histogram overlay is most effective with a relatively small number of levels for the overlay variable.

sophiaw · Aug 20, 2015 03:00 PM

Thank you for all of your suggestions!

Unfortunately, it's not really what I'm looking for. It seems that JMP just can't do what I want: I don't want to have to stack all the columns together, but I do want the distribution graph of all the data.

DropAllOtherColumns is a useful option I did not know about. I'll have to play around with it to see whether stacking my columns with that option improves efficiency.

Also, Peter, I don't care about seeing which value corresponds to which level in the distribution, which is why I want something equivalent to stacking but without the memory-hungry aspect. That distribution graph looks a lot like what I'm after, but unfortunately I don't have JMP version 12. I also have 130 variables. haha

Thank you!
