On a regular base I hear from JMP users, that they want to update the data table with the new data rather than creating a completely new data table. This can have a variety of reasons: you may have done some data cleaning or recoding already, want to keep your work and previous analysis saved to the data table, or you just simply need to add new data to the existing data table.
In Part 1 of this two-article series I talked about the first scenario - a vertical structured data in a spreadsheet. This is not too complicated, import the new data and concatenate it to the original one (considering the rows already imported).
Today I want to talk about a second scenario, which is a bit more complicated. I see this case quite often when looking at customer’s data coming from e.g. labs. The horizontal structured data table:
That is first row/first column is the parameter name, in the following columns is the data. In the second row are the units, in the third row the batch number. Then an empty row and after that the same thing again for another parameter. This is easier to read in a spreadsheet but eventually not very easy to analyze. I often see these kinds of data, although it would be much better to train people to enter data in the vertical way than trying to work around with the horizontal structure.
How can you import and update a table like this in JMP? What’s so difficult with that data?
We have to understand the challenges.
For a data analysis this kind of structure is nothing a machine would love, you would need to specifically tell where what data is located and do the appropriate analysis. If new data is added to the right, you need to adjust the analysis’ data ranges appropriately.
Due to these challenges we need to process the data before we come to where we want to be.
Let's start to create the template table we want to update
1. Access the current data from the spreadsheet as it is:
2. This is actually not the table we want to have, so we rename it to “help” and need to remove the empty rows, as preparation for transposing the data, in coding this can look like this:
dt_h1 << Set Name( name_worksheet || "help" ); // Remove rows with only missing values (break lines between measurements) n = Loc( Column( 1 ) << Get Values, "" ); dt_h1 << Select Rows( n ); dt_h1 << Delete Rows;
3. The first column has the actual column names we want to have in the final data. We want to save them for later usage. Then we can transpose the table, provide it the final table name (as the worksheet – as I will refer to the table name to automatically find the corresponding worksheet in the spreadsheet):
// Get names used for column names after transposing the data table NewNames = dt_h1:Column 1 << Get Values; // Transpose Column dt_final = Data Table( name_worksheet || "help" ) << Transpose( columns( 1 :: N Cols( dt_h1 ) ), Label column name( "ColName" ), Output Table( name_worksheet ) );
4. We now need to make some changes to the data table:
The final table should look like this:
Now let’s take care of the update script
Most of what you need has been done already and saved into scripts automatically:
Let’s take a look at the final update script we will put into this template table as a script:
1. Import worksheet with name of the current data table
// Read in Sheet2 of the Excel workbook excelFile = "TransposedExcelWorksheet.xlsx"; // Get the name of the data table and number of existing rows cdt = Current Data Table(); cdtName = cdt << get name(); n_Rows = N Rows(); // Import the data invisible dt_h1 = Open( excelFile, Worksheets( cdtName ), Use for all sheets( 1 ), Concatenate Worksheets( 0 ), Create Concatenation Column( 0 ), Worksheet Settings( 1, Has Column Headers( 0 ), Number of Rows in Headers( 1 ), Headers Start on Row( 1 ), Data Starts on Row( 1 ), Data Starts on Column( 1 ), Data Ends on Row( 0 ), Data Ends on Column( 0 ), Replicated Spanned Rows( 1 ), Replicated Spanned Headers( 0 ), Suppress Hidden Rows( 1 ), Suppress Hidden Columns( 1 ), Suppress Empty Columns( 1 ), Treat as Hierarchy( 0 ), Multiple Series Stack( 0 ), Import Cell Colors( 0 ), Limit Column Detect( 0 ), Column Separator String( "-" ) ),Invisible );
2. Make modifications as described above for the template table, transpose and clean up the data table
dt_h1 << Set Name( cdtName || "help" ); // Remove rows with only missing values (break lines between measurements) n = Loc( Column( 1 ) << Get Values, "" ); dt_h1 << Select Rows( n ); dt_h1 << Delete Rows; // Get names used for column names after transposing the data table NewNames = dt_h1:Column 1 << Get Values; // Transpose Column dt_final = Data Table( cdtName || "help" ) << Transpose( columns( 1 :: N Cols( dt_h1 ) ), Label column name( "ColName" ), Output Table( cdtName ) ); // Clean up data table - delete first row of column names and first column (labels) dt_final << Select Rows( 1 ); dt_final << Delete Rows(); Column( dt_final, "ColName" ) << Set Selected; dt_final << Delete Columns(); // Set the Column Names appropriately based on deleted first row in this table For( i = 1, i <= N Cols( dt_final ), i++, col = Column( i ); col << Set Name( NewNames[i] ); );
3. Remove rows already imported and append rest to the original data table with concatenate:
// remove rows already imported dt_final << Select Rows( 1 :: n_Rows ); dt_final << Delete Rows(); // Append new data to original data table cdt << Concatenate( dt_final, Append to first table ); // Close help data table Close( dt_final ); Close( dt_h1 );
// this last bracket is to close the whole script! );
Adding this script to the template data table will allow you to simply press one button to get the new data from a horizontal structured data in a spreadsheet.
This involved quite a bunch of steps, some easy to do, some more advanced. The nice thing is that you can do it in JMP. However, always remember:
Data preparation is what requires most time in extracting insights from your data!
If you want to work with the data, use a structure tailored for data analysis.
That is most of the time a vertical approach, where each column stands for one parameter. But also a row based approach can be done, but then please in a correct one.
To provide two examples which would ease the analysis:
Of course you can use this and extend it by adding analysis to the data, like updating a control chart or other trend analysis. Using Windows Scheduler you might want to then update your report and send it to our new product JMP Live, so other people (not necessarily licensed JMP users) can access the reports in a web browser with the interactivity of JMP like data filter, column switcher, and so forth.
Please feel free to add your use cases and or approaches to do something similar. I’m happy to take a look at.
All the best,
PS: Three comments to the files:
There are no labels assigned to this post.