Hi all,
Been a while since my last post, and in that time I think I have become a little bit more proficient at JMP-ing. Shamelessly, using the booming AI-LLMs to self-teach JSL has been working a treat. However, I am running into an issue that has been plaguing me for a very long time with JMP.
As the title states, it stems from concatenation of large tables without duplicating values. I am on JMP 18. I have attached two test data tables (with obviously a very limited and scrubbed data set, the real ones includes >100,000 rows, >30 columns). These should probably be attached to a project, oops.
Let me explain the background (lots of context for ultimately a very direct question, apologies):
I have a reasonably developed workflow for regular data analysis. I have a sample tracking table, with sample IDs and their associated information. I have a general data table - I populate this with information from my tests. The two are linked using a virtual join, with sample ID as the key. In the sample info table, each Sample ID is unique. In the cycle data table each Sample ID has multiple rows of data, each with (ideally) a unique Cycle ID. Each unique combination of Sample ID and Cycle ID has associated data (various performance metrics).
A test has multiple cycles - in fact, a test is live and the cycle number and its associated data is constantly updated. Periodically, I export the cycle data for each sample, importing it into JMP and then concatenate the two tables - the latest snapshot, and the bulk data table. This process creates many many duplicates. For example, let's take sample A. A in the bulk data table already has Cycle ID 1 -> 1000. In the latest snapshot, A will have data for Cycle ID 1 -> 1200. I concatenate my tables and the bulk data table will have, for sample A, Cycle ID 1 -> 1000 *2, and 1001 -> 1200 *1.
For my personal workflow, this is not ideal, but it's fine. I select all the columns, select all duplicate rows and delete them. Not optimal, but whatever.
The issue actually stems from what I've been working on recently. The way I construct my bulk data, and the combination with virtual joins is a very effective strategy for data filtration. It has been noticed by my team, and through some effort I have persuaded people to adopt similar strategies. However, I am the most advanced JMP user in our group. The others do not have the same level of time or commitment to learning to deal with JMP/JSL intricacies. This is fine, it is a responsibility I welcome.
But. I want them to have access to my ability to process data.
I have come up with a solution. The idea is that I will maintain the bulk data table as an external table. I will update it using an admin-esque project. I have written scripts that auto-initialise on their projects, defining functions to automatically refresh a local embedded copy of the master file into their projects. These work fine - I can successfully call on a snapshot of the master data table in other people's projects! Great, I'm very happy with that. The issue is that the most logical approach is to not actually allow them to work on that snapshot. The script I have written it will delete that table every time it refreshes, and generate a new identical table with the same name in it's place. I want to keep that very much the case - it will stop users from having 30 copies of the master table at various snapshot moments with different names. The next step is then to update the user's own working data table with the new data from the master copy.
I'm trying to write a script to achieve that. I need it to update the working table while preserving whatever it is the users have added themselves.
I have tried to use the Update () function, but as far as I can work out that only replaces existing data with new data. I cannot for the life of me figure out a way to have that add new rows. Join adds new rows but also creates a new table. Neither of the solutions work for me. The only way I can figure out is to concatenate. But I don't want the users to deal with duplicate rows. The duplicates can also be quite serious - the data tables are hundreds of thousands of rows, and there could be as many as 50,000 duplicates at a time.
My approach right now (JSL included) is to create temporary columns in the master and the working table. These concatenate the sample ID || Cycle ID, creating a unique single column key, let's call it keyKey. I then want to create a second column in the master table. I want this column to compare the keyKey in table 1 and table 2, let's call it the Selector column. If keyKey exists in both tables, I want the value in the Selector column to be 0, if it only exists in the master, I want the value to be 1. I then want the script to select all rows with a value of 1, create a temporary subset table based on that selection, then concatenate that subset to the working table. Boom. Duplicate free concatenation. BUT. I for the life of me cannot figure out how how to compare keyKey between two tables and use that comparison to generate a column value.
I am desperately trying to avoid using For Each Row () loops. I don't want that memory load for data tables over 100,000 row size.
Names Default To Here(1);
// --- Configuration ---
workingTableName = "Working_Data_Test";
readOnlyMasterName = "Master_Data_Test";
keyColA = "Cell ID";
keyColB = "Cycle ID";
// --- Get Handles & Perform Error Checks ---
dtWorking = Data Table( workingTableName );
dtMaster = Data Table( readOnlyMasterName );
If( Is Empty( dtWorking ) | Is Empty( dtMaster ), Throw("Ensure both tables are open.") );
// --- STEP 1: Update existing rows---
//Print( "Step 1: Updating existing rows..." );
//dtWorking << Update(
// With( dtMaster ),
// Match Columns( :Name(keyCols[1]) = :Name(keyCols[1]), :Name(keyCols[2]) = :Name(keyCols[2]) )
//);
// --- STEP 2: Find new rows using a temporary concatenated key ---
Print( "Step 2: Finding new rows using explicit For Each Row loop..." );
// Add the blank temporary key columns to both tables first.
dtWorking << New Column( "~temp_key~", Character );
dtMaster << New Column( "~temp_key~", Character );
// Populate the key for the Working Table
For Each Row( dtWorking,
:Name("~temp_key~") = :Name("Cell ID") || "|" || Char(:Name("Cycle ID"));
);
// Populate the key for the Master Table
For Each Row( dtMaster,
:Name("~temp_key~") = :Name("Cell ID") || "|" || Char(:Name("Cycle ID"));
);
// Create a fast lookup list of all temporary keys in the working table.
workingKeys = Associative Array( dtWorking:Name("~temp_key~") );
//Add a new "flag" column to the master table.
dtMaster << New Column( "~is_match~", Numeric
//Set its value using a formula. The formula returns 0 if the key is new, 1 if it's a match.
Column( dtMaster, "~is_match~" ) << Set Formula(
If( Is Missing( workingKeys[ :Name("~temp_key~") ] ),
0, // Not found in workingKeys -> It's a new row
1 // Found in workingKeys -> It's an existing row
)
);
// --- STEP 3: Create a new temporary table from the selected rows ---
If( N Rows( dtMaster << Get Selected Rows ) > 0,
Print( "Found " || Char( N Rows( dtMaster << Get Selected Rows ) ) || " new rows. Creating temporary table..." );
dtTempNewRows = dtMaster << Subset(
Selected Rows( 1 ),
Output Table Name( "Temp New Rows" )
);
dtTempNewRows << Show Window(1);
,
Print("No new rows found to add.");
);
// --- STEP 4: Clean up temporary columns ---
//Try( dtWorking << Delete Columns( "~temp_key~" ) );
//Try( dtMaster << Delete Columns( "~temp_key~" ) );
//dtMaster << Clear Select;
Print( "Test complete." );
Also if someone could help me figure out how to use string variables for column name identification (Step 1, commented out because I also couldn't figure out how to fix that), that would be great.
Anyway, thanks for reading this behemoth of a post,
Late