Discussions

PerenniallyLate · Jun 26, 2025 6:47 AM

Hi all,

Been a while since my last post, and in that time I think I have become a little bit more proficient at JMP-ing. Shamelessly, using the booming AI-LLMs to self-teach JSL has been working a treat. However, I am running into an issue that has been plaguing me for a very long time with JMP.

As the title states, it stems from concatenation of large tables without duplicating values. I am on JMP 18. I have attached two test data tables (with obviously a very limited and scrubbed data set, the real ones includes >100,000 rows, >30 columns). These should probably be attached to a project, oops.

Let me explain the background (lots of context for ultimately a very direct question, apologies):

I have a reasonably developed workflow for regular data analysis. I have a sample tracking table, with sample IDs and their associated information. I have a general data table - I populate this with information from my tests. The two are linked using a virtual join, with sample ID as the key. In the sample info table, each Sample ID is unique. In the cycle data table each Sample ID has multiple rows of data, each with (ideally) a unique Cycle ID. Each unique combination of Sample ID and Cycle ID has associated data (various performance metrics).

A test has multiple cycles - in fact, a test is live and the cycle number and its associated data is constantly updated. Periodically, I export the cycle data for each sample, importing it into JMP and then concatenate the two tables - the latest snapshot, and the bulk data table. This process creates many many duplicates. For example, let's take sample A. A in the bulk data table already has Cycle ID 1 -> 1000. In the latest snapshot, A will have data for Cycle ID 1 -> 1200. I concatenate my tables and the bulk data table will have, for sample A, Cycle ID 1 -> 1000 *2, and 1001 -> 1200 *1.

For my personal workflow, this is not ideal, but it's fine. I select all the columns, select all duplicate rows and delete them. Not optimal, but whatever.

The issue actually stems from what I've been working on recently. The way I construct my bulk data, and the combination with virtual joins is a very effective strategy for data filtration. It has been noticed by my team, and through some effort I have persuaded people to adopt similar strategies. However, I am the most advanced JMP user in our group. The others do not have the same level of time or commitment to learning to deal with JMP/JSL intricacies. This is fine, it is a responsibility I welcome.

But. I want them to have access to my ability to process data.

I have come up with a solution. The idea is that I will maintain the bulk data table as an external table. I will update it using an admin-esque project. I have written scripts that auto-initialise on their projects, defining functions to automatically refresh a local embedded copy of the master file into their projects. These work fine - I can successfully call on a snapshot of the master data table in other people's projects! Great, I'm very happy with that. The issue is that the most logical approach is to not actually allow them to work on that snapshot. The script I have written it will delete that table every time it refreshes, and generate a new identical table with the same name in it's place. I want to keep that very much the case - it will stop users from having 30 copies of the master table at various snapshot moments with different names. The next step is then to update the user's own working data table with the new data from the master copy.

I'm trying to write a script to achieve that. I need it to update the working table while preserving whatever it is the users have added themselves.

I have tried to use the Update () function, but as far as I can work out that only replaces existing data with new data. I cannot for the life of me figure out a way to have that add new rows. Join adds new rows but also creates a new table. Neither of the solutions work for me. The only way I can figure out is to concatenate. But I don't want the users to deal with duplicate rows. The duplicates can also be quite serious - the data tables are hundreds of thousands of rows, and there could be as many as 50,000 duplicates at a time.

My approach right now (JSL included) is to create temporary columns in the master and the working table. These concatenate the sample ID || Cycle ID, creating a unique single column key, let's call it keyKey. I then want to create a second column in the master table. I want this column to compare the keyKey in table 1 and table 2, let's call it the Selector column. If keyKey exists in both tables, I want the value in the Selector column to be 0, if it only exists in the master, I want the value to be 1. I then want the script to select all rows with a value of 1, create a temporary subset table based on that selection, then concatenate that subset to the working table. Boom. Duplicate free concatenation. BUT. I for the life of me cannot figure out how how to compare keyKey between two tables and use that comparison to generate a column value.
I am desperately trying to avoid using For Each Row () loops. I don't want that memory load for data tables over 100,000 row size.

Names Default To Here(1);
// --- Configuration ---
workingTableName = "Working_Data_Test";
readOnlyMasterName = "Master_Data_Test";
keyColA = "Cell ID";
keyColB = "Cycle ID";

// --- Get Handles & Perform Error Checks ---
dtWorking = Data Table( workingTableName );
dtMaster = Data Table( readOnlyMasterName );
If( Is Empty( dtWorking ) | Is Empty( dtMaster ), Throw("Ensure both tables are open.") );

// --- STEP 1: Update existing rows---
//Print( "Step 1: Updating existing rows..." );
//dtWorking << Update(
// With( dtMaster ),
// Match Columns( :Name(keyCols[1]) = :Name(keyCols[1]), :Name(keyCols[2]) = :Name(keyCols[2]) )
//);

// --- STEP 2: Find new rows using a temporary concatenated key ---
Print( "Step 2: Finding new rows using explicit For Each Row loop..." );

// Add the blank temporary key columns to both tables first.
dtWorking << New Column( "~temp_key~", Character );
dtMaster << New Column( "~temp_key~", Character );
// Populate the key for the Working Table
For Each Row( dtWorking,
:Name("~temp_key~") = :Name("Cell ID") || "|" || Char(:Name("Cycle ID"));
);
// Populate the key for the Master Table
For Each Row( dtMaster,
:Name("~temp_key~") = :Name("Cell ID") || "|" || Char(:Name("Cycle ID"));
);

// Create a fast lookup list of all temporary keys in the working table.
workingKeys = Associative Array( dtWorking:Name("~temp_key~") );

//Add a new "flag" column to the master table.
dtMaster << New Column( "~is_match~", Numeric

//Set its value using a formula. The formula returns 0 if the key is new, 1 if it's a match.
Column( dtMaster, "~is_match~" ) << Set Formula(
If( Is Missing( workingKeys[ :Name("~temp_key~") ] ),
0, // Not found in workingKeys -> It's a new row
1 // Found in workingKeys -> It's an existing row
)
);

// --- STEP 3: Create a new temporary table from the selected rows ---
If( N Rows( dtMaster << Get Selected Rows ) > 0,
Print( "Found " || Char( N Rows( dtMaster << Get Selected Rows ) ) || " new rows. Creating temporary table..." );
dtTempNewRows = dtMaster << Subset(
Selected Rows( 1 ),
Output Table Name( "Temp New Rows" )
);

dtTempNewRows << Show Window(1);
,
Print("No new rows found to add.");
);

// --- STEP 4: Clean up temporary columns ---

//Try( dtWorking << Delete Columns( "~temp_key~" ) );
//Try( dtMaster << Delete Columns( "~temp_key~" ) );
//dtMaster << Clear Select;


Print( "Test complete." );

Also if someone could help me figure out how to use string variables for column name identification (Step 1, commented out because I also couldn't figure out how to fix that), that would be great.

Anyway, thanks for reading this behemoth of a post,

Late

PerenniallyLate · Jun 30, 2025 06:38 AM

No, you're right. It was premature optimisation, and yeah the concatenate and delete duplicates does work with our data set, despite it creating several hundred thousand duplicates to delete. It just seems like such an ugly way to achieve what in my opinion is quite a basic function.

While I get the idea behind the join approach, surely it is still blocked by Update() not adding new rows? Unless I am misunderstanding something, could you clarify?

Final scrubbed code below for anyone curious:

// ** Working table update script **
// Includes lockfile check to prevent reading a mid-update master table.
//V1.1 Includes ability to deal sensibly with changing timestamps while preserving deletion of duplicates
Names Default To Here(1);

// --- Configuration ---
workingTableName   = "Test_WorkingTable"; //Change this to the EXACT name of the table you want to update
keyColA = "Sample ID";
keyColB = "Test ID";

// --- Path Definitions ---
// Build the path to the master file and lock file on the shared drive.
homeDir = Get Environment Variable( "USERPROFILE" );
If( Is Empty( homeDir ), homeDir = Get EnvironmentVariable( "HOME" ) );
sharePointRelativePath = "\OUR\INTERNAL\FILEPATH\";
masterFilePath = homeDir || sharePointRelativePath || "Data_File_Name.jmp";
lockFile       = homeDir || sharePointRelativePath || "lockfile.txt";

// --- 1. Check for Lock File ---
// If the lockfile exists, stop all execution immediately.
If( File Exists( lockFile ),
    Print( "UPDATE FAILED: The master data file is currently locked by an admin for updates." );
    Print( "Please try again later." );
    Stop(); // Halt the script
);


// --- 2. Get Handle to User's Working Table ---
dtWorking = Data Table( workingTableName );
If( Is Empty( dtWorking ),
    Throw( "UPDATE ERROR: Your working table '" || workingTableName || "' could not be found. Please ensure it is open." )
);


// --- 3. Open a Fresh, Unlinked Copy of the Master Data ---
dtMaster = Open( masterFilePath, invisible );
If( Is Empty( dtMaster ), Throw("Could not open master data file from shared drive.") );

// ---3.5. Deal with the timestamp table variable. It will return an empty string if it doesn't exist, and if it does exist, delete the variable.---
// A. Read the timestamp from the master table into a script variable.
newTimestamp = dtMaster << Get Table Variable( "Last Updated" );

// If the variable was found, print it to the log.
If( !Is Empty( newTimestamp ),
    Print( "--- Master data was last updated on: " || newTimestamp || " ---" );
,
    Print( "--- This master data does not contain a 'Last Updated' timestamp. ---" );
);

// B. Delete the variable from the master table so it's a clean data source.
// We wrap in Try() in case an old master file doesn't have the variable.
Try( dtMaster << Delete Table Variable( "Last Updated" ) );

// --- 4. Concatenate the Clean Master to the Working Table ---
Print("Step 1: Appending master data to working data...");
dtWorking << Concatenate(
    dtMaster,
    Append to First Table
);
Close( dtMaster, NoSave ); // The fresh master copy is no longer needed.


// --- 5. Find and Delete Duplicate Rows ---
Print("Step 2: Identifying and removing duplicate rows...");

dtWorking << Select Duplicate Rows(
    Match( As Column(keyColA), As Column(keyColB) )
);

nDuplicates = N Rows( dtWorking << Get Selected Rows );
If( nDuplicates > 0,
    Print("Found and deleting " || Char(nDuplicates) || " duplicate rows.");
    dtWorking << Delete Rows;
);

dtWorking << Clear Select;
// 6. If a new timestamp was found, set it on the working table.
If( !Is Empty( newTimestamp ),
    dtWorking << Set Table Variable( "Last Updated", newTimestamp );
    Print( "Working table timestamp updated to: " || newTimestamp );
);

Print( "Process complete. Final table has " || Char(N Rows(dtWorking)) || " rows." );
Window( workingTableName ) << Bring Window to Front;

View solution in original post

jthi · Jun 26, 2025 10:13 AM

Could you use Join -> create new table -> overwrite old table with the newly created joined table if you wish to keep the original table the same?

-Jarmo

mmarchandFSLR · Jun 26, 2025 7:29 AM

If you can't follow Jarmo's suggestion, you could always concatenate and then delete duplicates.

Names Default To Here( 1 );
dt = Data Table( "Master_Data_Test" );
dt2 = Data Table( "Working_Data_Test" );
dt << Concatenate( dt2, Append to first table );
dt << Select Duplicate Rows( Match() );
If( Length( dt << Get Selected Rows ) > 0,  //In case no duplicate rows exist; don't want to delete everything!
	dt << Delete Rows()
);

PerenniallyLate · Jun 26, 2025 10:56 AM

@jthi I want to avoid deleting the user's working table. Hence concatenation being the only viable choice as I understand it so far.

@mmarchandFSLR I suppose that's looking increasingly like the simplest answer even if I'm hesitant about potential for many many duplicates slowing things down. I'm just shocked that it's appearing impossible in JMP to concatenate non-duplicate data, it feels like such a massive function to be missing...

jthi · Jun 26, 2025 12:36 PM

That is why I suggested doing join and then using that joined table to "update" the original table.

JMP doesn't currently have proper upsert (update + insert) functionality, but you can write your own. IT might be slightly complicated to create efficient one though... Also, have you tried if you have any issues with the duplicate selection? Or would that just be premature optimization to try and avoid it?

-Jarmo

PerenniallyLate · Jun 30, 2025 06:38 AM

No, you're right. It was premature optimisation, and yeah the concatenate and delete duplicates does work with our data set, despite it creating several hundred thousand duplicates to delete. It just seems like such an ugly way to achieve what in my opinion is quite a basic function.

While I get the idea behind the join approach, surely it is still blocked by Update() not adding new rows? Unless I am misunderstanding something, could you clarify?

Final scrubbed code below for anyone curious:

// ** Working table update script **
// Includes lockfile check to prevent reading a mid-update master table.
//V1.1 Includes ability to deal sensibly with changing timestamps while preserving deletion of duplicates
Names Default To Here(1);

// --- Configuration ---
workingTableName   = "Test_WorkingTable"; //Change this to the EXACT name of the table you want to update
keyColA = "Sample ID";
keyColB = "Test ID";

// --- Path Definitions ---
// Build the path to the master file and lock file on the shared drive.
homeDir = Get Environment Variable( "USERPROFILE" );
If( Is Empty( homeDir ), homeDir = Get EnvironmentVariable( "HOME" ) );
sharePointRelativePath = "\OUR\INTERNAL\FILEPATH\";
masterFilePath = homeDir || sharePointRelativePath || "Data_File_Name.jmp";
lockFile       = homeDir || sharePointRelativePath || "lockfile.txt";

// --- 1. Check for Lock File ---
// If the lockfile exists, stop all execution immediately.
If( File Exists( lockFile ),
    Print( "UPDATE FAILED: The master data file is currently locked by an admin for updates." );
    Print( "Please try again later." );
    Stop(); // Halt the script
);


// --- 2. Get Handle to User's Working Table ---
dtWorking = Data Table( workingTableName );
If( Is Empty( dtWorking ),
    Throw( "UPDATE ERROR: Your working table '" || workingTableName || "' could not be found. Please ensure it is open." )
);


// --- 3. Open a Fresh, Unlinked Copy of the Master Data ---
dtMaster = Open( masterFilePath, invisible );
If( Is Empty( dtMaster ), Throw("Could not open master data file from shared drive.") );

// ---3.5. Deal with the timestamp table variable. It will return an empty string if it doesn't exist, and if it does exist, delete the variable.---
// A. Read the timestamp from the master table into a script variable.
newTimestamp = dtMaster << Get Table Variable( "Last Updated" );

// If the variable was found, print it to the log.
If( !Is Empty( newTimestamp ),
    Print( "--- Master data was last updated on: " || newTimestamp || " ---" );
,
    Print( "--- This master data does not contain a 'Last Updated' timestamp. ---" );
);

// B. Delete the variable from the master table so it's a clean data source.
// We wrap in Try() in case an old master file doesn't have the variable.
Try( dtMaster << Delete Table Variable( "Last Updated" ) );

// --- 4. Concatenate the Clean Master to the Working Table ---
Print("Step 1: Appending master data to working data...");
dtWorking << Concatenate(
    dtMaster,
    Append to First Table
);
Close( dtMaster, NoSave ); // The fresh master copy is no longer needed.


// --- 5. Find and Delete Duplicate Rows ---
Print("Step 2: Identifying and removing duplicate rows...");

dtWorking << Select Duplicate Rows(
    Match( As Column(keyColA), As Column(keyColB) )
);

nDuplicates = N Rows( dtWorking << Get Selected Rows );
If( nDuplicates > 0,
    Print("Found and deleting " || Char(nDuplicates) || " duplicate rows.");
    dtWorking << Delete Rows;
);

dtWorking << Clear Select;
// 6. If a new timestamp was found, set it on the working table.
If( !Is Empty( newTimestamp ),
    dtWorking << Set Table Variable( "Last Updated", newTimestamp );
    Print( "Working table timestamp updated to: " || newTimestamp );
);

Print( "Process complete. Final table has " || Char(N Rows(dtWorking)) || " rows." );
Window( workingTableName ) << Bring Window to Front;

jthi · Jun 30, 2025 11:21 AM

It is not blocked by anything. My idea was to use Join to create new table with all the rows and then use that to overwrite all the values in the old table. This can be done for example by using data table sub-scripting

Names Default To Here(1);

dt_master = Open("$DOWNLOADS/Master_Data_Test.jmp");
dt_working = Open("$DOWNLOADS/Working_Data_Test.jmp");

dt_new = dt_master << Join(
	With(dt_working),
	Merge Same Name Columns,
	Match Flag(0),
	By Matching Columns(:Cycle ID = :Cycle ID, :Cell ID = :Cell ID),
	Drop multiples(0, 0),
	Include Nonmatches(1, 1),
	Preserve main table order(1),
	Output Table("updated master"),
	private
);

dt_master << Delete Rows(1::NRows(dt_master));
dt_master << Add Rows(N Rows(dt_new));
dt_master[0,0] = dt_new[0,0];
Close(dt_new, no save);

You can also possibly do something like this but be careful with how JMP can possibly mess with table scripts/variables

Names Default To Here(1);

dt_master = Open("$DOWNLOADS/Master_Data_Test.jmp");
dt_working = Open("$DOWNLOADS/Working_Data_Test.jmp");

// JMP can mess with your Source script if you have one
has_update_script = Contains(dt_master << Get Table Script Names, "Update") > 0;
If(has_update_script,
	dt << Rename Table Script("Update", "Original_Update");
);
// possibly unnecessary Update
dt_master << Update(
	With(dt_working),
	Match Columns(:Cycle ID = :Cycle ID, :Cell ID = :Cell ID)
);

dt_master << Delete Scripts("Source");
If(has_update_script,
	Try(dt_master << Rename Table Script("Original_Update", "Source"));
);

dt_joined = dt_master << Join(
	With(dt_working),
	Merge Same Name Columns,
	Match Flag(1),
	Copy second table formula(0),
	By Matching Columns(:Cycle ID = :Cycle ID, :Cell ID = :Cell ID),
	Drop multiples(0, 0),
	Include Nonmatches(0, 1),
	Preserve main table order(1),
	Output Table("joined"),
	private
);

old_rows = Where(dt_joined, :Match Flag == 3); // 3 = "Both"
dt_joined << Delete Columns(1);
dt_joined << Delete Rows(old_rows);

dt_master << Concatenate(
	dt_joined,
	"Append to first table"
);
Close(dt_joined, no save);
wait(0);

-Jarmo

Discussions

Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Re: Effective concatenation without duplicates using scripting by matching data in two columns in different tables

Recommended Articles