Discussions

SophieCuvillier · Sep 24, 2025 01:16 AM

Hello,

We have a JMP table with millions of rows (image data), and each operation is costly and time-consuming. However, we need to add a new column that will be calculated from the other columns.

What is the fastest way to do this with large data sets? Currently, we are using Set Formula on the column, but it takes a long time to apply and, what's more, the formula is bound to be changed several times during the analysis. Should we use ‘get values’ vectors or something like that? I have also seen the ‘for each rows’ function. I would appreciate any feedback and tips you may have.

Craige_Hales · Sep 25, 2025 09:41 AM

@txnelson is right, a matrix might improve the speed. Make sure you are using <<runformulas first; that might be fast enough.

dt = New Table( "Reprex",
    Add Rows( 1000000 ),
    New Column( "X", Numeric, Continuous, Formula( Random Integer( 0, 200 ) ) ),
    New Column( "Y", Numeric, Continuous, Formula( Random Integer( 0, 200 ) ) )
);

R_debut_X = 10;
largeur   = 50;
ecart_X   = 5;
debut_Y   = 20;
hauteur   = 30;
ecart_Y   = 5;
start = hptime();
// Slow way: row formula
dt << New Column( "Zone_slow", Character,
    Formula(
        If( R_debut_X < :X <= R_debut_X + largeur & debut_Y < :Y <= debut_Y + hauteur, "Zone 1",
            R_debut_X + largeur + ecart_X < :X <= R_debut_X + 2*largeur + ecart_X & debut_Y < :Y <= debut_Y + hauteur, "Zone 2",
            "Other"
        )
    )
);
dt<<runformulas; // <<<<<<<<<<<< important <<<<<<<<
stop=hptime();
show((stop-start)/1e6);// 1.3 sec

start = hptime();
dmatx = dt[0,{x}];
dmaty = dt[0,{y}];
cmatx1 = R_debut_X < dmatx <= R_debut_X + largeur;
cmaty1 = debut_Y < dmaty <= debut_Y + hauteur;
cmatxy1 = cmatx1 & cmaty1;
cmatx2 = R_debut_X + largeur + ecart_X < dmatx <= R_debut_X + 2*largeur + ecart_X;
cmaty2 = debut_Y < dmaty <= debut_Y + hauteur;
cmatxy2 = cmatx2 & cmaty2;
result = cmatxy1 + cmatxy2*2;
stop=hptime();
dt<<newcolumn("Zone_fast", values(result));
show((stop-start)/1e6);// .25 sec

The result value is using the 0 or 1 truth value in cmatxy1,2 to build the 0, 1, or 2 value for other, zone1, zone2. You could probably collapse some of the matrix statements to make it still faster if that is important.

Craige

View solution in original post

txnelson · Sep 24, 2025 2:13 AM

Many times, moving away from conventional row by row JSL in favor of using matrix operations will result in faster processing. However, it would be helpful if you could provide an example of the actual formulas you are using.

Jim

SophieCuvillier · Sep 25, 2025 07:28 AM

Hello,

Thank you for your response.

To add more context: Below is a script with an example of the formula I use. Based on the X/Y coordinates on the image, I divide it into zones, 20 in total on my real date. If the classification of the zones is not correct, I will modify the formula, which is why I say that we will frequently use set formulas.

dt = New Table( "Reprex",
    Add Rows( 1000000 ),
    New Column( "X", Numeric, Continuous, Formula( Random Integer( 0, 200 ) ) ),
    New Column( "Y", Numeric, Continuous, Formula( Random Integer( 0, 200 ) ) )
);

R_debut_X = 10;
largeur   = 50;
ecart_X   = 5;
debut_Y   = 20;
hauteur   = 30;
ecart_Y   = 5;

// Slow way: row formula
dt << New Column( "Zone_slow", Character,
    Formula(
        If( R_debut_X < :X <= R_debut_X + largeur & debut_Y < :Y <= debut_Y + hauteur, "Zone 1",
            R_debut_X + largeur + ecart_X < :X <= R_debut_X + 2*largeur + ecart_X & debut_Y < :Y <= debut_Y + hauteur, "Zone 2",
            "Other"
        )
    )
);

Craige_Hales · Sep 25, 2025 09:41 AM

@txnelson is right, a matrix might improve the speed. Make sure you are using <<runformulas first; that might be fast enough.

dt = New Table( "Reprex",
    Add Rows( 1000000 ),
    New Column( "X", Numeric, Continuous, Formula( Random Integer( 0, 200 ) ) ),
    New Column( "Y", Numeric, Continuous, Formula( Random Integer( 0, 200 ) ) )
);

R_debut_X = 10;
largeur   = 50;
ecart_X   = 5;
debut_Y   = 20;
hauteur   = 30;
ecart_Y   = 5;
start = hptime();
// Slow way: row formula
dt << New Column( "Zone_slow", Character,
    Formula(
        If( R_debut_X < :X <= R_debut_X + largeur & debut_Y < :Y <= debut_Y + hauteur, "Zone 1",
            R_debut_X + largeur + ecart_X < :X <= R_debut_X + 2*largeur + ecart_X & debut_Y < :Y <= debut_Y + hauteur, "Zone 2",
            "Other"
        )
    )
);
dt<<runformulas; // <<<<<<<<<<<< important <<<<<<<<
stop=hptime();
show((stop-start)/1e6);// 1.3 sec

start = hptime();
dmatx = dt[0,{x}];
dmaty = dt[0,{y}];
cmatx1 = R_debut_X < dmatx <= R_debut_X + largeur;
cmaty1 = debut_Y < dmaty <= debut_Y + hauteur;
cmatxy1 = cmatx1 & cmaty1;
cmatx2 = R_debut_X + largeur + ecart_X < dmatx <= R_debut_X + 2*largeur + ecart_X;
cmaty2 = debut_Y < dmaty <= debut_Y + hauteur;
cmatxy2 = cmatx2 & cmaty2;
result = cmatxy1 + cmatxy2*2;
stop=hptime();
dt<<newcolumn("Zone_fast", values(result));
show((stop-start)/1e6);// .25 sec

The result value is using the 0 or 1 truth value in cmatxy1,2 to build the 0, 1, or 2 value for other, zone1, zone2. You could probably collapse some of the matrix statements to make it still faster if that is important.

Craige

SophieCuvillier · Sep 26, 2025 08:02 AM

Thank you very much, it is mush faster now !

Discussions

Quickest way to iterate over a data table

Re: Quickest way to iterate over a data table

Re: Quickest way to iterate over a data table

Re: Quickest way to iterate over a data table

Re: Quickest way to iterate over a data table

Re: Quickest way to iterate over a data table

Recommended Articles