Try the Materials Informatics Toolkit, which is designed to easily handle SMILES data. This and other helpful add-ins are available in the JMP® Marketplace
Looking to generate two columns based on some pattern from other columns
Created:
Dec 16, 2021 03:56 PM
| Last Modified: Jun 9, 2023 11:07 AM(3244 views)
Hi All,
I am looking to generate two new columns based on patterns in other two columns ( "Required column1" and "Required Column2"). Please see the attached image -
I gave a go with this. I try to avoid using Lag() whenever possible and replace it with Col functions. The most difficult formula seems to be Required Column 1 and I couldn't quickly figure out good formula without using Lag() (there is most likely a solution with Col Cumulative Sum somehow). One idea would be to fill places between Start and End with "Ongoing" or something similar, then you could maybe use Col Cumulative Sum easier, but then the most difficult formula would most likely just move to Start and End.
Names Default To Here(1);
dt = Current Data Table();
//Numbering of Occurrence
dt << New Column("NumberOfOcc", Numeric, Nominal, Formula(
If(:Sensor == "Primary",
Col Cumulative Sum(:Sensor == "Primary", :ID, :"Sub-set"n, :Occurrence),
.
);
));
//Start and End
dt << New column("StartEnd", Character, Nominal, Formula(
If(:NumberOfOcc == 1,
"End"
, Col Max(:NumberOfOcc, :ID,:"Sub-set"n, :Occurrence) == :NumberOfOcc,
"Start"
, //else
""
)
));
//Required Column 1, this could most likely be made somehow with Col Cumulative Sum
dt << New column("Req1", Numeric, Ordinal, Formula(
If(Row() == 1,
val = .
);
If(:StartEnd == "Start",
val = 1
, IsMissing(:StartEnd) | :StartEnd == "End",
val += 1;
);
If(:Lag(:StartEnd) == "End" | (Row() == Col Max(Row()) & :StartEnd == "Start") | (:StartEnd == "Start" & Lag(:StartEnd, -1) == "End"), //more missing checks can be added here
val = .,
val;
)
));
//Required Column 2
dt << New column("Req2", Numeric, Nominal, Formula(
If(!IsMissing(:Req1),
Col Cumulative Sum(:StartEnd == "Start",
:ID, :"Sub-set"n
)
,
.
)
));
If you have time, I would suggest testing both ways of solving this, the one @txnelson has provided with more heavy use of Lag and comparisons, and the one I have with Col formulas. Both definitely have their use cases and understanding them will help you in a long run when writing formulas/calculations in JMP. And in some cases when you understand how to combine both, you can solve quite difficult tasks with just formulas.
Also after you get the logic working, I would most likely remove Formulas from the datatable (or replace Formula() with << Set Each value in script), because this will most likely break if order is changed in datatable.
Here is a formula for the column you refer to Required Column 2. When you study the code, you will see that all it is doing is applying the logic rules you have stated that need to be checked when determining the Start and End values. It also contains the setting of the value for the column Required Column 1. To run this formula, simply create the Required Column 1 and Required Column 2. Set Required Column 2 to be a character column. Then open the formula for Required Column 2 and copy the formula below into the column.
It almost worked with few false entries. I believe we must add extra condition on :Occurrence column. That is :Occurrence must change to set :Required Column 2 == "Start" (and may be option - it should be same when :Required Column 2 == "End").
Something like - Dif(:Occurrence) !=0 then only :Required Column 2 == "Start"
For my sample data (95 rows) was fine with your formula but my actual data which has more than 300k rows, it is giving many wrong entries like below - (I think it should be fine once additional requirement on :Occurrence will be added.) This is just one off screenshot of the wrong entry.
Hopefully!! I can modify the formula/code and can add requirement of :Occurrence. on top of your formula.
I was able to modify your code based perfectly for my requirement. But on the 2nd stage troubleshooting I need something like this --
Mark :Sensor = "Primary" (1st) around consecutive change in :Occurrence. In the given screenshot -
required values in between --- Last "Primary" in Occurrence = 2 and First "Primary" in Occurrence = 3 [consecutive change in Occurrence]
Finding first Primary is not that difficult, not sure how to find last primary ? Thinking to somehow invert (descending order) the subset data for Occurrence = 2, find the value and then revert it back. Not able to put in a formula!! Any other way to do that ?
You have modified the code I previously provided to get the results that you provide in your most recent response. Please provide the JSL you are using to get the results you are showing in your sample data table.
I am not able to get the correct (updated requirement) start point. In the Data Table ("Sample Data Updated") and image, I had entered corrected data columns manually.
The formula. I used to get "Required Column 1 Old" and " Required Column 2 Old" is this (It is just like you give me, I added occurrence condition on it) -
i am confused. From what I can see, the script that I wrote seems to work, except that after line 44 in the data table, you have changed the rules, and now want the End to be on the row prior to the next value of Primary. Which is what you want?
Your original formula (I believe, I have not messed it up) worked for most of the cases but also captured few wrong points and missed few on full set of data. In the attached subset of the data (only for one "ID" and one value of "Sub-set") there is a column based on your formula which has few wrong points.
And this was fixed when I put condition on :Occurrence.
Now I am looking for -
1. Last Primary Sensor in occurrence = n (say n = 2), mark it as "Start"
2. First Primary Sensor in Occurrence = n+1 (= 3), and if there is no primary sensor for n = 3, it should stope/mark the value in Occurrence = n+2 == 4. Mark it as "End"
3. And last fill the numbers between "Start" and "End" in a new column.
Alternate way to think of (may be) -- If we can mark - in any single occurrence n (n = 2) First Primary Sensor as END and in last primary sensor as Start and then fill a values between Start and End in a new column.
The issue appears to be with the logic being applied to the JSL variable "other". I do not have time to work on the issue today, but tomorrow afternoon, I should be able to do some investigation.
I did some progress - and identify correctly the start and end point of each set. I used a very simple logic, you can see in the formula column.
But now I am struggling to put numbers between each pair of "Start" and "End" point. I getting issue in stopping the counter at "End". My issue might have the same complicacy but at least now I have clearing marking/tag on my start and end point.