I'm having an issue with the splitting of a data table and was wondering if someone knows how to it. My data looks like this:
The data should be Split by ID2 with ID being the group variable. When I am doing it that way I end up with this result:
The Problem is that for IDs 2 and 4 there are multiple values of C in column ID2. As it seems JMP is always using the last value in the splitted data table.
Is there any automatic way to do it so that I end up with a table like the following?
Here there are two rows for IDs 2 and 4 representing the different values of C in the original data.
Thanks for your help!
Sebastian
Just combining/summarizing what was mentioned. Make sure you sort the data before hand or the Lag() function will have issues. Also, using a formula will be faster than a for each row() command. If you are worried about post sorting that may affect the lag() function later on, just delete the formula to avoid conflicts. Thus:
NamesDefaultToHere(1);
// Table
dt1 = New Table( "Sebastian",
New Column( "ID",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values( [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 2, 2] )
),
New Column( "ID2",
Character,
"Nominal",
Set Values(
{"A", "B", "C", "D", "A", "B", "C", "C", "A", "B", "C", "D", "A", "B",
"C", "C", "C", "B"}
)
),
New Column( "Value",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values(
[111, 222, 333, 444, 111, 222, 333, 555, 111, 222, 555, 444, 111, 222,
555, 666, 777, 888]
)
)
);
// the table needs to be sorted for the formula in the next step to work
dt1 << Sort(Replace table,
By( :ID, :ID2 ),
Order( Ascending, Ascending )
);
// Add an auxiliary column
dt1 << New Column( "Repeat",
Numeric,
"Nominal",
Format( "Best", 12 ),
Formula( If(:ID2 != Lag(:ID2, 1) | Row() == 1, 1, Lag(:Repeat, 1) + 1) )
);
:Repeat << Delete Formula; // faster than 'for each row()'
// Split
dt2 = dt1 << Split(
Split By( :ID2 ),
Split( :Value ),
Group( :ID, :Repeat )
);
// Delete aauxiliary column
dt2 << deleteColumn(:Repeat);
I think you need an auxiliary column, Sebastian.
NamesDefaultToHere(1);
// Table
dt1 = New Table( "Sebastian",
Add Rows( 16 ),
New Column( "ID",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values( [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4] )
),
New Column( "ID2",
Character,
"Nominal",
Set Values(
{"A", "B", "C", "D", "A", "B", "C", "C", "A", "B", "C", "D", "A", "B",
"C", "C"}
)
),
New Column( "Value",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values(
[111, 222, 333, 444, 111, 222, 333, 555, 111, 222, 555, 444, 111, 222,
555, 666]
)
)
);
// Add an auxiliary column
Wait(3);
dt1 << New Column( "Repeat",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values( [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2] )
);
// Split
dt2 = dt1 << Split(
Split By( :ID2 ),
Split( :Value ),
Group( :ID, :Repeat )
);
// Delete auxilliary column
Wait(3);
dt2 << deleteColumn(:Repeat);
Thanks for the support Ian!
That was more or less what I was doing. The problem in the real world is of course that the data set has ~3k rows and I could not imagine a formula to create the repeat-column automatically rather than manually.
Do you know a solution for that as well? I would like to avoid writing a script that nests multiple for-loops to generate the "repeat"-column. That sounds so inefficient.
Thanks,
Sebastian
Oops! Sorry, yes Sebastian. Maybe something like?
// Add an auxiliary column
Wait( 3 );
dt1 << New Column( "Repeat",
Numeric,
"Nominal",
Format( "Best", 12 ),
Formula( If(:ID2 != Lag(:ID2, 1) | Row() == 1, 1, Lag(:Repeat, 1) + 1) )
);
hi sebastianhoffmeister,
try this:
dt = New Table( "Original Data",
Add Rows( 16 ),
New Column( "ID",
Numeric,
Continuous,
Format( "Best", 12 ),
Set Values( [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4] )
),
New Column( "ID2",
Character,
Nominal,
Set Values(
{"A", "B", "C", "D", "A", "B", "C", "C", "A", "B", "C", "D", "A", "B",
"C", "C"}
)
),
New Column( "Value",
Numeric,
Continuous,
Format( "Best", 12 ),
Set Values(
[111, 222, 333, 444, 111, 222, 333, 555, 111, 222, 555, 444, 111, 222,
555, 666]
)
)
);
// the table needs to be sorted for the formula in the next step to work
dt << Sort(Replace table,
By( :ID, :ID2 ),
Order( Ascending, Ascending )
);
// introduce an new index that will recognize the duplications
dt << new column ("ID3");
for each row (If( // do not use a formula in the column since it is dependent on sorting that may change later.
Row() == 1, ID3 =:ID,
:ID != Lag( :ID, 1 ) | :ID2 != Lag( :ID2, 1 ),ID3 = :ID,
ID3 = Lag( :ID3, 1 ) + 1
));
// now you can split the table
dtsorted = dt << Split(
Split By( :ID2 ),
Split( :Value ),
Group( :ID3, :ID ),
Output Table( "split table" )
);
best,
ron
Just combining/summarizing what was mentioned. Make sure you sort the data before hand or the Lag() function will have issues. Also, using a formula will be faster than a for each row() command. If you are worried about post sorting that may affect the lag() function later on, just delete the formula to avoid conflicts. Thus:
NamesDefaultToHere(1);
// Table
dt1 = New Table( "Sebastian",
New Column( "ID",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values( [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 2, 2] )
),
New Column( "ID2",
Character,
"Nominal",
Set Values(
{"A", "B", "C", "D", "A", "B", "C", "C", "A", "B", "C", "D", "A", "B",
"C", "C", "C", "B"}
)
),
New Column( "Value",
Numeric,
"Nominal",
Format( "Best", 12 ),
Set Values(
[111, 222, 333, 444, 111, 222, 333, 555, 111, 222, 555, 444, 111, 222,
555, 666, 777, 888]
)
)
);
// the table needs to be sorted for the formula in the next step to work
dt1 << Sort(Replace table,
By( :ID, :ID2 ),
Order( Ascending, Ascending )
);
// Add an auxiliary column
dt1 << New Column( "Repeat",
Numeric,
"Nominal",
Format( "Best", 12 ),
Formula( If(:ID2 != Lag(:ID2, 1) | Row() == 1, 1, Lag(:Repeat, 1) + 1) )
);
:Repeat << Delete Formula; // faster than 'for each row()'
// Split
dt2 = dt1 << Split(
Split By( :ID2 ),
Split( :Value ),
Group( :ID, :Repeat )
);
// Delete aauxiliary column
dt2 << deleteColumn(:Repeat);
thanks msharp,
by the way, in a previous post @MS offerd this formula for indexing :Sum( :ID[Index( 1, Row() )] == :ID )@ms offerd this formula for indexing :Sum( :ID[Index( 1, Row() )] == :ID )
Re: Syntax: How do I create an index column by ID
is there a way for using it in this context? with two index columns and one of them is alphabetic?
best,
ron
I would stay away from this approach in general, static data is generally safer than dynamic data. You have to ask yourself how important is the order of the data? This solution will keep the keys in order, but the values are free to roam. BUT...
That solution will only work with numeric columns. If we were to replace :ID2 with :ID2Num where A's = 1, B's =2 ect. you could simply use an & statement, thus :Sum( :ID[Index( 1, Row() )] == :ID & :ID2Num[Index( 1, Row() )] == :ID2Num ) for the formula.
If you have character columns you'll have to be more creative. Thus for the formula:
length(concat items(:ID2[Index( 1, Row() )],"")) - length(substitute(concat items(:ID2[Index( 1, Row() )],""), :ID2,""))
or
dt1 << New Column("Concat", Character, Nominal, Formula(:ID2 || char(:ID)));
dt1 << New Column("Repeat", Character, Nominal, Formula((length(concat items(:Concat[Index( 1, Row() )],"")) - length(substitute(concat items(:Concat[Index( 1, Row() )],""), :Concat,"")))/2)); //the 2 at the end accounts for the length of the concatenated string
There's a lot going on here, and I don't have the energy to explain it now. But let me know if you have questions; really, there's a more viable simpler solution above.
thanks msharp,
it works great.
and you are write, keeping things simple is probably the best way to go.