Comparing multiple rows in a large table & keeping row with largest value
Jan 4, 2010 9:37 AM(1810 views)
I am somewhat new to JMP...I have a large 1,000,000+ row table (13 columns) that I need to consolidate in the following way:
I have multiple rows for observations on a single case- (not each case is observed the same number of times- some have 3 observations, some 10). I want to construct a summary table, keeping the observation with the largest values for each unique case.
How do you define "the observation with the largest values for each unique case"? Assuming that one of the columns contains the case id, that leaves 12 columns each of which could have a "largest value" for each unique case.
If you have a column with an unique case identifier and assuming that any multiple measurements/variables for each observation are in the same row (but in different columns), I'd use the Summary command in the Tables menu.
In the Summary dialog, choose to group by your case id and then select the variables that you want in the summary table and choose "Max" in the Statistics drop-down list.
That would give you a table with one row for each case and the highest values for each variable. A column with the number of observations (= nr of rows, incl. those with any missing values) for each case will also be generated automatically.
I wonder what leapfrog is really wanting. I had the vague impression that he wants a subset of the original table. Using Summary the way you describe will produce a table that may have rows not present in the original table.
With "Max" a Summary is equal to a subset, as there is only one maximum value. However if it is important to know exact which one of the multiple observations of each "case" that has the highest value, and not just the value, then he must also group after an observation-id column in addition to the case id. If their are ties, then of course he will get more then one row per case in the summary table.
My hint above assumes that the multiple observations are vertically arranged, i.e. a case with three obs. has three rows, and a case with ten obs. has ten rows etc. However if the number of observations is distributed across columns, with empty cells for cases with less then the max nr of observations, the table must first be stacked in order to use Summary the way I described.
Or maybe I have misunderstood the problem completely...
That is exactly what I meant and I see your point. If he instead want the max from just one column (Y1) and keep the observations in remaining columns as is (i.e a subset), he need to run "Update" on the summary table with the original table (matching columns by ID and Y1). That would create such a subset, except for the additional row count column.
There may still be a problem with ties. Then additional criteria must be considered e.g. based on the values in the other columns.