Find and identify duplicates for unsorted data

CMG — Sat, 10 Jun 2023 23:27:30 GMT

Fairly new to JMP, so my knowledge of how to use formulas is pretty limited.

If I have a character column and want to flag its duplicate values in a separate column, what formula should I use? (My data has 10 columns and is already sorted by a different column's values.)

In the past, I have used the formula below (when the data that needs duplicates identified has been sorted alphabetically).

If(
	Row() == 1, 1,
	:Column1 != Lag( :Column1 ), 1,
	0
)

Since I am not allowed to present data sorted in any other way than the way it comes in, I am kind of at a loss as to how to indicate duplicates.

Any and all help is appreciated.

Re: Find and identify duplicates for unsorted data

Mauro_Gerber — Fri, 19 Mar 2021 15:34:15 GMT

Maybe this line helps

dt = Current Data Table();

dt << Select duplicate rows( Match( :column_1, :column_2, :column_3) );

This selects the rows that are duplicates with respect to the combination of columns 1 to 3 listed above.

With that selection you can delete the rows, hide and exclude them, fill an additional column, or do whatever else you need.

r_select = dt << Get Selected Rows; // matrix of the selected row numbers

dt << Delete Rows( r_select ); // delete the duplicate rows

// or
dt << Hide and Exclude; // hide and exclude the currently selected rows

Re: Find and identify duplicates for unsorted data

CMG — Fri, 19 Mar 2021 14:48:52 GMT

Thank you for the response.

I don't think I explained myself well. What I want is a helper column to indicate if a certain value is duplicate, maybe with a "1" indicating all duplicates, and a "0" indicating no duplicates. I do not want to hide or delete the duplicates.

Re: Find and identify duplicates for unsorted data

jthi — Fri, 19 Mar 2021 15:01:49 GMT

Below are at least two ways to do this with scripting:

Names Default To Here(1);

dt = New Table("Untitled",
	Add Rows(7),
	Compress File When Saved(1),
	New Column("Column 1",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([1, 1, 2, 3, 2, 5, 1]),
		Set Display Width(60)
	)
);


// Option 1: formula column
dt << New Column("Duplicate_formula", Numeric, <<Formula(
	If(Col Rank(1, :Column 1) > 1, 1, 0))
);

// Option 2: select the duplicate rows, then write 1/0 into a new column
dt << Select Duplicate Rows(Match(:Column 1));
dupRows = dt << Get Selected Rows;
dt << New Column("Duplicates", Numeric, Nominal);
Column(dt, "Duplicates")[dupRows] = 1;
Column(dt, "Duplicates")[dt << Get Rows Where(IsMissing(:Duplicates))] = 0;

You can also do this without any scripting:

Select columns you are interested in.
From the Rows menu: Row Selection -> Select Duplicate Rows
Rows menu: Row Selection -> Name Selection In Column
Choose name, Selected as 1 and Unselected as 0
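
The menu steps above can also be scripted. This is only a minimal sketch: it assumes the current table has a column named Column 1, and that the Name Selection in Column message takes Column Name, Selected, and Unselected arguments matching the fields of the menu dialog. The column name "Is duplicate" is made up for illustration.

```jsl
Names Default To Here( 1 );
dt = Current Data Table();

// Select the rows JMP treats as duplicates on :Column 1.
dt << Select Duplicate Rows( Match( :Column 1 ) );

// Record the selection in a new column: 1 for selected rows, 0 for the rest.
dt << Name Selection in Column(
	Column Name( "Is duplicate" ), // hypothetical column name
	Selected( 1 ),
	Unselected( 0 )
);

// Clear the selection so the table is left as it was.
dt << Clear Select;
```

Because nothing is sorted, hidden, or deleted, the row order stays exactly as the data came in.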

Re: Find and identify duplicates for unsorted data

txnelson — Fri, 19 Mar 2021 15:36:35 GMT

You can also do this with a formula column. The formula below uses the Big Class sample data table as an example:

If( Row() == 1,
	Current Data Table() << select duplicate rows(
		Match( :age, :sex, Empty() )
	)
);
If( Selected( Row State( Row() ) ),
	"Group 1",
	"Group 2"
);

Re: Find and identify duplicates for unsorted data

ms — Fri, 19 Mar 2021 22:09:46 GMT

Here's another example using a column formula.

dt = Open( "$SAMPLE_DATA/Big Class.jmp" );

// Create a column that indicates duplicate names (there are two "Robert")

dt << New Column( "Duplicate", Formula( If( Col Number( Row(), :name ) > 1, 1, 0 ) ) );
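
To compare the two formula approaches from this thread on unsorted character data, here is a small sketch. The table, its values, and the new column names are invented for illustration. It also assumes Col Rank breaks ties in row order, which is what the Col Rank formula earlier in the thread relies on.

```jsl
Names Default To Here( 1 );

// Hypothetical unsorted data: "B" and "A" each appear twice.
dt = New Table( "Duplicate demo",
	Add Rows( 6 ),
	New Column( "Column1", Character, "Nominal",
		Set Values( {"B", "A", "C", "B", "A", "D"} )
	)
);

// 1 on every row whose value occurs more than once, first occurrence included.
dt << New Column( "All duplicates", Numeric, "Nominal",
	Formula( If( Col Number( Row(), :Column1 ) > 1, 1, 0 ) )
);

// 1 only on the second and later occurrences (assuming row-order tie breaking).
dt << New Column( "Repeats only", Numeric, "Nominal",
	Formula( If( Col Rank( 1, :Column1 ) > 1, 1, 0 ) )
);
```

The Col Number version is the one that matches a helper column with "1" on all duplicates. Neither formula needs the data to be sorted.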

Re: Find and identify duplicates for unsorted data

CMG — Mon, 22 Mar 2021 17:25:46 GMT

Thank you very much. I was able to mark the duplicates without any scripting.

Appreciate your help!