Discussions

joann · Jun 8, 2023 5:29 PM

*Update

Sorry I should be more clear on the question. The data i currently have looks like:

id email

1 a

1 b

2 a

2 c

3 d

We can see that id 1 2 and email a b c are actually the same person because the info overlaps.

What I'm trying to do here is to come up with a match table with a new key which i can use to match back to the original table to identify those overlap ones are actually the same person:

newID id/email

111 1

111 2

111 a

111 b

111 c

222 d

The newID doesn't need to be in a specific form, it can even equal to the original id, as long as we can use it to identify a "unique person".

-------------------------------------

Hi, I have data with 2 columns, one is individual id, and one is email. The problem with this data is that the relationship between individual id and email is multiple to multiple, meaning one individual id could has multiple emails, and one email could have multiple individual ids. Thus i’d need to create a mapping table of a new unique id to all individualID/email, which specifies that those individual ids or emails if any overlap are actually the same person. I couldn’t really think of a way to do it without script. Could anyone help? Thank you!

joann · Mar 7, 2021 8:37 PM

Thank you so much Jarmo! This outcome out of your script is exactly what i need. I'm not familiar with script so just one following question on how to apply it on my current table. the two column names are "Individualid" and "Email_HASHED", I pasted the below edited script to my table but it showed error. could you debug for me? Thank you!

New Column("newid", numeric, ordinal);

createIndex = function({tempRows, index},
wait(0.5); //for visualization
idList = Individualid[tempRows,0];
emailList = Email_HASHED[tempRows,0];
tempRowsNew = Get Rows Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED));
Select Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED)); //for visualization
wait(0.5); //for visualization
If(N Items(tempRows) != N Items(tempRowsNew),
ids = Individualid[tempRowsNew,0];
emails = Email_HASHED[tempRowsNew,0];
Recurse(tempRowsNew, index),
If(All(tempRows == tempRowsNew),
newId[tempRowsNew] = index;
newRow = Contains(newId << get as matrix, .);
If(newRow == 0,
Clear Select;
return("done"),
ids = Individualid[newRow,0];
emails = Email_HASHED[newRow,0];
Recurse(matrix(newRow), index + 1);
);

)
);
);

wait(1);
createIndex([1], 1);

jthi · Mar 8, 2021 12:34 AM

I modified the script a little (changed recursion to while loop as recursion might be a bit difficult from time to time to understand). Also changed column names.

Names Default To Here(1);
//https://community.jmp.com/t5/Discussions/create-new-id-to-deal-with-multiple-to-multiple-relationship/m-p/365298#M61470
dt = New Table("Untitled",
	Add Rows(20),
	Compress File When Saved(1),
	New Column("Individualid",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 8, 9])
	),
	New Column("Email_HASHED",
		Character,
		"Nominal",
		Set Values(
			{"a", "b", "a", "c", "d", "e", "f", "g", "a", "h", "i", "j", "k", "l",
			"m", "n", "o", "p", "q", "b"}
		)
	)
);

dt << Size Window(500,900);
dt << New Column("newid", numeric, ordinal);

//setup first values before While loop
rowList = [1];
idList = dt:Individualid[rowList,0];
emailList = dt:Email_HASHED[rowList,0];
index = 1;
//looping
While(1,
	oldRowList = rowList;
	rowList = dt << Get Rows Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED));
	dt << Select Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED)); //for visualization
	wait(1); //for visualization
	If(N Items(oldRowList) != N Items(rowList),
		idList = dt:Individualid[rowList,0];
		emailList = dt:Email_HASHED[rowList,0],
		If(All(oldRowList == rowList),
			dt:newid[rowList] = index;
			newRow = Contains(dt:newid << get as matrix, .);
			If(newRow == 0,
				break(),
			);
			index += 1;
			rowList = Matrix(newRow);
			idList = dt:Individualid[rowList,0];
			emailList = dt:Email_HASHED[rowList,0];
		);
	);
);

You seen to have removed the dt references from your modified version, that could cause issues.

-Jarmo

joann · Mar 8, 2021 10:33 AM

Thank you Jarmo! How do I change the dt to my current data table but not creating the new table with the samples you put in?

pauldeen · Mar 8, 2021 10:53 AM

dt = current data table();

or

dt = Data table("Name of table here")

jthi · Mar 8, 2021 11:18 AM

And depending how you get the datatable you can reference it directly:

dt = Open("$SAMPLE_DATA/Big Class.jmp");
dt1 = dt << Subset(All rows);
dt2 = Query(dt, "SELECT name, age, height FROM 'Big Class' WHERE age > 14; ");

and so on

-Jarmo

joann · Mar 8, 2021 05:39 PM

Thank you so much Jarmo! It works!! Really appreciate your help.

joann · Mar 8, 2021 06:04 PM

Hi Jarmo, I forgot to mention that in the email column there are some rows having empty cells, meaning some individualID does not have anything populated under email column. I found that the current script will consider all individual ids having no emails as the same new id. could you help adjust the script to avoid this kind of situation? Thank you!

jthi · Mar 9, 2021 12:54 AM

How should they be handled? Most likely it is one or two if statements.

Small example dataset with correct answers could help.

-Jarmo

joann · Mar 9, 2021 10:40 AM

Hi Jarmo, below is an example. you can see many rows having empty email are all grouped as newid 5. If the individualID is not overlapped with other groups, those without emails should each have their own newid. Thank you!

jthi · Mar 9, 2021 12:34 PM

Couldn't replicate your results (most likely you have different individual ids on the one starting with 308....)

Does the result from this look correct?

Names Default To Here(1);
//https://community.jmp.com/t5/Discussions/create-new-id-to-deal-with-multiple-to-multiple-relationship/m-p/365298#M61470

/*
dt = New Table("Untitled",
	Add Rows(20),
	Compress File When Saved(1),
	New Column("Individualid",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 8, 9])
	),
	New Column("Email_HASHED",
		Character,
		"Nominal",
		Set Values(
			{"a", "b", "a", "c", "d", "e", "f", "g", "a", "h", "i", "j", "k", "l",
			"m", "n", "o", "p", "q", "b"}
		)
	)
);
*/

dt = New Table("Untitled",
	Add Rows(19),
	Compress File When Saved(1),
	New Column("Individualid",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values(
			[139, 100, 229, 301, 308, 284, 682, 136, 586, 256, 122, 711, 629, 116,
			152, 443, 301, 695, 311]
		)
	),
	New Column("Email_HASHED",
		Character,
		"Nominal",
		Set Values(
			{"A648", "CAA", "E7B", "5B2", "", "237A", "", "FE0", "", "7C21", "",
			"920B", "", "D394", "", "2FE1", "", "", ""}
		)
	)
);

dt << Size Window(500,900);
dt << New Column("newid", numeric, ordinal);

//setup first values before While loop
rowList = [1];
idList = dt:Individualid[rowList,0];
emailList = dt:Email_HASHED[rowList,0];
index = 1;
//looping
While(1,
	oldRowList = rowList;
	Remove From(emailList, Contains(emailList, ""));
	rowList = dt << Get Rows Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED));
	//dt << Select Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED)); //for visualization
	//wait(1); //for visualization
	If(N Items(oldRowList) != N Items(rowList),
		idList = dt:Individualid[rowList,0];
		emailList = dt:Email_HASHED[rowList,0],
		If(All(oldRowList == rowList),
			dt:newid[rowList] = index;
			newRow = Contains(dt:newid << get as matrix, .);
			
			newRow = 0;
			For Each Row(dt,
				If(:Email_HASHED != "" & IsMissing(:newId),
					newRow = Row();
					break();
				)
			);
			
			If(newRow == 0,
				break(),
			);
			index += 1;
			rowList = Matrix(newRow);
			idList = dt:Individualid[rowList,0];
			emailList = dt:Email_HASHED[rowList,0];
		);
	);
);
//dt << Clear Select;
//handle missings
index++;
missingRows = dt << Get Rows Where(IsMissing(:newId));
dt:newId[missingRows] = index

-Jarmo

Discussions

create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Recommended Articles