Solved: create new id to deal with multiple to multiple relationship

joann · Jun 8, 2023 5:29 PM

*Update

Sorry I should be more clear on the question. The data i currently have looks like:

id email

1 a

1 b

2 a

2 c

3 d

We can see that id 1 2 and email a b c are actually the same person because the info overlaps.

What I'm trying to do here is to come up with a match table with a new key which i can use to match back to the original table to identify those overlap ones are actually the same person:

newID id/email

111 1

111 2

111 a

111 b

111 c

222 d

The newID doesn't need to be in a specific form, it can even equal to the original id, as long as we can use it to identify a "unique person".

-------------------------------------

Hi, I have data with 2 columns, one is individual id, and one is email. The problem with this data is that the relationship between individual id and email is multiple to multiple, meaning one individual id could has multiple emails, and one email could have multiple individual ids. Thus i’d need to create a mapping table of a new unique id to all individualID/email, which specifies that those individual ids or emails if any overlap are actually the same person. I couldn’t really think of a way to do it without script. Could anyone help? Thank you!

jthi · Mar 8, 2021 12:34 AM

I modified the script a little (changed recursion to while loop as recursion might be a bit difficult from time to time to understand). Also changed column names.

Names Default To Here(1);
//https://community.jmp.com/t5/Discussions/create-new-id-to-deal-with-multiple-to-multiple-relationship/m-p/365298#M61470
dt = New Table("Untitled",
	Add Rows(20),
	Compress File When Saved(1),
	New Column("Individualid",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 8, 9])
	),
	New Column("Email_HASHED",
		Character,
		"Nominal",
		Set Values(
			{"a", "b", "a", "c", "d", "e", "f", "g", "a", "h", "i", "j", "k", "l",
			"m", "n", "o", "p", "q", "b"}
		)
	)
);

dt << Size Window(500,900);
dt << New Column("newid", numeric, ordinal);

//setup first values before While loop
rowList = [1];
idList = dt:Individualid[rowList,0];
emailList = dt:Email_HASHED[rowList,0];
index = 1;
//looping
While(1,
	oldRowList = rowList;
	rowList = dt << Get Rows Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED));
	dt << Select Where(Contains(idList, :Individualid) | Contains(emailList, :Email_HASHED)); //for visualization
	wait(1); //for visualization
	If(N Items(oldRowList) != N Items(rowList),
		idList = dt:Individualid[rowList,0];
		emailList = dt:Email_HASHED[rowList,0],
		If(All(oldRowList == rowList),
			dt:newid[rowList] = index;
			newRow = Contains(dt:newid << get as matrix, .);
			If(newRow == 0,
				break(),
			);
			index += 1;
			rowList = Matrix(newRow);
			idList = dt:Individualid[rowList,0];
			emailList = dt:Email_HASHED[rowList,0];
		);
	);
);

You seen to have removed the dt references from your modified version, that could cause issues.

-Jarmo

View solution in original post

pauldeen · Mar 5, 2021 04:02 AM

So where would you get the info that tells you what unique number is a unique person? Based on only these two columns you are describing you have no other choice than to make one of the column the unique identifier and map the other one on to it. Maybe post some example data?

dale_lehman · Mar 5, 2021 07:05 AM

I don't think I understand what you want. It sounds like you could just concatenate the two columns and you would have a unique person/email combination for each observation. If you want each person or each email in its own row, just use Tabulate. I've attached a sample table showing the multiple-multiple mapping, with a Tabulation as well as a new column with a unique identifier. Perhaps you can use this example to tell us what you want.

pauldeen · Mar 5, 2021 08:40 AM

I think what is needed is a table like:

Person	ID	email
A	10,15	a@company.com, e@company.com, g@company.com
B	11	b@company.com, e@company.com, g@company.com
C	12,16,18	c@company.com, f@company.com, g@company.com

from a table that looks like this:

ID	email
10	...
11	...
12	..
15	..
16	..
17	..
18	..

But it is unclear to me how you would know which ID's and which email's belong to the same person so that is what my question was about.

dale_lehman · Mar 5, 2021 08:55 AM

I think you can do that by tabulating (as in my example), then making into a data table and then creating a column concatenating (multiple concatenates if you want to insert the ", ").

joann · Mar 5, 2021 11:11 AM

Hi Dale! I updated the info in the original post. Please check it, thank you!

joann · Mar 5, 2021 11:10 AM

Hi Paul! I updated the info in the original post. Please check it, thank you!

dale_lehman · Mar 6, 2021 07:12 AM

I think you need someone to script this. I've tried a number of ways but can't seem to get anything that works - and so many steps were involved that it doesn't seem worth it. I do find the situation a bit strange however - you have people with multiple email addresses, as is common, but you also have multiple people using the same email address, which also happens. However, you want to identify unique people - I don't see why you would say that two people sharing the same email address are really only one person? In your example, you appear to want to exclude person id #2 as a unique person because they share email addresses with others. But why would that mean they are not a unique person? Is it possible that you are complicating this problem more than necessary? My inclination would be to just focus on unique ids and then want the list of email addresses that are associated with them, a simple tabulate function.

joann · Mar 7, 2021 04:52 PM

Thank you Dale! Yes it's a valid point but for this business case we decided to treat people with the same email as one person in order to calculate the sequence of their other behavior.

jthi · Mar 6, 2021 08:57 AM

This definitely seems a bit weird way to with this as @dale_lehman said. You could script this at least with recursion. Below is very messy example code which seemed to work at least with my example data (if I understood the idea correctly):

Names Default To Here(1);

dt = New Table("Untitled",
	Add Rows(20),
	Compress File When Saved(1),
	New Column("id",
		Numeric,
		"Continuous",
		Format("Best", 12),
		Set Values([1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 8, 9])
	),
	New Column("email",
		Character,
		"Nominal",
		Set Values(
			{"a", "b", "a", "c", "d", "e", "f", "g", "a", "h", "i", "j", "k", "l",
			"m", "n", "o", "p", "q", "b"}
		)
	)
);

dt << Size Window(500,900);
dt << New Column("newid", numeric, ordinal);

createIndex = function({tempRows, index},
	wait(0.5); //for visualization
	idList = dt:id[tempRows,0];
	emailList = dt:email[tempRows,0];
	tempRowsNew = dt << Get Rows Where(Contains(idList, :id) | Contains(emailList, :email));
	dt << Select Where(Contains(idList, :id) | Contains(emailList, :email)); //for visualization
	wait(0.5); //for visualization
	If(N Items(tempRows) != N Items(tempRowsNew),
		ids = dt:id[tempRowsNew,0];
		emails = dt:email[tempRowsNew,0];
		Recurse(tempRowsNew, index),
		If(All(tempRows == tempRowsNew),
			dt:newId[tempRowsNew] = index;
			newRow = Contains(dt:newId << get as matrix, .);
			If(newRow == 0,
				dt << Clear Select;
				return("done"),
				ids = dt:id[newRow,0];
				emails = dt:email[newRow,0];
				Recurse(matrix(newRow), index + 1);
			);

		)
	);
);

wait(1);
createIndex([1], 1);

I added extra Select Where and Waits inside the function to try to show what is going on.

-Jarmo

create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship

Re: create new id to deal with multiple to multiple relationship