Solved: Re: How to Join and get common matching columns

a01 · Feb 3, 2015 05:29 PM

One of the things that I'm struggling with in JMP is the fact that, when I want to do a many-for-many table join using the "Join" function on matching columns, then the match columns in the resulting table always get/stay segregated between each of the two source tables. JMP doesn't seem to provide an automatic or convenient way to get the common match column(s) with the values filled in for all rows regardless of which source table the key was present in. It would be able to get such common fully-filled match column(s), with "presence" columns to indicate whether that key was present in each source table.

Let me illustrate with an example. Suppose I have two tables, "ArtClass" and "GymClass". The "primary key" in each table is the combination of "FirstName" and "LastInitial".

ArtClass:

FirstName	LastInitial	Favorite color	Score
John	S	Red	4
John	B	Blue	5
Alice	A	Green	6
Alice	M	Yellow	7
Bob	S	Purple	6
Sally	C	Blue	5

GymClass:

FirstName	LastInitial	Favorite activity	Score
John	S	Monkey bars	8
John	A	Shot put	7
Alice	P	Soccer	6
Bob	S	Basketball	5
Sally	C	Field hockey	5
Bob	M	Crunches	6

If I do a Join on Matching Columns, equating FirstName and LastInitial, without dropping multiples and including non-matches in both tables (an "outer join"), then I obtain the following, which we'll call "Students1" (Edit: Please forgive me if any of these tables are getting cut off the screen; maybe try copying this into a MS Word document because of how this webpage displays.):

Students1:

FirstName of ArtClass	LastInitial of ArtClass	Favorite color	Score of ArtClass	FirstName of GymClass	LastInitial of GymClass	Favorite activity	Score of GymClass
				John	A	Shot put	7
				Alice	P	Soccer	6
				Bob	M	Crunches	6
John	S	Red	4	John	S	Monkey bars	8
John	B	Blue	5
Alice	A	Green	6
Alice	M	Yellow	7
Bob	S	Purple	6	Bob	S	Basketball	5
Sally	C	Blue	5	Sally	C	Field hockey	5

What I would much rather get is the following kind of table, which we'll call "Students2":

Students2:

FirstName	LastInitial	Favorite color	Score of ArtClass	Present in ArtClass	Favorite activity	Score of GymClass	Present in GymClass
Alice	A	Green	6	1			0
Alice	M	Yellow	7	1			0
Alice	P			0	Soccer	6	1
Bob	M			0	Crunches	6	1
Bob	S	Purple	6	1	Basketball	5	1
John	A			0	Shot put	7	1
John	B	Blue	5	1			0
John	S	Red	4	1	Monkey bars	8	1
Sally	C	Blue	5	1	Field hockey	5	1

Notice how in the second version, I have just one common set of the matching columns -- FirstName and LastInitial -- which is fully populated for all rows. I also have columns to tell me if the key was present in each of the source tables (ArtClass and GymClass). In contrast, in the first version (Students1), the only way you can tell if each key was present in each source table, is by seeing if the segregated key columns ("FirstName of ArtClass"/"LastInitial of ArtClass", "FirstName of GymClass"/"LastInitial of GymClass") are missing or present.

A way that I contrived the second version, Students2, was by doing the following roundabout procedure:

Subset or Summary each of the two source tables (ArtClass and GymClass) by the match columns, "FirstName" and "LastInitial".
Concatenate the two subset tables made in #1.
Summary the table made in #2 by the match columns ("FirstName", "LastInitial") to remove duplicates and get the unique keys of the joined table; remove the "N Rows" column.
In ArtClass, temporarily add a constant column "Present in ArtClass" set to all ones.
In GymClass, temporarily add a constant column "Present in GymClass" set to all ones.
In ArtClass, temporarily change "Score" to "Score of ArtClass" to disambiguate this column from the one in GymClass.
In GymClass, temporarily change "Score" to "Score of GymClass" to disambiguate this column from the one in ArtClass.
In the table made in #3, Update from each of the two temporarily-modified source tables (ArtClass and GymClass), matching on "FirstName" and "LastInitial" and adding all columns. (2 separate operations)
In the same table, select the columns "Present in ArtClass" and "Present in GymClass"; do Standardize Attributes and Recode the missing value to be 0 instead, so that the "presence" values are 0 or 1. This is the desired result, Students2.

Note that, in this procedure for making Students2, I had to manually segregate the conflicting non-key column, "Score", which the Join function had originally done for me automatically when making Students1 (by changing each "Score" to "Score of..."), which is what I wanted it to do. However, the problem with the Join function is that it does the same thing for the key columns, which I do not want it to do -- I want the key columns to be common for the entire joined table, with separate columns to flag the presence in each source table. My procedure above is not scalable to situations where you might have many more conflicting non-key columns for your data.

The only alternative method I've realized, is to use the Join function as before, but to manually add a formula column for every single match column that would combine the segregated pair of that match column (e.g. "FirstName" would be a formula column on "FirstName of ArtClass" and "FirstName of GymClass"), and also add formula columns for presence in each source table that would have to test if all of the segregated match columns corresponding to each source table are missing or not (for example, "Present in ArtClass" would be a forumula of whether both "FirstName of ArtClass" and "LastInitial of ArtClass" are missing). This is also a very tedious, roundabout process.

I'm wondering if there's a better, efficient solution for what I'm trying to do -- i.e. to generate Students2 in my example? Would it be with a script or add-in of some kind?

In addition, would SAS potentially consider something like this as a future feature extension for JMP?

Appreciate any assistance with this.

Jeff_Perkinson · Feb 3, 2015 05:54 PM

There's a "Merge same name columns" option in the Join dialog that will do as its name suggests and get you one set of First Name and Last Initial columns.

In your example, you've got a column named Score in each table so you'll need to rename them to Gym Score and Art Score. After doing that you can complete the Join dialog like this:

And you get a data table like this:

The Match Flag column indicates which tables contributed to each row. 1==Main Table, 2==With Table, 3 == Both.

I think that's what you're looking for, or awfully close anyway.

-Jeff

View solution in original post

Jeff_Perkinson · Feb 3, 2015 05:54 PM

There's a "Merge same name columns" option in the Join dialog that will do as its name suggests and get you one set of First Name and Last Initial columns.

In your example, you've got a column named Score in each table so you'll need to rename them to Gym Score and Art Score. After doing that you can complete the Join dialog like this:

And you get a data table like this:

The Match Flag column indicates which tables contributed to each row. 1==Main Table, 2==With Table, 3 == Both.

I think that's what you're looking for, or awfully close anyway.

-Jeff

a01 · Feb 3, 2015 06:20 PM

Ah, excellent! (Thanks for the fast response.) I guess that was right under my nose in a way, and I can't quite remember if I might have tried that checkbox a long time ago and forgotten about what it does. I'd say that solves the majority of this problem, because it's pretty simple to Recode "Match Flag" (e.g. into "ArtClass"/"GymClass"/"Both") or split it up into two separate "presence" columns if desired (like I had in Students2).

I suppose the only caveat here, however, is that there could be a lot of manual column renaming required if there are many analogous non-key columns with common names like "Score" in my example. It looks like if you don't rename the common non-key column(s), then apparently those column(s) in the resulting joined table seem to take the value arbitrarily from one source table or the other. (Perhaps there is some internal method as to which one "wins".)

In this way, the "Join" utility is giving us a choice between either:

(a) disambiguating everything automatically which we want but segregating the key columns which we don't want, when we don't select "Merge same column names", as in generating Students1 in my post, or

(b) commonly combining the match columns and giving a presence-indicating column, which we want, but letting non-key columns with common names conflict with each other, which we don't want -- when we do select "Merge same column names".

Do you know of a good workaround to this caveat?

Otherwise, thanks for pointing this out!

Jeff_Perkinson · Feb 4, 2015 08:52 AM

I think you've summarized the behaviors and limitations pretty well.

I'll pass them along to our development staff to see if we can make some improvements in a future release.

-Jeff