One of the things that I'm struggling with in JMP is the fact that, when I want to do a many-for-many table join using the "Join" function on matching columns, then the match columns in the resulting table always get/stay segregated between each of the two source tables. JMP doesn't seem to provide an automatic or convenient way to get the common match column(s) with the values filled in for all rows regardless of which source table the key was present in. It would be able to get such common fully-filled match column(s), with "presence" columns to indicate whether that key was present in each source table.
Let me illustrate with an example. Suppose I have two tables, "ArtClass" and "GymClass". The "primary key" in each table is the combination of "FirstName" and "LastInitial".
ArtClass:
FirstName | LastInitial | Favorite color | Score |
---|
John | S | Red | 4 |
John | B | Blue | 5 |
Alice | A | Green | 6 |
Alice | M | Yellow | 7 |
Bob | S | Purple | 6 |
Sally | C | Blue | 5 |
GymClass:
FirstName | LastInitial | Favorite activity | Score |
---|
John | S | Monkey bars | 8 |
John | A | Shot put | 7 |
Alice | P | Soccer | 6 |
Bob | S | Basketball | 5 |
Sally | C | Field hockey | 5 |
Bob | M | Crunches | 6 |
If I do a Join on Matching Columns, equating FirstName and LastInitial, without dropping multiples and including non-matches in both tables (an "outer join"), then I obtain the following, which we'll call "Students1" (Edit: Please forgive me if any of these tables are getting cut off the screen; maybe try copying this into a MS Word document because of how this webpage displays.):
Students1:
FirstName of ArtClass | LastInitial of ArtClass | Favorite color | Score of ArtClass | FirstName of GymClass | LastInitial of GymClass | Favorite activity | Score of GymClass |
---|
| | | | John | A | Shot put | 7 |
| | | | Alice | P | Soccer | 6 |
| | | | Bob | M | Crunches | 6 |
John | S | Red | 4 | John | S | Monkey bars | 8 |
John | B | Blue | 5 | | | |
Alice | A | Green | 6 | | | |
Alice | M | Yellow | 7 | | | |
Bob | S | Purple | 6 | Bob | S | Basketball | 5 |
Sally | C | Blue | 5 | Sally | C | Field hockey | 5 |
What I would much rather get is the following kind of table, which we'll call "Students2":
Students2:
FirstName | LastInitial | Favorite color | Score of ArtClass | Present in ArtClass | Favorite activity | Score of GymClass | Present in GymClass |
---|
Alice | A | Green | 6 | 1 | | | 0 |
Alice | M | Yellow | 7 | 1 | | | 0 |
Alice | P | | | 0 | Soccer | 6 | 1 |
Bob | M | | | 0 | Crunches | 6 | 1 |
Bob | S | Purple | 6 | 1 | Basketball | 5 | 1 |
John | A | | | 0 | Shot put | 7 | 1 |
John | B | Blue | 5 | 1 | | | 0 |
John | S | Red | 4 | 1 | Monkey bars | 8 | 1 |
Sally | C | Blue | 5 | 1 | Field hockey | 5 | 1 |
Notice how in the second version, I have just one common set of the matching columns -- FirstName and LastInitial -- which is fully populated for all rows. I also have columns to tell me if the key was present in each of the source tables (ArtClass and GymClass). In contrast, in the first version (Students1), the only way you can tell if each key was present in each source table, is by seeing if the segregated key columns ("FirstName of ArtClass"/"LastInitial of ArtClass", "FirstName of GymClass"/"LastInitial of GymClass") are missing or present.
A way that I contrived the second version, Students2, was by doing the following roundabout procedure:
- Subset or Summary each of the two source tables (ArtClass and GymClass) by the match columns, "FirstName" and "LastInitial".
- Concatenate the two subset tables made in #1.
- Summary the table made in #2 by the match columns ("FirstName", "LastInitial") to remove duplicates and get the unique keys of the joined table; remove the "N Rows" column.
- In ArtClass, temporarily add a constant column "Present in ArtClass" set to all ones.
- In GymClass, temporarily add a constant column "Present in GymClass" set to all ones.
- In ArtClass, temporarily change "Score" to "Score of ArtClass" to disambiguate this column from the one in GymClass.
- In GymClass, temporarily change "Score" to "Score of GymClass" to disambiguate this column from the one in ArtClass.
- In the table made in #3, Update from each of the two temporarily-modified source tables (ArtClass and GymClass), matching on "FirstName" and "LastInitial" and adding all columns. (2 separate operations)
- In the same table, select the columns "Present in ArtClass" and "Present in GymClass"; do Standardize Attributes and Recode the missing value to be 0 instead, so that the "presence" values are 0 or 1. This is the desired result, Students2.
Note that, in this procedure for making Students2, I had to manually segregate the conflicting non-key column, "Score", which the Join function had originally done for me automatically when making Students1 (by changing each "Score" to "Score of..."), which is what I wanted it to do. However, the problem with the Join function is that it does the same thing for the key columns, which I do not want it to do -- I want the key columns to be common for the entire joined table, with separate columns to flag the presence in each source table. My procedure above is not scalable to situations where you might have many more conflicting non-key columns for your data.
The only alternative method I've realized, is to use the Join function as before, but to manually add a formula column for every single match column that would combine the segregated pair of that match column (e.g. "FirstName" would be a formula column on "FirstName of ArtClass" and "FirstName of GymClass"), and also add formula columns for presence in each source table that would have to test if all of the segregated match columns corresponding to each source table are missing or not (for example, "Present in ArtClass" would be a forumula of whether both "FirstName of ArtClass" and "LastInitial of ArtClass" are missing). This is also a very tedious, roundabout process.
I'm wondering if there's a better, efficient solution for what I'm trying to do -- i.e. to generate Students2 in my example? Would it be with a script or add-in of some kind?
In addition, would SAS potentially consider something like this as a future feature extension for JMP?
Appreciate any assistance with this.