Discussions

JulieSAppel · Jul 6, 2020 11:58 AM

Hi,

I´m looking for some inspiration on how to approach a data set. I´m looking at demographics data for a group of people where I need to identify potential duplicate persons. Some parts of the data are accurate (birth date, sex, country) and other parts are not (height, weight). This means, that even though height & weight is not identical for two persons, they can actually be the same.

So within every "birth date, sex, country" combination, I need to check if differences in height and weight for the different potential pairs are very far apart or not.

I have looked into matched pairs but then it would be limited to only sets of two that can be identical and not e.g. sets of 3 or more. I have also looked into std dev for each "birth date, sex, country" combination but that would look at that entire group and not cluster those that seem to be very close within the group.

I have done some visualisations that gets me some of the way but with very large datasets coming up I need to be able to zoom in on the more realistic duplicates (e.g. body weight differences within 5 kg and height within 2 cm).

What would be a good approach here?

Br Julie

dale_lehman · Jul 6, 2020 12:38 PM

My understanding of your data is that each "person" in the data set is identified as if they are a unique person, but even when their birthdate, sex, and country are the same, they may or may not be different people. I think you are wanting to see if their heights and weights (which are subject to measurement error) are "close enough" to believe they are really the same person - if they are not "close enough" then they will be considered different people (albeit with the same birthdate, sex, and country). If that is correct, then the Tabulate script that I attached to your sample data may help. Others may suggest better ways, but this might work. If you run the script and save the tabulation as a data table, you can then use functions such as ColRank, ColMin, and ColMax to investigate the cases where there are multiple potential people.

This is a bit messy, and I'll be interested to see if someone has a better way to do this.

martindemel · Jul 10, 2020 07:34 AM

Have you thought about clustering with some prearrangements of the data?

1. Cols Menu, Utilities, Combine Columns for Birth, Sex, Country (columns need to be selected in advance) . Using "_" (underscore) as delimiter. This will create a new column where you have these combinations. As you mentioned you want to check only for each combination of these three fixed parameters it is a nice effective way to achive this grouping of actually three columns to just one column.

2. Hierarchical Cluster or KMeans Cluster for Height and Weight with label Student ID, and the new group column as BY GROUP, can be done.

As you will have many combinations where just one person is in the data for a group, you will get a warning. For the rest you will get dendograms for each group. If they are likely too be similar they will be put into the same cluster.

Just another approach, which will work better if you'd have more data. Attached a modified file

/****NeverStopLearning****/

Discussions

Comparing all pairs within groups

Re: Comparing all pairs within groups

Re: Comparing all pairs within groups

Recommended Articles