Hi,
I'm looking for some inspiration on how to approach a data set. I'm looking at demographics data for a group of people where I need to identify potential duplicate persons. Some parts of the data are accurate (birth date, sex, country) and other parts are not (height, weight). This means that even when height and weight are not identical for two records, they can still belong to the same person.
So within every "birth date, sex, country" combination, I need to check whether the differences in height and weight between each potential pair are small enough for the records to be the same person.
I have looked into matched pairs, but that would limit me to sets of exactly two identical records and would miss sets of three or more. I have also looked into the standard deviation for each "birth date, sex, country" combination, but that describes the group as a whole and does not cluster the records that are very close to each other within the group.
I have done some visualisations that get me some of the way, but with very large datasets coming up I need to be able to zoom in on the more realistic duplicates (e.g. body weight differences within 5 kg and height differences within 2 cm).
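To make concrete what I am after, here is a rough Python sketch of the idea, not something I have running on the real data. It assumes a pandas DataFrame with hypothetical column names (birth_date, sex, country, weight_kg, height_cm); the function name and the dup_group / is_candidate columns are also just made up for illustration. It compares every pair within a "birth date, sex, country" combination and links records that fall within both tolerances, so sets of three or more end up in the same group.

```python
import pandas as pd

def flag_candidate_duplicates(df, max_weight_diff=5.0, max_height_diff=2.0):
    """Label rows that may be the same person (hypothetical column names)."""
    keys = ["birth_date", "sex", "country"]  # the fields assumed to be accurate
    df = df.reset_index(drop=True)
    parent = list(range(len(df)))  # union-find forest, one node per row

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Only compare rows that share the exact same birth date, sex and country.
    for _, block in df.groupby(keys, sort=False):
        idx = block.index.to_list()
        for pos, a in enumerate(idx):
            for b in idx[pos + 1:]:
                close_weight = abs(df.at[a, "weight_kg"] - df.at[b, "weight_kg"]) <= max_weight_diff
                close_height = abs(df.at[a, "height_cm"] - df.at[b, "height_cm"]) <= max_height_diff
                if close_weight and close_height:
                    parent[find(a)] = find(b)  # link the two rows into one group

    out = df.copy()
    out["dup_group"] = [find(i) for i in range(len(df))]
    # A row is a candidate duplicate if its group has more than one member.
    group_sizes = out["dup_group"].value_counts()
    out["is_candidate"] = out["dup_group"].map(group_sizes) > 1
    return out
```

Two things I am unsure about with this sketch: the linking is transitive, so if A is close to B and B is close to C, all three land in one group even when A and C differ by more than the tolerance; and the pairwise comparison is quadratic within each combination, which is probably fine only as long as the combinations stay small.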
What would be a good approach here?
Br Julie