Measuring Genealogical Similarity using the Jaccard Index

For some of the posts on this blog I’ll be using one way to measure the similarity of two sample sets of data. The statistic is called the Jaccard Index, or the Jaccard Similarity Coefficient. This post is a technical explanation of the calculation itself.

The sets of data are the unique ancestral surnames of my DNA matches. The question I’m asking for any two of my matches is: how similar are their lists of direct ancestral surnames?

If two lists of unique surnames are identical, they will have the exact same surnames. They will also have the same number of surnames in their lists, as each surname is only represented once regardless of how many times it appears in the direct tree. They will be 100% similar.

However, I’m also interested in trees that are “nearly” the same. Suppose two siblings create separate trees, and both get as far as all their great-great-grandparents. Tom’s research leads him to one pair of 3rd greats, and Joe finds a different pair. Neither is aware yet of the other’s research, but each has one extra maiden name in his tree. Those lists will be very similar, and I’d like to highlight their similarity in some way.

So I need a way of defining the “similarity” of two lists of surnames. The Jaccard Similarity Index compares two sets (or lists) to see which members (surnames) are shared and which are different. It calculates the percentage of similarity from 0 to 100%. The math is pretty simple, and is described here in understandable terms.

In the simplest terms, we count the intersection of the lists, i.e. the number of surnames common to both trees. We also count the total number of distinct surnames across both lists (the union), which is the size of each list added together minus the shared surnames so that they are not counted twice. The Jaccard index expresses this mathematically as:

J(X,Y) = |X∩Y| / |X∪Y| = |X∩Y| / (|X| + |Y| – |X∩Y|)

Taking our two brothers, Tom and Joe:
|X∩Y| is the number of shared surnames: 8 for the brothers.
|X| is the size of the first set, or the number of surnames in Tom’s tree: 9.
|Y| is the size of the second set, or the number of surnames in Joe’s tree: also 9.

So our equation is: 8 / (9 + 9 – 8) * 100 = 80% similarity for our brothers.

If the brothers had exactly the same trees, they’d be 100% similar. If the postman’s tree had no overlapping surnames with the brothers, his index compared to both would be 0%.
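As a minimal sketch of the calculation in Python (not the code used for the analysis itself), the index can be computed with ordinary set operations. The surnames below are made up purely to reproduce the 8-shared, 9-each example, and treating two empty lists as 0% similar is my own assumption.

```python
def jaccard(x, y):
    """Return the Jaccard index of two collections as a fraction from 0.0 to 1.0."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 0.0  # assumption: two empty lists are treated as 0% similar
    return len(x & y) / len(union)


# Hypothetical surnames: eight shared, plus one extra maiden name each.
shared = {"Smith", "Jones", "Kelly", "Murphy", "Ryan", "Walsh", "Byrne", "Doyle"}
tom = shared | {"Nolan"}
joe = shared | {"Quinn"}
postman = {"Garcia", "Rodriguez"}

print(f"{jaccard(tom, joe):.0%}")      # 80%
print(f"{jaccard(tom, tom):.0%}")      # 100%
print(f"{jaccard(tom, postman):.0%}")  # 0%
```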

So the ultimate task is to compare every surname list within my matches with every other surname list. As the Jaccard index only works on two sets at a time, calculating the similarity across N sets requires N(N – 1)/2 pairwise calculations, which grows roughly with N squared.
This becomes infeasible for large numbers of sets, although there are other methods that can be brought into play to reduce processing time. I had about 4.4 million pairs of sets to compare, which took a matter of hours to complete.
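The all-pairs comparison could be sketched as follows. This is only an illustration under assumed names (`matches`, `pairwise_similarity`) and toy data, not the process I actually ran. Each unordered pair only needs to be scored once, so N lists produce N(N – 1)/2 scores.

```python
from itertools import combinations


def jaccard(x, y):
    """Jaccard index of two sets as a fraction from 0.0 to 1.0."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def pairwise_similarity(surname_sets):
    """Yield (match_a, match_b, score) once for every unordered pair of matches."""
    for (a, set_a), (b, set_b) in combinations(surname_sets.items(), 2):
        yield a, b, jaccard(set_a, set_b)


# Hypothetical matches and surnames, for illustration only.
matches = {
    "match_1": {"Smith", "Jones", "Kelly"},
    "match_2": {"Smith", "Jones", "Ryan"},
    "match_3": {"Garcia"},
}

scores = list(pairwise_similarity(matches))
for a, b, score in scores:
    print(f"{a} vs {b}: {score:.0%}")

n = len(matches)
print(len(scores) == n * (n - 1) // 2)  # True: 3 lists give 3 pairs
```

A plain loop like this is fine for a few thousand lists; beyond that, the faster approximate methods alluded to above would be worth a look.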

Note that for my current purposes, I am using unique surnames. If one match has entered father, grandfather and great-grandfather John Smith, his list has Smith represented once. This is to simplify data collection and computation.

Note also that for my current purposes, the direction of surnames is unimportant. Match #1 may have a two-person tree with Mary Smith as the mother of Bob Jones, while Match #2 has Anne Jones as the mother of Bob Smith. That is “Smith->Jones” and “Jones->Smith”. If I include direction, these lists are different. I am treating the lists as a “bag of words”, where direction is not important – so these two lists “Jones, Smith”, and “Smith, Jones” are the same. This is to simplify data collection and computation.
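Both simplifications fall out naturally if each list is stored as a set. The sketch below assumes, for illustration only, that each ancestor is a “Given Surname” string and that the surname is the last word; real tree data is messier than that, and `surname_set` is a hypothetical helper name.

```python
def surname_set(ancestors):
    """Collapse 'Given Surname' strings into a set of unique surnames.

    Assumption: the surname is the last whitespace-separated word.
    """
    return {name.split()[-1] for name in ancestors if name.strip()}


# Three generations of John Smith contribute the surname once.
print(surname_set(["John Smith", "John Smith", "John Smith"]))  # {'Smith'}

# Direction is ignored: Smith->Jones and Jones->Smith give the same set.
match_1 = surname_set(["Bob Jones", "Mary Smith"])
match_2 = surname_set(["Bob Smith", "Anne Jones"])
print(match_1 == match_2)  # True
```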

Two caveats must be considered with the Jaccard Index. One is that it can be erroneous for small sample sizes, so I intend to exclude small trees.
The other problem for the index is when there are missing observations in the data sets. It’s safe to say that most of my lists have missing observations, as I’m not drawing from a sample of relatives with perfect trees to four generations. The trees tend to be ragged i.e. people know more about one branch than another.
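To make both caveats concrete, here is a small sketch. The cut-off of 8 surnames is a made-up value (I have not settled on a threshold here), and the surnames are placeholders. The second part shows how a ragged tree drags the index down: two people with the same ancestry, one of whom has only filled in half of it, score just 50%.

```python
MIN_SURNAMES = 8  # hypothetical cut-off; no threshold is stated in the post


def drop_small_trees(surname_sets, minimum=MIN_SURNAMES):
    """Keep only matches whose surname list has at least `minimum` entries."""
    return {m: s for m, s in surname_sets.items() if len(s) >= minimum}


def jaccard(x, y):
    """Jaccard index of two sets as a fraction from 0.0 to 1.0."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


# Placeholder surnames: the same family, but one tree is only half filled in.
complete = {f"Surname{i}" for i in range(16)}
ragged = {f"Surname{i}" for i in range(8)}
tiny = {"Surname0", "Surname1"}

kept = drop_small_trees({"complete": complete, "ragged": ragged, "tiny": tiny})
print(sorted(kept))                        # ['complete', 'ragged']
print(f"{jaccard(complete, ragged):.0%}")  # 50%
```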

How Many of My Ancestry Matches have Identical Trees copied between Accounts?

Something I find mildly disappointing with a new DNA match on Ancestry is when I realize that I’ve seen their tree before. The exact same identical tree. I’ll recognize this is the case if I’ve spent some considerable time studying the tree in a previous encounter from an earlier DNA match.

What’s happening is that someone is managing multiple accounts, and Ancestry provides the facility to copy the same tree across accounts. Suppose you match to two kits that have been assigned the same tree. When you click on “View Full Tree” from the pedigree page of both matches, the same tree URL is opened.

The reason I get disappointed is that I’m not going to glean much new insight from the DNA match, other than an assumption that they are closely related to whoever they share a tree with. But how often am I likely to stumble across this phenomenon?

In May 2018 I took a snapshot of information across all my DNA matches, and this included the tree URLs. I used techniques described here, but I think that the DnaGedcom utility also retrieves the tree URL to spreadsheets. The data allows me to do a little data mining on my snapshot of tree URLs.

Question: How Many of my Ancestry Matches have Identical Trees copied between Accounts?

Approximately six thousand of my nine thousand DNA matches have an available tree with at least one visible entry. (The 6K excludes those pesky trees with everybody marked as private). 428 of those 6K have an identical tree with another account. Those 428 matches “should” have 428 trees, but they account for a total of 191 trees instead.
That is about 7% of my matches with available trees.
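The counting itself is simple once each match is paired with its tree URL. The sketch below uses invented match names and `example.com` URLs rather than my real snapshot, and the variable names are my own; it just shows the shape of the calculation, including the size of each group of matches sharing a tree.

```python
from collections import Counter

# Hypothetical snapshot: match name -> tree URL (None means no visible tree).
tree_urls = {
    "match_a": "https://example.com/tree/1",
    "match_b": "https://example.com/tree/1",
    "match_c": "https://example.com/tree/2",
    "match_d": "https://example.com/tree/3",
    "match_e": "https://example.com/tree/3",
    "match_f": "https://example.com/tree/3",
    "match_g": None,
}

# How many matches point at each tree URL.
url_counts = Counter(url for url in tree_urls.values() if url)
shared = {url: n for url, n in url_counts.items() if n > 1}

matches_with_tree = sum(url_counts.values())  # matches with an available tree
matches_sharing = sum(shared.values())        # matches whose tree is also used elsewhere
distinct_shared_trees = len(shared)           # trees those matches actually account for
group_sizes = Counter(shared.values())        # group size -> number of groups

print(matches_sharing, distinct_shared_trees)          # 5 2
print(f"{matches_sharing / matches_with_tree:.0%}")    # 83%
print(dict(group_sizes))                               # {2: 1, 3: 1}
```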

It’s not a lot, but it does have some impact. I wrote before on the number of my matches with available trees. The charts I presented in that post are still correct, but the “usefulness” of a subset of those matches is reduced.

One caveat to these numbers: when a match has not linked their tree but has multiple unlinked trees, my snapshot examined the first tree in their list and ignored any others.

Question: Just How Big does this Tree-Sharing Get?

Within the subset of kits sharing a tree, the vast majority are in pairs, i.e. two matches sharing a tree.
The highest number of matches sharing the same tree is 8. Octuplets? Well, they all have the same surname.

Here’s the distribution:

The Top 10 Ancestral Surnames across my Ancestry DNA Matches Surprised Me

In May 2018 I downloaded the direct line ancestral surnames of all my DNA matches at Ancestry. I discussed some statistical analysis on the numbers in a previous post.

I then conducted some exploratory data mining of the distribution of surnames across my matches. My paternal line is African (not African-American), and I have very few paternal matches on Ancestry. My maternal line is Irish, and I expected the usual suspects of popular Irish surnames to appear in my own top 10 list. I’m talking Murphy, Ryan, Kelly and the like. I could imagine a pattern where my emigrant ancestors landed in the traditional “Irish” enclaves of New York or Boston, and married exclusively within the cohort of people they met from the old country. If their American descendants followed suit, then the distribution of surnames across my matches should skew Irish.

Data mining is about asking questions of the data, so here is the Q and A.

Question: What are the Top 10 Surnames across my Matches?

Before I crunched the numbers, my educated guess for Number 1 was Smith, which can be of Anglo-Saxon origin or of Irish origin. My maternal great-grandfather was a Smith. At least one of his sisters married a Smith. And it seems that I see Smiths everywhere I look among my match trees. Well, I wasn’t wrong on Number One, but the next 9 surprised me. Here is the distribution of the top 10 surnames across my Ancestry matches.

I do not recognize any of the next nine surnames from my known direct lineage.
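For anyone wanting to reproduce this kind of ranking, a minimal sketch follows. It assumes each match has already been reduced to a set of unique surnames and that the ranking counts how many matches carry each surname; the data and the `top_surnames` name are invented for illustration.

```python
from collections import Counter


def top_surnames(surname_sets, n=10):
    """Rank surnames by the number of matches whose list contains them."""
    counts = Counter()
    for surnames in surname_sets.values():
        counts.update(surnames)  # a set, so each match adds a surname at most once
    return counts.most_common(n)


# Hypothetical matches, for illustration only.
matches = {
    "match_1": {"Smith", "Jones", "Kelly"},
    "match_2": {"Smith", "Jones"},
    "match_3": {"Smith"},
}

print(top_surnames(matches, n=2))  # [('Smith', 3), ('Jones', 2)]
```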

Question: Is my top 10 Surname Distribution typical of Ireland (north and south)?

So I compared my distribution to a paper published by Sean J Murphy titled “A Survey of Irish Surnames 1992-97”. Murphy presents the top surnames on the entire island of Ireland (i.e. including Northern Ireland) based on data he gathered from 1992-1997. It’s a great read for anyone interested in Irish lineage. Because my Irish ancestors are predominantly Ulster, I’m working with Murphy’s numbers instead of recent data from the Irish Central Statistics Office, as theirs does not include Northern Ireland. Murphy provides raw numbers in tabular format, but I’ve plotted the distribution to allow a broad side-by-side comparison to my own:

Clearly the answer is no, the distribution across my matches doesn’t look similar to the distribution of surnames on the island of Ireland.

Question: Is my top 10 Surname Distribution typical of British surnames in Ireland?

I wouldn’t have thought of this without reading Murphy’s paper which has a section on “British Surnames in Ireland”. I have not established a genealogical link of my ancestors to British origin. But my distribution sure does look more similar to this breakdown than to the Irish one.

Before I draw any inferences, I’ll ask another question.

Question: Is my top 10 Surname Distribution typical of the United States?

I then compared my distribution to the USA census figures. I chose the 2000 census because it is a similar time frame to the Irish numbers. It might be better to use earlier census figures: because living people are private, the available names in trees are likely to represent an earlier time frame. But it’s all very approximate, and I’m only interested in the broad distribution, which is certainly closer to mine than the Irish distribution.

The names “Garcia” and “Rodriguez” jump out at me because I’m not familiar with them. Checking my data, I have a sum total of 17 matches with a direct ancestry to Garcia or Rodriguez. My understanding of American demographics is that the last few decades will have pushed Hispanic surnames upward in frequency. So I narrowed the census numbers to filter on respondents who identified as white European lineage.

So after all that, I can see that the distribution of surnames across my Ancestry matches is closest to the USA white population.

Of course, the vast majority of Ancestry customers are American. However, if there were a very high tendency among my own emigrant ancestral relatives to keep themselves to themselves and to marry only within the typical Irish communities, then my top 10 distribution would surely be closer to the top 10 Irish surnames within the USA census. (I’m using the Ancestry blog as my source for these next numbers – they are using the 2000 census but I’m not sure how they got this particular filter).

So I don’t have any of the Irish American top 10 names amongst my top list. To be fair, I only have to go to #13 in my own distribution to hit “Murphy”, and I have “Kelly” at #18. But those two names are the only “Irish-American” names in my top 30. (Smith is the awkward outlier. It’s my personal #1 and I’m sure it’s being excluded, but it can also be of Irish origin).

To Summarize:

Before I did this analysis I assumed that I’d have a higher proportion of “Irish” names across my matches. My Irish emigrant ancestors appear not to have married only other Irish descendants; instead, they tended to marry within the local population of European heritage.