Measuring Genealogical Similarity using the Jaccard Index

For some of the posts on this blog I’ll be using one way to measure the similarity of two sample sets of data. The statistic is called the Jaccard Index, or the Jaccard Similarity Coefficient. This post is a technical explanation of the calculation itself.

The sets of data are the unique ancestral surnames of my DNA matches. The question I’m asking for any two of my matches is: how similar are their lists of direct ancestral surnames?

If two lists of unique surnames are identical, they will have the exact same surnames. They will also have the same number of surnames in their lists, as each surname is only represented once regardless of how many times it appears in the direct tree. They will be 100% similar.

However, I’m also interested in trees that are “nearly” the same. Suppose two siblings create separate trees, and both get as far as all their great great grandparents. Tom’s research leads him to one pair of 3rd greats, and Joe finds a different pair. Neither are aware yet of the other’s research, but both have one extra maiden name each in their trees. Those lists will be very similar, and I’d like to highlight their similarity in some way.

So I need a way of defining the “similarity” of two lists of surnames. The Jaccard Similarity Index compares two sets (or lists) to see which members (surnames) are shared and which are different. It calculates the percentage of similarity from 0 to 100%. The math is pretty simple, and is described here in understandable terms.

In the simplest terms, we count the intersection of the lists i.e. the number of surnames common to both trees. We count the differences for each side, and we count the total number of surnames in all. The Jaccard index expresses this mathematically as:

J(X,Y) = |X∩Y| / |X∪Y| or (|X∩Y| / |X| + |Y| – |X∩Y|

Taking our two brothers, Tom and Joe:
|X∩Y is the number of shared surnames: 8 for the brothers.
|X| is the length of the set, or the number of surnames for Tom’s tree: 9.
|Y| is the length of the set, or the number of surnames for Joe’s tree: also 9.

So our equation is: 8 / (9 + 9 – 8) * 100 = 80% similarity for our brothers.

If brother’s had exactly the same trees, they’d be 100% similar. If the postman’s tree had no overlapping surnames with the brothers, his index compared to both would be 0%.

So the ultimate task is to compare every surname list within my matches with every every other surname list. As the Jaccard index only works on two sets at a time, to calculate the similarity across N sets requires N squared calculations.
This becomes unfeasible for large numbers of sets, and there are other methods that can be brought into play to reduce processing time. I had about 4.4 million pairs of sets to compare, which took a matter of hours to complete.

Note that for my current purposes, I am using unique surnames. If one match has entered father, grandfather and great-grandfather John Smith, his list has Smith represented once. This is to simplify data collection and computation.

Note also that For my current purposes, the direction of surnames is unimportant. Match #1 may have a two-person tree with Mary Smith as the mother of Bob Jones, while Match #2 has Anne Jones as the mother of Bob Smith. That is “Smith->Jones” and “Jones->Smith”. If I include direction, these lists are different. I am treating the lists as a “bag of words”, where direction is not important – so these two lists “Jones, Smith”, and “Smith, Jones” are the same. This is to simplify data collection and computation.

Two caveats must be considered with the Jaccard Index. One is that it can be erroneous for small sample sizes, so I intend to exclude small trees.
The other problem for the index is when there are missing observations in the data sets. It’s safe to say that most of my lists have missing observations, as I’m not drawing from a sample of relatives with perfect trees to four generations. The trees tend to be ragged i.e. people know more about one branch than another.

A brief technical overview of data mining dna matches on Ancestry

The purpose of this blog is to present some analysis of the available data from several DNA testing companies for one or more DNA kits. This post is a high-level description of how the data is retrieved and analysed from AncestryDNA.

AncestryDNA presents “Match List Pages” with each match as a row displaying the user name and a URL link. There are 50 matches per page. A kit owner may have hundreds of Match List Pages. We navigate through the pages by next/previous buttons at the top and bottom of the page. We can also jump to a specific page number by manipulating the numerals at the end of the Match List Page URL. This allows us to iterate through our pages by incrementing the URL page number by 1 until we come to a page that tells us there are no more matches to be displayed. Note that the page doesn’t throw an error, which is nice for programmatic reasons.

no matches found

A specific match has a number of attributes that are viewable from its “Match Page”, which can be opened by clicking on the match link on the “Match List Page”. The “Match Page” can also be opened directly (when logged in, of course) by entering the address of the specific URL in the browser.

The “Match Page” has three tabs with data of interest (tree, shared matches, geography). It also has attributes available from pop-up windows. The tabbed displays are loaded at run-time i.e. the list of shared matches are not available when the Match Page opens to the default Tree tab. These factors ensure that viewing all the attributes of interest require a number of clicks, plus some latency in waiting for the new information to be displayed.

These are some of the attributes of interest:

  • How many centimorgans and segments is the match?
  • Does the match have a tree?
  • If the match has a tree, is it public?
  • If the match has a tree, is the kit linked to the tree?
  • If the match has a public tree, then the list of direct ancestor names displayed in the first tab becomes an attribute of interest.
  • If there are shared matches, then the “Shared Match” tab can be selected to display the list of shared matches.

To collect the attributes of interest for every match of a DNA kit, we simply have to:

open the “first” Match List Page

REPEAT

FOR all the rows (match links) on the Match List page

open the Match Page

collect the attributes of interest (needs a few clicks)

open the “next” Match List Page

UNTIL there are no more Match List Pages

When I first used Ancestry, I started to do this manually for my 4th cousin matches. After a very tedious Sunday of clicks, I looked online for an automated solution. I found several utilities and browswer add-ons that will do the job and produce spreadsheets of attributes. They’re very good, but didn’t quite suit my purposes. So I devised my own process to deliver the same basic functionality.

Technically, I use Python scripts and a number of Python libraries that facilitate the interaction with web pages. I load my data to SQL and NoSQL data technologies to facilitate data analysis. I also use R for data analysis and presentation.

The add-ons and utilities would not be needed for Ancestry if the company would provide users with the facility to download their match data to file. They don’t seem to intend to provide this in the foreseeable future.

If any of us were to embark on manually retrieving all our match data from Ancestry, it would take weeks of clicking and copying.  The utilities (mine and others) also take time to complete. The common factor is the number of matches that we have, but I expect most users will need many hours for any utility to complete.