A brief technical overview of data mining dna matches on Ancestry

The purpose of this blog is to present some analysis of the available data from several DNA testing companies for one or more DNA kits. This post is a high-level description of how the data is retrieved and analysed from AncestryDNA.

AncestryDNA presents “Match List Pages” with each match as a row displaying the user name and a URL link. There are 50 matches per page. A kit owner may have hundreds of Match List Pages. We navigate through the pages by next/previous buttons at the top and bottom of the page. We can also jump to a specific page number by manipulating the numerals at the end of the Match List Page URL. This allows us to iterate through our pages by incrementing the URL page number by 1 until we come to a page that tells us there are no more matches to be displayed. Note that the page doesn’t throw an error, which is nice for programmatic reasons.

no matches found

A specific match has a number of attributes that are viewable from its “Match Page”, which can be opened by clicking on the match link on the “Match List Page”. The “Match Page” can also be opened directly (when logged in, of course) by entering the address of the specific URL in the browser.

The “Match Page” has three tabs with data of interest (tree, shared matches, geography). It also has attributes available from pop-up windows. The tabbed displays are loaded at run-time i.e. the list of shared matches are not available when the Match Page opens to the default Tree tab. These factors ensure that viewing all the attributes of interest require a number of clicks, plus some latency in waiting for the new information to be displayed.

These are some of the attributes of interest:

  • How many centimorgans and segments is the match?
  • Does the match have a tree?
  • If the match has a tree, is it public?
  • If the match has a tree, is the kit linked to the tree?
  • If the match has a public tree, then the list of direct ancestor names displayed in the first tab becomes an attribute of interest.
  • If there are shared matches, then the “Shared Match” tab can be selected to display the list of shared matches.

To collect the attributes of interest for every match of a DNA kit, we simply have to:

open the “first” Match List Page

REPEAT

FOR all the rows (match links) on the Match List page

open the Match Page

collect the attributes of interest (needs a few clicks)

open the “next” Match List Page

UNTIL there are no more Match List Pages

When I first used Ancestry, I started to do this manually for my 4th cousin matches. After a very tedious Sunday of clicks, I looked online for an automated solution. I found several utilities and browswer add-ons that will do the job and produce spreadsheets of attributes. They’re very good, but didn’t quite suit my purposes. So I devised my own process to deliver the same basic functionality.

Technically, I use Python scripts and a number of Python libraries that facilitate the interaction with web pages. I load my data to SQL and NoSQL data technologies to facilitate data analysis. I also use R for data analysis and presentation.

The add-ons and utilities would not be needed for Ancestry if the company would provide users with the facility to download their match data to file. They don’t seem to intend to provide this in the foreseeable future.

If any of us were to embark on manually retrieving all our match data from Ancestry, it would take weeks of clicking and copying.  The utilities (mine and others) also take time to complete. The common factor is the number of matches that we have, but I expect most users will need many hours for any utility to complete.

4 thoughts on “A brief technical overview of data mining dna matches on Ancestry”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.