Ten Months of Growth on AncestryDNA

In July 2017 I reviewed all my DNA matches on AncestryDNA by recording the shared CM number of every match. I repeated the exercise in February 2018 and again in early May 2018. I used techniques I describe in this blog post.

I was particularly interested in:

(1) the rate of growth of matches over this 10 month period of heavy marketing by the company

(2) the distribution of matches by shared CM for each given period e.g. my total number of matches below 10 CM, between 10 and 20 CM, and above 20 CM

(3) whether the distribution of matches stayed similar during this period of growth e.g. are the growing numbers solely due to tiny 6 CM matches

(4) what was the effect of the opt-out policy introduced in November 2017 e.g. did I lose a lot of matches who opted out

This blog post looks in detail at trends within my personal data.

(1) the rate of growth of matches over this 10 month period

Before I break down the matches by centimorgan, here is the total number of matches at each snapshot in time: 5,100 in July 2017, 7,494 in February 2018 and 9,026 in May 2018.

American or Irish readers may think my actual numbers are pretty small. My heritage is half Irish and half African, and so I have very few matches on one side. Those with two Irish parents and grandparents may have twice my totals, while U.S. readers may have much higher totals than mine.

But it’s the trend that is of particular interest: my total matches grew by about 77% across a ten-month period!

It would have been nice to have evenly spaced snapshots e.g. quarterly, but the results do show that the rate of growth accelerated in 2018. I assume the reason is increased marketing spend resulting in more people testing. I’m not aware of changes in matching methods in 2018 which might have increased matches among the existing sample base.
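The acceleration can be checked with a little arithmetic. This is a minimal Python sketch using the three totals quoted above. The gaps of 7 and 3 months are approximations, since the exact snapshot days are not stated, and the function name is only illustrative.

```python
# Snapshot totals quoted in the post. The month gaps are assumptions:
# roughly 7 months from July 2017 to February 2018, then roughly 3 months
# to early May 2018.
snapshots = [("Jul 2017", 5100), ("Feb 2018", 7494), ("May 2018", 9026)]
months_between = [7, 3]


def growth_rates(snapshots, months_between):
    """Return (period label, matches gained, gain per month) for each gap."""
    rows = []
    for (label_a, total_a), (label_b, total_b), months in zip(
        snapshots, snapshots[1:], months_between
    ):
        gained = total_b - total_a
        rows.append((f"{label_a} to {label_b}", gained, gained / months))
    return rows


rates = growth_rates(snapshots, months_between)
first_total, last_total = snapshots[0][1], snapshots[-1][1]
overall_pct = 100 * (last_total - first_total) / first_total

for period, gained, per_month in rates:
    print(f"{period}: +{gained} matches, about {per_month:.0f} per month")
print(f"Overall growth: {overall_pct:.0f}%")
```

Under those assumed gaps, the first period works out at about 342 new matches per month and the second at about 511.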

Those of you who have been using Ancestry for a while will remember that once upon a time the display showed the total number of matches, but that has been removed. There is a way of calculating the number without any coding, which I documented in this blog post.

(2) the distribution of matches by shared CM for each given snapshot

So this growth is impressive at first glance, but may be fairly useless if just one of the roughly 4,000 extra matches is a 4th cousin or closer, and the rest are lowly 6.0 CM matches. I am particularly interested in the number of matches broken down by CM, which Ancestry does not provide via the website. I used techniques documented here to record the CM of all my matches to one decimal place. But knowing the number of matches at 6.9 CM versus 6.8 CM isn’t of particular use. Instead, I’m going to roll up the matches into four ranges:

  • 6 CM to 6.9 CM
  • 7 CM to 9.9 CM
  • 10 CM to 19.9 CM
  • 20 CM and over

This is also known as segmenting the data, and I would more naturally refer to these ranges as segments, but I don’t want to cause confusion with the genetic meaning of “segment”.
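Rolling up the CM values into the four ranges could be sketched in Python as follows. The function name and the sample values are hypothetical and purely for illustration; they are not real match data.

```python
from collections import Counter


def cm_range(cm):
    """Map a shared-CM value to one of the four ranges listed above."""
    if cm < 7:
        return "6 to 6.9"
    if cm < 10:
        return "7 to 9.9"
    if cm < 20:
        return "10 to 19.9"
    return "20 and over"


# Hypothetical shared-CM values, for illustration only.
sample_cms = [6.0, 6.9, 7.0, 9.9, 10.0, 19.9, 20.0, 45.3]

# Count how many matches fall into each range.
counts = Counter(cm_range(cm) for cm in sample_cms)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Comparing against the upper bound of each range in ascending order keeps the boundaries unambiguous, so a 9.9 CM match lands in the lower-middle range and a 10.0 CM match in the upper-middle one.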

Why these ranges? It allows me to separate “the wheat from the chaff” in terms of how confident I can be that

(a) the match is genuine instead of due to random factors of DNA inheritance and

(b) I have a reasonable chance of investigating the common ancestor (easier for 3rd cousins than for genuine 4th cousins).

20 CM and over represents approximately the relatedness of fourth cousin and closer, with a high degree of confidence. These are the “lowest hanging fruit” i.e. the matches I’m most likely to focus on first to try to establish the precise family tree relationship. I can also be confident that these are genuine matches and unlikely to include false positives due to the random nature of DNA inheritance and recombination.

Blaine Bettinger has a good article on the danger zones below 20 CM based on his own results, while this ISOGG article is based on wider academic research. The conclusion is that matches from 10 CM and upward have a 94% probability of being genuine.

My lowest range covers every match below 7 CM. These are the matches we can be least sure are genuine DNA matches as opposed to coincidence. Ancestry doesn’t show matches below 6.0 CM anyway.

There is a grey area between 7 and 10 CM where the certainty of a match is above 50% but below 94%. Users with hundreds of fourth cousin matches may choose to ignore the 7-10 CM range. Those of us with fewer matches may pay more attention to this range while recognizing that a significant percentage will not be useful. Therefore I’ve defined an “upper-middle” range of 10-19.9 CM and a “lower-middle” range of 7-9.9 CM.

Here are the raw numbers:

And here are those numbers in a bar chart:

Clearly all ranges are trending upwards. My “20 CM and over” rose from 55 to 87 to 91. A few months later, though I haven’t done full analysis, I was delighted to tip over the 100 mark. My next zone of interest, “10 to 19.9 CM” rose from 893 to 1,331 to 1,585.

(3) whether the distribution of matches stayed similar during this period of growth

The next figure shows the breakdown by % of total matches. What is immediately clear is that the percentages remain similar through the growth period. That is no great surprise. One scenario I can think of where the close CM percentage might increase is if I (or a close cousin) encouraged a large number of family members to test in the same time period, which could cause a spike at the higher CM level. But as distant relatives would be behaving similarly, it would even out over time.

The percentages for “20 CM and above” are a little hard to read in the graphic, because the range is proportionally small. They read: 1.1%, 1.2%, 1.0%.
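Those percentages follow directly from the counts quoted earlier. A minimal Python sketch of the calculation, using only the totals and the two upper ranges that are stated in the text (the variable and function names are illustrative):

```python
# Counts quoted in the post for each snapshot.
totals = {"Jul 2017": 5100, "Feb 2018": 7494, "May 2018": 9026}
over_20 = {"Jul 2017": 55, "Feb 2018": 87, "May 2018": 91}
from_10_to_20 = {"Jul 2017": 893, "Feb 2018": 1331, "May 2018": 1585}


def share(counts, totals):
    """Percentage of total matches per snapshot, rounded to one decimal."""
    return {k: round(100 * counts[k] / totals[k], 1) for k in totals}


pct_over_20 = share(over_20, totals)
pct_10_to_20 = share(from_10_to_20, totals)

for snapshot in totals:
    print(f"{snapshot}: 20 CM and over {pct_over_20[snapshot]}%, "
          f"10 to 19.9 CM {pct_10_to_20[snapshot]}%")
```

The “10 to 19.9 CM” range works out at 17.5%, 17.8% and 17.6% across the three snapshots, which shows the same stability as the top range.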

(4) the effect of the opt-out policy introduced in November 2017

Finally, I want to address the impact of the introduction of the opt-out clause in November 2017. There were concerns at the time in the blogosphere that matches might drop off a cliff as people rushed to hide themselves. The short answer is that according to my results, there is no need to be concerned.

I have documented the results in a separate blog post as it may be of more general interest than the detailed numbers presented in this blog.

Ancestry Opt-Outs are in low numbers

Ancestry introduced a new privacy policy in November 2017 that allowed customers to opt out of DNA matching. Customers who opt out receive their heritage breakdown, but do not see their DNA matches. They also are not shown on the match lists of any other customers.

Prior to this change, customers who wished to disengage would have to delete their DNA completely.

When the change was first announced, there was some concern among Ancestry users that the number of shared matches would reduce significantly if many existing customers chose to opt out. Then users noticed “missing” matches when they went to review past notes and manual lists. It didn’t help that there was also erratic behaviour of the Surname/Location Search functionality.

It’s difficult to be sure about the impact without snapshot lists of matches before and after the change. Unfortunately Ancestry does not provide the facility to download matches to a spreadsheet, unlike some other companies.

In July 2017 I personally recorded details about all my DNA matches on AncestryDNA. I repeated the exercise in February 2018 and again in May 2018. I used techniques I describe here.

It’s just luck that one of my snapshots was before the change, but it allows me to do some analysis of the impact on my own results of people choosing to disengage from matching.

So, how many matches were in my snapshot of July 2017 that *were not* in my snapshot of February 2018?

A grand total of five matches disappeared.

I don’t know if they opted out or if they completely deleted their DNA and closed their accounts. However none of these five had reappeared by May 2018.
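Comparing two snapshots in this way is a set difference. A minimal Python sketch, assuming each match carries an identifier that stays stable between snapshots; the identifiers below are hypothetical and for illustration only.

```python
def disappeared(earlier, later):
    """Matches present in the earlier snapshot but absent from the later one."""
    return set(earlier) - set(later)


# Hypothetical match identifiers, for illustration only.
july_2017 = {"match-001", "match-002", "match-003", "match-004"}
february_2018 = {"match-001", "match-003", "match-004", "match-005"}

lost = disappeared(july_2017, february_2018)
gained = disappeared(february_2018, july_2017)

print("Disappeared:", sorted(lost))
print("New:", sorted(gained))
```

Swapping the arguments gives the newly gained matches, so the same function answers both questions.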

I wrote a separate blog post with full analysis and breakdown of my total numbers. As I gained 2,394 matches over the same time period, the opt-out number is insignificant. As that post also shows that over 80% of my matches were below 10 CM, statistically these few opt-outs are most likely to be low CM. I could have been really unlucky that one or two of those matches happened to be close, but I happen to know that they were all 10 CM or below.

Between February and May 2018 a further two people disappeared from my match list.

Again, they were of minor CM.

I acknowledge that the matches that disappeared could have been very close matches for some existing customers seeking to solve mysteries. The impact to those customers is of course highly significant. But overall, this is not a worrying factor for me in my continuing use of AncestryDNA.

How to count your total number of matches on AncestryDNA

Those of you who have been using Ancestry for a while will remember that once upon a time the display showed the total number of matches, but that has been removed. There is a way of calculating the number using the website alone, which just requires a bit of methodical clicking to identify the last page of your results.

Let’s say you identify the last page as page number 183 which shows a shortened list of only 12 matches. We know that there are 50 matches displayed on a full page, so you have (182*50) + 12 matches at that point in time i.e. 9,112 matches.

So how do you identify the last page?

You make use of the box at the top right of the page that allows you to jump to a specific page number.

If you enter a page number that is beyond the number of matches in your list, you will be shown this display:

So the trick is to find the page number directly before the lowest page which shows “nothing found”. It’s not as complicated as it reads.

The first thing to do is to enter a guesstimate number that is probably past your number of matches e.g. 300 (this would require 15,000 matches not to display the empty page).

What you do next depends on whether you get results or not on this candidate as the last page. I’ll describe both scenarios as a blow-by-blow.

(A) I see matches on page 300

If you actually get results, then keep entering an increased number in increments of 50 (an arbitrary choice) until you get the “no matches found” display.

Let’s say you entered page 350 and get a full list of matches, but see “no matches found” on page 400. You know that somewhere between 350 and 400 is the last page of matches.

At this point you could work backwards from 400 in decrements of 1 until you find the highest page that shows matches. As that could potentially be a tedious 49 clicks, you can use a technique known as the binary chop: drop down from 400 by half your current gap i.e. to page 375 (half of 50). If no matches are found, drop down by a further half (half of half of 50) i.e. to 363, and then to 357, then 353, and finally 351. At some point in this sequence of at most 5 clicks, you’ll hit a page that has either a full list or a shortened list. If it’s the shortened list, you’ve found the last page (you can verify by hitting the next button to see “no matches found”). If you hit a full list, the last page lies between that page and the lowest empty page you have seen, so keep halving within that narrower gap, this time moving upwards, until you reach the shortened list.

(B) I see no matches on page 300

Keep entering page numbers in decrements of 50 (an arbitrary choice) until you see a page with matches.

If it’s a shortened list, then quickly verify that it is indeed the last page by using the next button to go up by 1 page.

If it’s a full page then you could at this point work upwards in increments of 1 until you find the first page that shows no matches. That is potentially 49 clicks which is tedious. You can use a technique known as the binary chop to save time.

Let’s say you dropped from page 250 to 200 where you found a full list of 50 matches. Go up by half your current gap i.e. to page 225. If a full list is shown, then go up again by a further half (half of half of 50) i.e. to 237, and then to 243, then 246, and finally 249. Eventually you’ll hit a page that has no matches or a shortened list. If it’s the shortened list, then bingo, you’ve found the last page (you can verify by hitting the next button to see “no matches found”). If you hit a page with no matches, the last page lies between the highest full page you have seen and that empty page, so keep halving within that narrower gap, this time moving downwards, until you reach the shortened list.
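For readers who think in code, both scenarios can be written as one routine. This is a minimal sketch of a standard binary search, not a tool that talks to Ancestry: has_matches is a hypothetical stand-in for “jump to this page and see whether any matches are listed”, and the starting guess of 300 and step of 50 are the arbitrary values used above.

```python
def find_last_page(has_matches, guess=300, step=50):
    """Return the highest page number for which has_matches(page) is True.

    has_matches is a hypothetical callable that answers whether a given
    page shows any matches. Returns 0 if even page 1 is empty.
    """
    # Phase 1: bracket the answer between a page with matches (low) and a
    # page without matches (high), stepping as in scenarios (A) and (B).
    if has_matches(guess):
        low, high = guess, guess + step
        while has_matches(high):
            low, high = high, high + step
    else:
        low, high = max(guess - step, 0), guess
        while low > 0 and not has_matches(low):
            low, high = max(low - step, 0), low

    # Phase 2: the binary chop. Page `low` has matches (or is 0) and page
    # `high` has none, so halve the gap until the two are adjacent.
    while high - low > 1:
        mid = (low + high) // 2
        if has_matches(mid):
            low = mid
        else:
            high = mid
    return low


# Simulated match list whose last page is 183, for illustration only.
last_page = find_last_page(lambda page: 1 <= page <= 183)
print(last_page)
```

The routine narrows the gap from both ends, which is what the manual instructions amount to once you change direction after overshooting.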

And that’s it! Now just use the calculation: ((last page – 1) * 50) + remainder e.g. (182 * 50) + 12 = 9,112
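The same calculation as a small Python function, in case you want to keep a running log of your counts. The function and parameter names are illustrative; the 50 matches per page comes from the description above.

```python
def total_matches(last_page, matches_on_last_page, per_page=50):
    """Total matches given the last page number and how many rows it shows."""
    if last_page < 1 or not 1 <= matches_on_last_page <= per_page:
        raise ValueError("last page and its row count must be in range")
    # Every page before the last is full; the last page holds the remainder.
    return (last_page - 1) * per_page + matches_on_last_page


total = total_matches(183, 12)
print(total)  # 9112
```

If your total happens to be an exact multiple of 50, the last page is a full one, so pass 50 as the remainder.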

Then make a note somewhere of what the actual last page is, so if you want to do another count in the future you can add an arbitrary gap and work back down with the binary chop.

A brief technical overview of data mining DNA matches on Ancestry

The purpose of this blog is to present some analysis of the available data from several DNA testing companies for one or more DNA kits. This post is a high-level description of how the data is retrieved and analysed from AncestryDNA.

AncestryDNA presents “Match List Pages” with each match as a row displaying the user name and a URL link. There are 50 matches per page. A kit owner may have hundreds of Match List Pages. We navigate through the pages by next/previous buttons at the top and bottom of the page. We can also jump to a specific page number by manipulating the numerals at the end of the Match List Page URL. This allows us to iterate through our pages by incrementing the URL page number by 1 until we come to a page that tells us there are no more matches to be displayed. Note that the page doesn’t throw an error, which is nice for programmatic reasons.

no matches found

A specific match has a number of attributes that are viewable from its “Match Page”, which can be opened by clicking on the match link on the “Match List Page”. The “Match Page” can also be opened directly (when logged in, of course) by entering the address of the specific URL in the browser.

The “Match Page” has three tabs with data of interest (tree, shared matches, geography). It also has attributes available from pop-up windows. The tabbed displays are loaded at run-time i.e. the list of shared matches is not available when the Match Page opens to the default Tree tab. These factors ensure that viewing all the attributes of interest requires a number of clicks, plus some latency in waiting for the new information to be displayed.

These are some of the attributes of interest:

  • How many centimorgans and segments is the match?
  • Does the match have a tree?
  • If the match has a tree, is it public?
  • If the match has a tree, is the kit linked to the tree?
  • If the match has a public tree, then the list of direct ancestor names displayed in the first tab becomes an attribute of interest.
  • If there are shared matches, then the “Shared Match” tab can be selected to display the list of shared matches.

To collect the attributes of interest for every match of a DNA kit, we simply have to:

    open the “first” Match List Page
    REPEAT
        FOR all the rows (match links) on the Match List Page
            open the Match Page
            collect the attributes of interest (needs a few clicks)
        open the “next” Match List Page
    UNTIL there are no more Match List Pages
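The same loop could be sketched in Python along these lines. This is only an outline of the control flow under stated assumptions: fetch_match_links and fetch_attributes are hypothetical callables standing in for whatever actually opens the pages, and nothing here reflects Ancestry’s real URLs or page structure.

```python
def collect_all_matches(fetch_match_links, fetch_attributes):
    """Walk every Match List Page and collect attributes for each match.

    Both arguments are hypothetical callables:
      fetch_match_links(page_number) -> list of match links, [] when no more
      fetch_attributes(link)         -> dict of attributes for that match
    """
    results = []
    page_number = 1  # the "first" Match List Page
    while True:
        links = fetch_match_links(page_number)
        if not links:  # UNTIL there are no more Match List Pages
            break
        for link in links:  # FOR all the rows on the Match List Page
            results.append(fetch_attributes(link))
        page_number += 1  # open the "next" Match List Page
    return results


# Simulated pages, for illustration only.
pages = {1: ["link-a", "link-b"], 2: ["link-c"]}
records = collect_all_matches(
    lambda n: pages.get(n, []),
    lambda link: {"link": link, "shared_cm": None},
)
print(len(records))  # 3
```

Passing the page-fetching steps in as functions keeps the loop testable against simulated data before it is pointed at anything real.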

When I first used Ancestry, I started to do this manually for my 4th cousin matches. After a very tedious Sunday of clicks, I looked online for an automated solution. I found several utilities and browser add-ons that will do the job and produce spreadsheets of attributes. They’re very good, but didn’t quite suit my purposes. So I devised my own process to deliver the same basic functionality.

Technically, I use Python scripts and a number of Python libraries that facilitate the interaction with web pages. I load my data to SQL and NoSQL data technologies to facilitate data analysis. I also use R for data analysis and presentation.
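As an illustration of the SQL side, here is a minimal sketch using Python’s built-in sqlite3 module. The choice of SQLite, the table layout and the sample rows are all assumptions made for the example; no specific database or schema is named in this post.

```python
import sqlite3

# Hypothetical rows (match identifier, shared CM), for illustration only.
rows = [
    ("match-001", 6.4),
    ("match-002", 8.1),
    ("match-003", 12.7),
    ("match-004", 6.0),
    ("match-005", 34.9),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (match_id TEXT PRIMARY KEY, shared_cm REAL)")
conn.executemany("INSERT INTO matches VALUES (?, ?)", rows)

# Count matches per CM range using a CASE expression.
query = """
    SELECT CASE
             WHEN shared_cm < 7  THEN '6 to 6.9'
             WHEN shared_cm < 10 THEN '7 to 9.9'
             WHEN shared_cm < 20 THEN '10 to 19.9'
             ELSE '20 and over'
           END AS cm_range,
           COUNT(*) AS n
    FROM matches
    GROUP BY cm_range
"""
counts = dict(conn.execute(query).fetchall())
conn.close()

print(counts)
```

Once the matches are in a table, each new snapshot can be loaded alongside the old ones and compared with ordinary queries.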

The add-ons and utilities would not be needed for Ancestry if the company would provide users with the facility to download their match data to file. They don’t seem to intend to provide this in the foreseeable future.

If any of us were to embark on manually retrieving all our match data from Ancestry, it would take weeks of clicking and copying. The utilities (mine and others) also take time to complete. The common factor is the number of matches that we have, but I expect most users will need many hours for any utility to complete.
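To get a feel for the scale, the running time is roughly the number of matches multiplied by the time spent per match. In this minimal sketch the 5 seconds per match is an assumed placeholder, not a measured figure; substitute whatever you observe with your own utility.

```python
def estimated_hours(match_count, seconds_per_match):
    """Rough running time in hours for visiting every match once."""
    return match_count * seconds_per_match / 3600


# 9,026 matches (my May 2018 total) at an assumed 5 seconds per match.
hours = estimated_hours(9026, 5)
print(f"About {hours:.1f} hours")
```

Because the estimate scales linearly, a kit with twice as many matches would need about twice as long at the same pace.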