Measuring the Usefulness of Trees of my Ancestry DNA Matches

Previously I discussed an analysis of the availability of trees across all my DNA matches. The headline figures from May 2018 were

  • 40% of my matches had a public linked tree
  • 27% had no linked tree but had at least one public unlinked tree
  • 7% had a private tree
  • 26% had no tree

Taken together (linked plus unlinked), 67% of my matches had at least one public tree, which at first glance would seem to be a very positive outcome.

To analyse usefulness I proceeded to drill down further into the actual content of those trees. Specifically, my May analysis included a download of the Direct Ancestor Surname lists provided by Ancestry on the “Pedigree and Surname” page for each match where a tree is either displayed by default (the linked tree) or has been selected from a list of unlinked trees. I use my own utility, but the DNAGedcom Client will also download surname lists to a spreadsheet to allow analysis. I don’t know if it grabs unlinked trees; mine does to a limited degree (just one per match).

It was immediately clear that many matches had a tree, but had no available surname list. The reason was simple: the tree was public but all entries had been set to a status of living, so their details were hidden:

Everybody Lives!

This was a significant 11% of my matches with a public linked or unlinked tree, or a rounded 8% of my total matches, as reflected in the revised usefulness proportion:

Single Surname Trees

Aside from the I-see-no-dead-people trees, the other happiness-killers are the loner trees with but a single visible entry:

I also include in this category those trees with multiple generations of a single surname and no visible spouses. These single surname trees were a very similar (slightly lower) percentage to the “Everybody Lives” category. The usefulness proportion is again reduced:

Variety is the Spice of Life

My grandmother was a Smith, and so it seems were all the neighbours. But research-wise I could have had it worse, like this guy:

Aside from Smith, the two other names I’ve blanked for privacy are common Irish surnames. If you’re thinking Murphy and Reilly, you’re half right. With my current research I find that the most useful trees are those with a good variety of ancestral surnames, certainly more than the three in this example. A high variety will indicate a higher number of ancestral generations represented in the tree.

I titled this post as “The Usefulness of Trees”, but usefulness is in the eye of the beholder. If I was searching for living relatives (e.g. as an adoptee) and this was a close match, this might be gold if the match is prepared to reply to messages.

For now, I will define the potential usefulness of a tree by the number of distinct surnames in the ancestral list. That won’t be the case for everyone, and it may change for me if the focus of my research changes. That being said, I now want to get a measure of how many matches with potentially useful trees I have.

The match with the highest count has 272 distinct direct ancestral surnames. The tree goes back to the early 1600s and has over 12K people. The rest of my matches are distributed over a range of numbers.

For illustrative purposes, the next figure shows where my matches fall into bands of “number of distinct ancestral surnames”. I’ve already noted the lowest two bands: zero and the single surname crowd.

Above zero or one surnames is the band of low variety: I’ve fairly arbitrarily put this as 2, 3 or 4 distinct surnames. This is where I also think little immediate value is to be had. I’d probably need to build out a research tree, which is why I say there is little *immediate* value to me. It’s about 30% of my public linked/unlinked trees. That eats into my usefulness like so:

That leaves me with about 32% of matches with “useful” trees for my current purposes. For me personally, that’s currently at least 2,870 trees.
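The banding above can be sketched in a few lines of Python. This is only an illustration and not the utility used for this post: the function names and band labels are my own, and I am assuming each match's surname list is available as a plain list of strings.

```python
def count_distinct_surnames(surnames):
    """Count distinct surnames, ignoring case and surrounding whitespace."""
    return len({s.strip().lower() for s in surnames if s.strip()})


def surname_band(distinct_surnames):
    """Map a distinct-surname count to one of the bands described in the post.

    The labels are illustrative; the cut-offs (0, 1, 2 to 4, 5 and over)
    follow the text.
    """
    if distinct_surnames < 0:
        raise ValueError("count cannot be negative")
    if distinct_surnames == 0:
        return "none visible"          # e.g. everybody marked as living
    if distinct_surnames == 1:
        return "single surname"
    if distinct_surnames <= 4:
        return "low variety"
    return "potentially useful"


example = ["Smith", "smith ", "Murphy", ""]
print(surname_band(count_distinct_surnames(example)))  # low variety
```

Counting case-insensitively is a design choice on my part; spelling variants of the same surname would still be counted separately.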

Analysis of Tree Availability of My Matches on Ancestry DNA

In May 2018 I conducted a review of all my DNA matches on Ancestry and recorded whether the match had a public or private tree, and whether a tree was linked to their DNA. I used techniques I describe here.

The headline numbers are: 40% of my matches have a public linked tree and 27% have at least one public unlinked tree. 7% have only a private tree. 26% have no tree at all.

The headline numbers don’t capture whether the tree is actually useful. Of the 67% of matches with an available tree, 11% have everybody set to living status, so their details are hidden. Just to be clear, that’s a rounded 8% of my total matches, which is a significant number and is close to the 7% of matches who have set their full tree to private.

I’ll discuss a more detailed assessment of the “usefulness” of trees in a separate post. Here, I will show some comparisons of the headline numbers with other Ancestry users who have conducted similar analysis.

Blaine Bettinger reported his breakdown in January 2017 based on a manual analysis of his top 500 matches. I think it’s a reasonable sample. Martin Abrams commented on Blaine’s posts with a manual survey of his top 200 matches. Here’s a side by side comparison of my numbers with the two gentlemen. My breakdown is remarkably similar to Martin Abrams’s figures, while Blaine seems to have all the luck in terms of low “No Tree” matches.

I searched for other reported numbers, but others seem to be using the DNAGedcom utility for analysis and that excellent utility does not identify private versus unattached trees. I’ve put the technical details of my own method of analysis at the end of this blog post.

There is no particular conclusion to be drawn based on three people from separate time periods, other than a personal assessment as to whether Ancestry continues to be useful to me for my particular research interests. As new matches trickle in, if they become predominantly “No Tree” then that would be problematic.

There is a worry amongst genetic genealogy enthusiasts that Ancestry’s marketing focus on ethnicity reports is bringing in swathes of new matches with no interest in creating trees. Readers of prior posts may remember that I ran an analysis of all my matches on two prior occasions including July 2017. Unfortunately I wasn’t capturing tree information at those times. I do intend to repeat the snapshot in August and beyond so I will start to get trends within my own data.

Technical Details

My own utility identifies private trees with a simple text search for Ancestry’s phrasing on the Match page: “family tree is private”.

For unlinked trees, there is a drop down list box on the Match page inviting you to select a tree from a list. Even if the list only has one tree, the list box is present: this allows me to identify matches with unlinked tree(s). My assumption here is that the listbox will not include any private trees i.e. if the match only has an unlinked private tree, that Ancestry will display their status identically to the “No Tree” people. I’m open to correction here!
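As a rough sketch of the classification logic only (not my actual utility), the checks could be combined as below. The quoted phrase comes from the text above; the two boolean flags, the function name and the order of the checks are assumptions for illustration, and how the flags are detected on the page is deliberately left out.

```python
PRIVATE_PHRASE = "family tree is private"  # Ancestry's phrasing, as quoted above


def classify_match(page_text, has_public_linked_tree, has_tree_dropdown):
    """Assign a match to one of the four headline categories.

    page_text: visible text of the Match page.
    has_public_linked_tree: hypothetical flag, True if a linked tree is displayed.
    has_tree_dropdown: hypothetical flag, True if the unlinked-tree list box is present.
    """
    if has_public_linked_tree:
        return "public linked tree"
    if has_tree_dropdown:
        return "public unlinked tree"
    if PRIVATE_PHRASE in page_text.lower():
        return "private tree"
    return "no tree"


print(classify_match("This user's family tree is private.", False, False))  # private tree
```

Checking for public trees before the private phrase reflects the "only a private tree" wording of the headline category; that ordering is my reading rather than something Ancestry documents.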

Ten Months of Growth on AncestryDNA

In July 2017 I reviewed all my DNA matches on AncestryDNA by recording the shared CM number of every match. I repeated the exercise in February 2018 and again in early May 2018. I used techniques I describe in this blog post.

I was particularly interested in:

(1) the rate of growth of matches over this 10 month period of heavy marketing by the company

(2) the distribution of matches by shared CM for each given period e.g. my total number of matches below 10 CM, between 10 and 20 CM, and above 20 CM

(3) whether the distribution of matches stayed similar during this period of growth e.g. are the growing numbers solely due to tiny 6 CM matches

(4) what was the effect of the opt-out policy introduced in November 2017 e.g. did I lose a lot of matches who opted out

This blog post looks in detail at trends within my personal data.

(1) the rate of growth of matches over this 10 month period

Before I break down the matches by centimorgan, here are the total number of matches at each snapshot in time: 5,100 in July 2017, 7,494 in Feb 2018 and 9,026 in May 2018

American or Irish readers may think my actual numbers are pretty small. My heritage is half Irish and half African, and so I have very few matches on one side. Those with two Irish parents and grandparents may have twice my totals, while U.S. readers may have much higher totals than mine.

But it’s the trend that is of particular interest: the total matches near doubled across a ten month period!

It would have been nice to have evenly distributed snapshots e.g. quarterly, but the results do show that the rate of growth accelerated in 2018. I assume that the reason is increased marketing spend resulting in more people testing. I’m not aware of changes in matching methods in 2018 which might have increased matches among the existing sample base.
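To put a rough number on that acceleration, here is a small Python sketch that turns the three totals into an average gain per month. The totals are the ones quoted above; the month offsets (0, 7 and 10) are my approximation of the gaps between snapshots, and the names are purely illustrative.

```python
# (label, total matches, approximate months since the first snapshot)
SNAPSHOTS = [
    ("July 2017", 5100, 0),
    ("Feb 2018", 7494, 7),
    ("May 2018", 9026, 10),
]


def monthly_gains(snapshots):
    """Average number of new matches per month between consecutive snapshots."""
    gains = []
    for (_, prev_total, prev_month), (label, total, month) in zip(snapshots, snapshots[1:]):
        gains.append((label, (total - prev_total) / (month - prev_month)))
    return gains


for label, gain in monthly_gains(SNAPSHOTS):
    print(f"up to {label}: about {gain:.0f} new matches per month")
```

On those assumptions the average is about 342 new matches per month up to February 2018 and about 511 per month from February to May 2018.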

Those of you who have been using Ancestry for a while will remember that once upon a time the display showed the total number of matches, but that has been removed. There is a way of calculating the number without any coding, which I documented in this blog post.

(2) the distribution of matches by shared CM for each given snapshot

So this growth is impressive at first glance, but may be fairly useless if just one of the extra 4,000 matches is a 4th cousin or closer, and the rest are lowly 6.0 CM matches. I am particularly interested in the number of matches broken down by CM, which Ancestry does not provide via the website. I used techniques documented here to record the CM of all my matches to one decimal place. But knowing the number of matches at 6.9 CM versus 6.8 isn’t of particular use. Instead, I’m going to roll up the matches into four ranges:

  • 6 CM to 6.9 CM
  • 7 CM to 9.9 CM
  • 10 CM to 19.9 CM
  • 20 CM and over

This is also known as segmenting the data and I would more naturally refer to these ranges as segments, but I don’t want to confuse with the genetic terminology of segments.
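Rolling up recorded CM values into the four ranges is simple to sketch in Python. The function names and sample values below are made up for illustration; only the range boundaries come from the list above.

```python
from collections import Counter


def cm_range(shared_cm):
    """Return the label of the range that a shared CM value falls into."""
    if shared_cm < 7.0:
        return "6 CM to 6.9 CM"
    if shared_cm < 10.0:
        return "7 CM to 9.9 CM"
    if shared_cm < 20.0:
        return "10 CM to 19.9 CM"
    return "20 CM and over"


def tally_by_range(cm_values):
    """Count how many matches fall into each range."""
    return Counter(cm_range(cm) for cm in cm_values)


sample = [6.0, 6.5, 8.2, 12.0, 45.3]  # made-up values, one decimal place
print(tally_by_range(sample))
```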

Why these ranges? It allows me to separate “the wheat from the chaff” in terms of how confident I can be that

(a) the match is genuine instead of due to random factors of DNA inheritance and

(b) I have a reasonable chance of investigating the common ancestor (easier for 3rd cousins than for genuine 4th cousins).

20 CM and over represents approximately the relatedness of fourth cousin and closer. These are the “lowest hanging fruit” i.e. the matches I’m most likely to focus on first to try to establish the precise family tree relationship. I can also be confident that these are genuine matches and unlikely to include false positives due to the random nature of DNA inheritance and recombination.

Blaine Bettinger has a good article on the danger zones below 20 CM based on his own results, while this ISOGG article is based on wider academic research. The conclusion is that matches from 10 CM and upward have a 94% probability of being genuine.

My lowest range is set to every match that is less than 7 CM, which represents the matches we can be least sure are genuine DNA matches as opposed to coincidence. Ancestry doesn’t show matches below 6.0 CM anyway.

There is a grey area between 7 and 10 CM where the certainty of a match is above 50% but below 94%. Users with hundreds of fourth cousin matches may choose to ignore the 7-10 CM range. Those of us with fewer matches may pay more attention to this range while recognizing that a significant percentage will not be useful. Therefore I’ve defined an “upper-middle” range of 10-19.9 CM and a “lower-middle” range of 7-9.9 CM.

Here are the raw numbers:

And here are those numbers in a bar chart:

Clearly all ranges are trending upwards. My “20 CM and over” rose from 55 to 87 to 91. A few months later, though I haven’t done full analysis, I was delighted to tip over the 100 mark. My next zone of interest, “10 to 19.9 CM” rose from 893 to 1,331 to 1,585.

(3) whether the distribution of matches stayed similar during this period of growth

The next figure shows the breakdown by % of total matches. What is immediately clear is that the percentages remain similar through the growth period. That is no great surprise. One scenario I can think of where the close CM percentage might increase is if I (or a close cousin) encouraged a large number of family members to test in the same time period which could cause a spike at the higher CM level. But as distant relatives would be behaving similarly, it would even out over time.

The percentages for “20 CM and above” are a little hard to read in the graphic, because it’s proportionally small. They read: 1.1%, 1.2%, 1.0%.
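Those percentages can be reproduced from the counts and totals quoted earlier in this post. A short Python check, with variable names of my own choosing:

```python
TOTALS = {"July 2017": 5100, "Feb 2018": 7494, "May 2018": 9026}
OVER_20_CM = {"July 2017": 55, "Feb 2018": 87, "May 2018": 91}
CM_10_TO_19_9 = {"July 2017": 893, "Feb 2018": 1331, "May 2018": 1585}


def share_pct(count, total):
    """Percentage of total matches, rounded to one decimal place."""
    return round(100 * count / total, 1)


for snapshot, total in TOTALS.items():
    print(
        snapshot,
        share_pct(OVER_20_CM[snapshot], total),      # 1.1, 1.2, 1.0
        share_pct(CM_10_TO_19_9[snapshot], total),   # 17.5, 17.8, 17.6
    )
```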

(4) the effect of the opt-out policy introduced in November 2017

Finally, I want to address the impact of the introduction of the opt-out clause in November 2017. There were concerns at the time amongst the blogosphere that matches might drop off a cliff as people rushed to hide themselves. The short answer is that according to my results, there is no need to be concerned.

I have documented the results in a separate blog post as it may be of more general interest than the detailed numbers presented in this blog.

Ancestry Opt-Outs are in low numbers

Ancestry introduced a new privacy policy in November 2017 that allowed customers to opt out of DNA matching. Customers who opt out receive their heritage breakdown, but do not see their DNA matches. They are also not shown on the match lists of any other customers.

Prior to this change, customers who wished to disengage would have to delete their DNA completely.

When the change was first announced, there was some concern among Ancestry users that the number of shared matches would reduce significantly if many existing customers chose to opt out. Then users noticed “missing” matches when they went to review past notes and manual lists. It didn’t help that there was also erratic behaviour of the Surname/Location Search functionality.

It’s difficult to be sure about the impact without snapshot lists of matches before and after the change. Unfortunately Ancestry does not provide the facility to download matches to spreadsheet, unlike some other companies.

In July 2017 I personally recorded details about all my DNA matches on AncestryDNA. I repeated the exercise in February 2018 and again in May 2018. I used techniques I describe here.

It’s just luck that one of my snapshots was before the change, but it allows me to do some analysis of the impact on my own results of people choosing to disengage from matching.

So, how many matches were in my snapshot of July 2017 that *were not* in my snapshot of February 2018?

A grand total of five matches disappeared.

I don’t know if they opted out or if they completely deleted their DNA and closed their accounts. However none of these five had reappeared by May 2018.
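Comparing two snapshots comes down to a set difference. The sketch below assumes each match can be recorded under some stable identifier that is the same in every snapshot; the identifiers and function names are hypothetical.

```python
def disappeared(earlier_ids, later_ids):
    """Matches present in the earlier snapshot but missing from the later one."""
    return set(earlier_ids) - set(later_ids)


def appeared(earlier_ids, later_ids):
    """Matches present in the later snapshot but not in the earlier one."""
    return set(later_ids) - set(earlier_ids)


july = {"match-a", "match-b", "match-c"}       # made-up identifiers
february = {"match-b", "match-c", "match-d"}

print(disappeared(july, february))  # {'match-a'}
print(appeared(july, february))     # {'match-d'}
```

The same comparison, run between the later snapshot and a third one, shows whether any of the missing matches have come back.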

I wrote a separate blog post with full analysis and breakdown of my total numbers. As I gained 2,394 matches over the same time period, the opt-out number is insignificant. As that post also shows that over 80% of my matches were below 10 CM, statistically these low opt-outs are most likely to be low CM. I could have been really unlucky that one or two of those matches happened to be close, but I happen to know that they were all 10 CM or below.

Between February and May 2018 a further two people disappeared from my match list.

Again, they were of minor CM.

I acknowledge that these disappearing matches could have been very close matches for some existing customers seeking to solve mysteries. The impact to those customers is of course highly significant. But overall, this is not a worrying factor for me in my continuing use of AncestryDNA.

How to count your total number of matches on AncestryDNA

Those of you who have been using Ancestry for a while will remember that once upon a time the display showed the total number of matches, but that has been removed. There is a way of calculating the number using the website alone, which just requires a bit of methodical clicking to identify the last page of your results.

Let’s say you identify the last page as page number 183 which shows a shortened list of only 12 matches. We know that there are 50 matches displayed on a full page, so you have (182*50) + 12 matches at that point in time i.e. 9,112 matches.
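The same arithmetic as a tiny Python function, for anyone who prefers not to do it by hand. The function name is my own; the page size of 50 is the figure given above.

```python
def total_matches(last_page, matches_on_last_page, page_size=50):
    """Every page before the last is full; the last page holds the remainder."""
    return (last_page - 1) * page_size + matches_on_last_page


print(total_matches(183, 12))  # 9112
```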

So how do you identify the last page?

You make use of the box at the top right of the page that allows you to jump to a specific page number.

If you enter a page number that is beyond the number of matches in your list, you will be shown this display:

So the trick is to find the page number directly before the lowest page which shows “nothing found”. It’s not as complicated as it reads.

The first thing to do is to enter a guesstimate number that is probably past your number of matches e.g. 300 (you would need more than 14,950 matches for page 300 to show any results).

What you do next depends on whether you get results or not on this candidate as the last page. I’ll describe both scenarios as a blow-by-blow.

(A) I see matches on page 300

If you actually get results, then keep entering an increased number in increments of 50 (an arbitrary choice) until you get the “no matches found” display.

Let’s say you entered page 350 and get a full list of matches, but see “no matches found” on page 400. You know that somewhere between 350 and 400 is the last page of matches.

At this point you could work backwards from 400 in decrements of 1 until you find the highest page that shows matches. As that could potentially be a tedious 49 clicks, you can use a technique known as the binary chop: drop down by half your current gap i.e. to page 375 (half of 50). If no match is found, drop down by a further half (half of half of 50) i.e. 363, and then to 357, then 353, and finally 351. At some point in this sequence of five or six clicks, you’ll hit a page that has either a full list or a shortened list. If it’s the shortened list, you’ve found the last page (you can verify by hitting the next button to see “no matches found”). If you hit the full list, the last page is somewhere above it, so go back up by half of the gap you just dropped and carry on halving until you find the shortened list.

(B) I see no matches on page 300

Keep entering page numbers in decrements of 50 (an arbitrary choice) until you see a page with matches.

If it’s a shortened list, then quickly verify that it is indeed the last page by using the next button to go up by 1 page.

If it’s a full page then you could at this point work upwards in increments of 1 until you find the first page that shows no matches. That is potentially 49 clicks which is tedious. You can use a technique known as the binary chop to save time.

Let’s say you dropped from page 250 to 200 where you found a full list of 50 matches. Go up by half your current gap i.e. to page 225. If matches are shown, then go up again by a further half (half of half of 50) i.e. 237, and then to 243, then 246, and finally 249. Eventually you’ll hit a page that has no matches or a shortened list. If it’s the shortened list, then bingo, you’ve found the last page (you can verify by hitting the next button to see “no matches found”). If you hit a page with no matches, the last page is somewhere below it, so drop back down by half of the gap you just climbed and carry on halving until you find the shortened list.

And that’s it! Now just use the calculation: ((last page – 1) * 50) + remainder e.g. (182 * 50) + 12

Then make a note somewhere of what the actual last page is, so if you want to do another count in the future you can add an arbitrary gap and work back down with the binary chop.
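For the curious, scenarios (A) and (B) can be sketched together in Python. This only illustrates the search logic: `has_matches` is a hypothetical stand-in for "jump to this page and see whether any matches are listed", and here it is simulated from a known total rather than by touching the website.

```python
def find_last_page(has_matches, first_guess=300, step=50):
    """Return the highest page number that still lists matches (0 if none).

    has_matches(page) must return True if that page shows at least one match.
    """
    if not has_matches(1):
        return 0
    if has_matches(first_guess):
        # Scenario (A): step upward until an empty page is found.
        low, high = first_guess, first_guess + step
        while has_matches(high):
            low, high = high, high + step
    else:
        # Scenario (B): step downward until a page with matches is found.
        high, low = first_guess, max(1, first_guess - step)
        while not has_matches(low):
            high, low = low, max(1, low - step)
    # Binary chop: low always has matches, high is always empty.
    while high - low > 1:
        mid = (low + high) // 2
        if has_matches(mid):
            low = mid
        else:
            high = mid
    return low


def simulated_site(total, page_size=50):
    """Build a has_matches function for a pretend list of `total` matches."""
    return lambda page: (page - 1) * page_size < total


print(find_last_page(simulated_site(9112)))  # 183
```

The chop here always splits the remaining gap exactly in half, so the pages it visits differ slightly from the hand-worked sequences above, but it lands on the same last page.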