Analysis of Tree Availability of My Matches on Ancestry DNA

In May 2018 I conducted a review of all my DNA matches on Ancestry and recorded whether the match had a public or private tree, and whether a tree was linked to their DNA. I used techniques I describe here.

The headline numbers are: 40% of my matches have a public linked tree and 27% have at least one public unlinked tree. 7% have only a private tree. 26% have no tree at all.

The headline numbers don’t capture whether the tree is actually useful. Of the 67% of matches with an available tree, 11% have everybody set to living status, so their details are hidden. Just to be clear, that’s a rounded 8% of my total matches, which is a significant number and is close to the 9% of matches who have set their full tree to private.

I’ll discuss a more detailed assessment of the “usefulness” of trees in a separate post. Here, I will show some comparisons of the headline numbers with other Ancestry users who have conducted similar analysis.

Blaine Bettinger reported his breakdown in January 2017 based on a manual analysis of his top 500 matches. I think it’s a reasonable sample. Martin Abrams commented on Blaine’s posts with a manual survey of his top 200 matches. Here’s a side-by-side comparison of my numbers with those of the two gentlemen. My breakdown is remarkably similar to Martin Abrams’s figures, while Blaine seems to have all the luck in terms of low “No Tree” matches.

I searched for other reported numbers, but others seem to be using the DnaGedcom utility for analysis and that excellent utility does not identify private versus unattached trees. I’ve put the technical details of my own method of analysis at the end of this blog post.

There is no particular conclusion to be drawn based on three people from separate time periods, other than a personal assessment as to whether Ancestry continues to be useful to me for my particular research interests. As new matches trickle in, if they become predominantly “No Tree” then that would be problematic.

There is a worry amongst genetic genealogy enthusiasts that Ancestry’s marketing focus on ethnicity reports is bringing in swathes of new matches with no interest in creating trees. Readers of prior posts may remember that I ran an analysis of all my matches on two prior occasions including July 2017. Unfortunately I wasn’t capturing tree information at those times. I do intend to repeat the snapshot in August and beyond so I will start to get trends within my own data.

Technical Details

My own utility identifies private trees with a simple text search for Ancestry’s phrasing on the Match page: “family tree is private”.

For unlinked trees, there is a drop down list box on the Match page inviting you to select a tree from a list. Even if the list only has one tree, the list box is present: this allows me to identify matches with unlinked tree(s). My assumption here is that the listbox will not include any private trees i.e. if the match only has an unlinked private tree, that Ancestry will display their status identically to the “No Tree” people. I’m open to correction here!
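A minimal sketch of this classification in Python, for illustration only. The private-tree phrase comes from the text above. The two boolean flags (`has_linked_tree`, `has_tree_dropdown`) are hypothetical inputs that would have to be derived from the Match page, and the order of the checks is an assumption.

```python
from collections import Counter

# Phrase quoted in the post; matching is done case-insensitively here.
PRIVATE_PHRASE = "family tree is private"


def classify_tree_status(page_text, has_linked_tree, has_tree_dropdown):
    """Assign one Match page to one of the four headline categories.

    has_linked_tree and has_tree_dropdown are hypothetical flags that the
    caller would derive from the page; how to detect them is not shown here.
    """
    if PRIVATE_PHRASE in page_text.lower():
        return "private tree"
    if has_linked_tree:
        return "public linked tree"
    if has_tree_dropdown:
        # The drop-down is present even when only one unlinked tree exists.
        return "public unlinked tree"
    return "no tree"


def tally(pages):
    """Count categories over an iterable of (text, linked, dropdown) tuples."""
    return Counter(classify_tree_status(*page) for page in pages)
```

Dividing each count from `tally` by the total number of matches gives percentages like the headline numbers.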

Ten Months of Growth on AncestryDNA

In July 2017 I reviewed all my DNA matches on AncestryDNA by recording the shared CM number of every match. I repeated the exercise in February 2018 and again in early May 2018. I used techniques I describe in this blog post.

I was particularly interested in:

(1) the rate of growth of matches over this 10 month period of heavy marketing by the company

(2) the distribution of matches by shared CM for each given period e.g. my total number of matches below 10 CM, between 10 and 20 CM, and above 20 CM

(3) whether the distribution of matches stayed similar during this period of growth e.g. are the growing numbers solely due to tiny 6 CM matches

(4) what was the effect of the opt-out policy introduced in November 2017 e.g. did I lose a lot of matches who opted out

This blog post looks in detail at trends within my personal data.

(1) the rate of growth of matches over this 10 month period

Before I break down the matches by centimorgan, here is the total number of matches at each snapshot in time: 5,100 in July 2017, 7,494 in Feb 2018 and 9,026 in May 2018.

American or Irish readers may think my actual numbers are pretty small. My heritage is half Irish and half African, and so I have very few matches on one side. Those with two Irish parents and grandparents may have twice my totals, while U.S. readers may have much higher totals than mine.

But it’s the trend that is of particular interest: the total matches near doubled across a ten month period!

It would have been nice to have evenly distributed snapshots e.g. quarterly, but the results do show that the rate of growth accelerated in 2018. I assume that the reason is due to increased marketing spend resulting in more people testing. I’m not aware of changes in matching methods in 2018 which might have increased matches among the existing sample base.
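Because the snapshots are unevenly spaced, the acceleration is easier to see as an average number of new matches per month. Here is a small sketch using the three totals above. The day of the month is an assumption (the first of each month), since only the months are known.

```python
from datetime import date

# Totals from the three snapshots; the day of month is assumed.
snapshots = [
    (date(2017, 7, 1), 5100),
    (date(2018, 2, 1), 7494),
    (date(2018, 5, 1), 9026),
]


def months_between(start, end):
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def monthly_growth(points):
    """Average new matches per month between consecutive snapshots."""
    rates = []
    for (d0, n0), (d1, n1) in zip(points, points[1:]):
        rates.append((n1 - n0) / months_between(d0, d1))
    return rates


for rate in monthly_growth(snapshots):
    print(round(rate))  # prints 342, then 511
```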

Those of you who have been using Ancestry for a while will remember that once upon a time the display showed the total number of matches, but that has been removed. There is a way of calculating the number without any coding, which I documented in this blog post.

(2) the distribution of matches by shared CM for each given snapshot

So this growth is impressive at first glance, but may be fairly useless if just one of the extra 4,000 matches is a 4th cousin or closer, and the rest are lowly 6.0 CM matches. I am particularly interested in the number of matches broken down by CM, which Ancestry does not provide via the website. I used techniques documented here to record the CM of all my matches to one decimal place. But knowing the number of matches at 6.9 CM versus 6.8 isn’t of particular use. Instead, I’m going to roll up the matches into four ranges:

  • 6 CM to 6.9 CM
  • 7 CM to 9.9 CM
  • 10 CM to 19.9 CM
  • 20 CM and over

This is also known as segmenting the data and I would more naturally refer to these ranges as segments, but I don’t want to cause confusion with the genetic terminology of segments.

Why these ranges? It allows me to separate “the wheat from the chaff” in terms of how confident I can be that

(a) the match is genuine instead of due to random factors of DNA inheritance and

(b) I have a reasonable chance of investigating the common ancestor (easier for 3rd cousins than for genuine 4th cousins).

20 CM and over represents approximately the relatedness of fourth cousin and closer, with a high level of confidence. These are the “lowest hanging fruit” i.e. the matches I’m most likely to focus on first to try to establish the precise family tree relationship. I can also be confident that these are genuine matches and unlikely to include false positives due to the random nature of DNA inheritance and recombination.

Blaine Bettinger has a good article on the danger zones below 20 CM based on his own results, while this ISOGG article is based on wider academic research. The conclusion is that matches from 10 CM and upward have a 94% probability of being genuine.

My lowest range covers every match below 7 CM. These are the matches we can be least sure are genuine DNA matches as opposed to coincidence. Ancestry doesn’t show matches below 6.0 CM anyway.

There is a grey area between 7 and 10 CM where the certainty of a match is above 50% but below 94%. Users with hundreds of fourth cousin matches may choose to ignore the 7-10 CM range. Those of us with fewer matches may pay more attention to this range while recognizing that a significant percentage will not be useful. Therefore I’ve defined an “upper-middle” range of 10-19.9 CM and a “lower-middle” range of 7-9.9 CM.
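Rolling individual CM values up into the four ranges is a simple bucketing step. The sketch below shows one way to do it. The sample values are made up for illustration and are not my real data.

```python
from collections import Counter


def cm_range(shared_cm):
    """Map a shared CM value (recorded to one decimal) to its range label."""
    if shared_cm < 7.0:
        return "6 CM to 6.9 CM"
    if shared_cm < 10.0:
        return "7 CM to 9.9 CM"
    if shared_cm < 20.0:
        return "10 CM to 19.9 CM"
    return "20 CM and over"


def count_by_range(cm_values):
    """Count how many matches fall in each range."""
    return Counter(cm_range(value) for value in cm_values)


# Made-up sample values, purely to show the shape of the output.
sample = [6.0, 6.9, 7.0, 9.9, 10.0, 19.9, 20.0, 154.3]
print(count_by_range(sample))
```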

Here are the raw numbers:

And here are those numbers in a bar chart:

Clearly all ranges are trending upwards. My “20 CM and over” rose from 55 to 87 to 91. A few months later, though I haven’t done full analysis, I was delighted to tip over the 100 mark. My next zone of interest, “10 to 19.9 CM” rose from 893 to 1,331 to 1,585.

(3) whether the distribution of matches stayed similar during this period of growth

The next figure shows the breakdown by % of total matches. What is immediately clear is that the percentages remain similar through the growth period. That is no great surprise. One scenario I can think of where the close CM percentage might increase is if I (or a close cousin) encouraged a large number of family members to test in the same time period which could cause a spike at the higher CM level. But as distant relatives would be behaving similarly, it would even out over time.

The percentages for “20 CM and above” are a little hard to read in the graphic, because it’s proportionally small. They read: 1.1%, 1.2%, 1.0%.
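The percentages can be recomputed from the counts quoted earlier in this post. This sketch covers only the two ranges whose counts appear in the text.

```python
# Total matches and per-range counts quoted in this post, per snapshot.
totals = {"July 2017": 5100, "Feb 2018": 7494, "May 2018": 9026}
over_20 = {"July 2017": 55, "Feb 2018": 87, "May 2018": 91}
ten_to_twenty = {"July 2017": 893, "Feb 2018": 1331, "May 2018": 1585}


def share_percent(part, whole):
    """Share of the total as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for snapshot, total in totals.items():
    print(
        snapshot,
        share_percent(over_20[snapshot], total),
        share_percent(ten_to_twenty[snapshot], total),
    )
```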

(4) the effect of the opt-out policy introduced in November 2017

Finally, I want to address the impact of the introduction of the opt-out clause in November 2017. There were concerns at the time amongst the blogosphere that matches might drop off a cliff as people rushed to hide themselves. The short answer is that according to my results, there is no need to be concerned.

I have documented the results in a separate blog post as it may be of more general interest than the detailed numbers presented in this blog.

Ancestry Opt-Outs are in low numbers

Ancestry introduced a new privacy policy in November 2017 that allowed customers to opt out of DNA matching. Customers who opt out receive their heritage breakdown, but do not see their DNA matches. They are also not shown on the match lists of any other customers.

Prior to this change, customers who wished to disengage would have to delete their DNA completely.

When the change was first announced, there was some concern among Ancestry users that the number of shared matches would reduce significantly if many existing customers chose to opt out. Then users noticed “missing” matches when they went to review past notes and manual lists. It didn’t help that there was also erratic behaviour of the Surname/Location Search functionality.

It’s difficult to be sure about the impact without snapshot lists of matches before and after the change. Unfortunately Ancestry does not provide the facility to download matches to spreadsheet, unlike some other companies.

In July 2017 I personally recorded details about all my DNA matches on AncestryDNA. I repeated the exercise in February 2018 and again in May 2018. I used techniques I describe here.

It’s just luck that one of my snapshots was before the change, but it allows me to do some analysis of the impact on my own results of people choosing to disengage from matching.

So, how many matches were in my snapshot of July 2017 that *were not* in my snapshot of February 2018?

A grand total of five matches disappeared.

I don’t know if they opted out or if they completely deleted their DNA and closed their accounts. However none of these five had reappeared by May 2018.
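Finding the matches that disappeared between two snapshots is a set difference. Here is a sketch, assuming each match can be identified by some stable ID. The IDs below are invented for the example.

```python
def compare_snapshots(earlier_ids, later_ids):
    """Compare two snapshots of match identifiers."""
    earlier, later = set(earlier_ids), set(later_ids)
    return {
        "disappeared": earlier - later,  # in the old list, not in the new
        "gained": later - earlier,       # new since the old snapshot
        "kept": earlier & later,         # present in both
    }


# Invented identifiers, for illustration only.
july = ["m001", "m002", "m003", "m004"]
february = ["m001", "m003", "m004", "m005", "m006"]

result = compare_snapshots(july, february)
print(sorted(result["disappeared"]))  # prints ['m002']
```

Running the same comparison against a third snapshot shows whether any of the disappeared matches came back.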

I wrote a separate blog post with full analysis and breakdown of my total numbers. As I gained 2,394 matches over the same time period, the opt-out number is insignificant. As that post also shows that over 80% of my matches were below 10 CM, statistically these low opt-outs are most likely to be low CM. I could have been really unlucky that one or two of those matches happened to be close, but I happen to know that they were all 10 CM or below.

Between February and May 2018 a further two people disappeared from my match list.

Again, they were of minor CM.

I acknowledge that these seven could have been very close matches for some existing customers seeking to solve mysteries. The impact to those customers is of course highly significant. But overall, this is not a worrying factor for me in my continuing use of AncestryDNA.

How to count your total number of matches on AncestryDNA

Those of you who have been using Ancestry for a while will remember that once upon a time the display showed the total number of matches, but that has been removed. There is a way of calculating the number using the website alone, which just requires a bit of methodical clicking to identify the last page of your results.

Let’s say you identify the last page as page number 183 which shows a shortened list of only 12 matches. We know that there are 50 matches displayed on a full page, so you have (182*50) + 12 matches at that point in time i.e. 9,112 matches.
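The same arithmetic as a tiny Python function, for anyone who prefers not to do it by hand. The page size of 50 is taken from the text above.

```python
MATCHES_PER_PAGE = 50  # a full Match List Page shows 50 matches


def total_matches(last_page, matches_on_last_page, per_page=MATCHES_PER_PAGE):
    """Every page before the last is full; the last page holds the remainder."""
    if last_page < 1 or not 1 <= matches_on_last_page <= per_page:
        raise ValueError("last page must hold between 1 and per_page matches")
    return (last_page - 1) * per_page + matches_on_last_page


print(total_matches(183, 12))  # prints 9112
```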

So how do you identify the last page?

You make use of the box at the top right of the page that allows you to jump to a specific page number.

If you enter a page number that is beyond the number of matches in your list, you will be shown this display:

So the trick is to find the page number directly before the lowest page which shows “nothing found”. It’s not as complicated as it reads.

The first thing to do is to enter a guesstimate number that is probably past your number of matches e.g. 300 (this would require 15,000 matches not to display the empty page).

What you do next depends on whether you get results or not on this candidate as the last page. I’ll describe both scenarios as a blow-by-blow.

(A) I see matches on page 300

If you actually get results, then keep entering an increased number in increments of 50 (an arbitrary choice) until you get the “no matches found” display.

Let’s say you entered page 350 and get a full list of matches, but see “no matches found” on page 400. You know that somewhere between 350 and 400 is the last page of matches.

At this point you could work backwards from 350 in decrements of 1 until you find the highest page that shows matches. As that could potentially be a tedious 49 clicks, you can use a technique known as the binary chop: drop down by half your current gap i.e. to page 375 (half of 50). If no match is found, drop down by a further half (half of half of 50) i.e. 363, and then to 357, then 353, and finally 351. At some point in this sequence of maximum 5 clicks, you’ll hit a page that has either a full list or a shortened list. If it’s the shortened list, you’ve found the last page (you can verify by hitting the next button to see “no matches found”). If you hit the full list, you just need to go up a page or two to find the shortened list.

(B) I see no matches on page 300

Keep entering page numbers in decrements of 50 (an arbitrary choice) until you see a page with matches.

If it’s a shortened list, then quickly verify that it is indeed the last page by using the next button to go up by 1 page.

If it’s a full page then you could at this point work upwards in increments of 1 until you find the first page that shows no matches. That is potentially 49 clicks which is tedious. You can use a technique known as the binary chop to save time.

Let’s say you dropped from page 250 to 200 where you found a full list of 50 matches. Go up by half your current gap i.e. to page 225. If matches are shown, then go up again by a further half (half of half of 50) i.e. 237, and then to 243, then 246, and finally 249. Eventually you’ll hit a page that has no matches or a shortened list. If it’s the shortened list, then bingo, you’ve found the last page (you can verify by hitting the next button to see “no matches found”). If you hit a page with no matches, you need to go back down a page or two to find the shortened list.
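For the programmatically minded, both scenarios reduce to the same search once you have one page that shows matches and one higher page that shows none. The sketch below is the textbook binary search rather than the exact click sequence described above. `page_has_matches` is a hypothetical stand-in for "jump to this page and look at what is displayed".

```python
def find_last_page(page_has_matches, low, high):
    """Return the highest page number that still shows matches.

    low  -- a page already known to show matches
    high -- a page already known to show "no matches found"
    """
    while high - low > 1:
        mid = (low + high) // 2
        if page_has_matches(mid):
            low = mid    # the last page is mid or later
        else:
            high = mid   # the last page is before mid
    return low


def fake_site(page):
    """Simulated site whose last page of matches is 183."""
    return page <= 183


print(find_last_page(fake_site, 150, 200))  # prints 183
```

Each check halves the gap, so the number of page loads grows with the logarithm of the gap rather than with the gap itself.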

And that’s it! Now just use the calculation: ((last page – 1) * 50) + remainder e.g. (182 * 50) + 12

Then make a note somewhere of what the actual last page is, so if you want to do another count in the future you can add an arbitrary gap and work back down with the binary chop.

A brief technical overview of data mining DNA matches on Ancestry

The purpose of this blog is to present some analysis of the available data from several DNA testing companies for one or more DNA kits. This post is a high-level description of how the data is retrieved and analysed from AncestryDNA.

AncestryDNA presents “Match List Pages” with each match as a row displaying the user name and a URL link. There are 50 matches per page. A kit owner may have hundreds of Match List Pages. We navigate through the pages by next/previous buttons at the top and bottom of the page. We can also jump to a specific page number by manipulating the numerals at the end of the Match List Page URL. This allows us to iterate through our pages by incrementing the URL page number by 1 until we come to a page that tells us there are no more matches to be displayed. Note that the page doesn’t throw an error, which is nice for programmatic reasons.

no matches found

A specific match has a number of attributes that are viewable from its “Match Page”, which can be opened by clicking on the match link on the “Match List Page”. The “Match Page” can also be opened directly (when logged in, of course) by entering the address of the specific URL in the browser.

The “Match Page” has three tabs with data of interest (tree, shared matches, geography). It also has attributes available from pop-up windows. The tabbed displays are loaded at run-time i.e. the list of shared matches is not available when the Match Page opens to the default Tree tab. These factors ensure that viewing all the attributes of interest requires a number of clicks, plus some latency in waiting for the new information to be displayed.

These are some of the attributes of interest:

  • How many centimorgans and segments is the match?
  • Does the match have a tree?
  • If the match has a tree, is it public?
  • If the match has a tree, is the kit linked to the tree?
  • If the match has a public tree, then the list of direct ancestor names displayed in the first tab becomes an attribute of interest.
  • If there are shared matches, then the “Shared Match” tab can be selected to display the list of shared matches.

To collect the attributes of interest for every match of a DNA kit, we simply have to:

open the “first” Match List Page
REPEAT
    FOR all the rows (match links) on the Match List Page
        open the Match Page
        collect the attributes of interest (needs a few clicks)
    open the “next” Match List Page
UNTIL there are no more Match List Pages
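The loop above can be sketched in Python as follows. This is an outline of the control flow only, not my actual scripts. `fetch_match_links` and `fetch_attributes` are hypothetical functions that would wrap whatever browser automation is used, and here they are replaced by fakes so the example runs on its own.

```python
def collect_all_matches(fetch_match_links, fetch_attributes, first_page=1):
    """Walk every Match List Page and gather attributes for each match.

    fetch_match_links(page_number) -> list of match links (empty when the
        site shows "no matches found")
    fetch_attributes(link) -> dict of attributes of interest for one match
    Both callables are hypothetical and supplied by the caller.
    """
    results = []
    page = first_page
    while True:
        links = fetch_match_links(page)
        if not links:  # ran past the last page of matches
            break
        for link in links:
            results.append(fetch_attributes(link))
        page += 1
    return results


# Fake data standing in for the website: two pages of match links.
fake_pages = {1: ["match/a", "match/b"], 2: ["match/c"]}


def fake_links(page):
    return fake_pages.get(page, [])


def fake_attributes(link):
    return {"link": link, "shared_cm": 6.0}


print(len(collect_all_matches(fake_links, fake_attributes)))  # prints 3
```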

When I first used Ancestry, I started to do this manually for my 4th cousin matches. After a very tedious Sunday of clicks, I looked online for an automated solution. I found several utilities and browser add-ons that will do the job and produce spreadsheets of attributes. They’re very good, but didn’t quite suit my purposes. So I devised my own process to deliver the same basic functionality.

Technically, I use Python scripts and a number of Python libraries that facilitate the interaction with web pages. I load my data to SQL and NoSQL data technologies to facilitate data analysis. I also use R for data analysis and presentation.

The add-ons and utilities would not be needed for Ancestry if the company would provide users with the facility to download their match data to file. They don’t seem to intend to provide this in the foreseeable future.

If any of us were to embark on manually retrieving all our match data from Ancestry, it would take weeks of clicking and copying. The utilities (mine and others) also take time to complete. The common factor is the number of matches that we have, but I expect most users will need many hours for any utility to complete.

How many Irish cousins: the impact of endogamy – Part 4 in Series

In previous posts I described the calculations used by several sources to predict the number of cousins we have at different degrees of cousinship (first, second etc). The calculations assume that our ancestors are not related to each other i.e. no first-cousin marriages or other types of inbreeding. If we have a significant number of ancestor couples who were in fact related, this has a two-fold impact on our own genealogy research.

First, it reduces the total number of cousins that we have. If our grandparents weren’t related then we gain different cousins from both sides of the relationship.

Second, it increases the genetic similarity between ourselves and our cousins. That skews the predictions of cousinship from DNA testing companies i.e. we would share similar amounts of DNA with our second cousins as other people share with their first cousins.

For Irish people trying to use DNA matching to build out their family tree and solve certain mysteries in their heritage, it’s important to know the likely incidence of genetic relationship amongst our recent and distant ancestors.

It’s worth defining a few academic terms used in research in this area.

“Endogamy” is the practice of marrying within a specific social group. It may be due to geographic isolation, or for religious reasons, or a sociological wish amongst a group to preserve traditions. The smaller the population of the group, the more likely that a proportion of couples will be related.

Consanguinity means being descended from the same ancestor as another person. So “consanguineous marriages” are marriages in which the couple are related.

Panmictic means random mating within a population i.e. free from influence of social, geographic or genetic preference. It is another stated assumption in the 23andMe research discussed in my previous post.

Okay, with those terms out of the way, let’s look at academic research which will determine whether the Irish can use with confidence the calculations and predicted figures of numbers of cousins.

Marriage between close family members is strictly illegal in Ireland. Marriage between first cousins is legal, but to be married in a Catholic church the couple must obtain a dispensation from the clerical authorities. If you see mention of a dispensation in ancestral marriage records, be sure to take a closer look.

J.G. Masterson studied the rate of first-cousin dispensations amongst Catholics in Ireland from 1959-1968. He found a rate of 1 in 720 for the entire island, and 1 in 625 for the Republic. That is less than 0.2%. Masterson also quotes a study from 1883 that simply asked people if they were the children of first cousins. A little less than 0.6% said that they were. I imagine that the falling trend over a hundred years was partly due to decreasing isolation of local populations (i.e. easier to travel).

So we can say in general that first cousin marriages are historically low amongst our Irish ancestors within the last hundred years.

The major exception is amongst a particular section of our population, the Travelling Community. Irish Government figures from 2003 reported that over 20% of marriages within the Travelling Community were between first cousins. This does mean that Travellers will have particular challenges in researching genetic ancestry. There are other endogamous communities taking great interest in similar research who may have useful methods for people from a Travelling background.

In 2017 Irish Travellers accounted for only about 0.6% of the Irish population. Therefore most Irish people researching their family history can assume a low level of first-cousin ancestral couples for well over a century. More distant consanguinity, such as third cousins, is more likely to be present due to limitations of travel. This will lower the numbers generated by the calculated model, but not with such significance as experienced by populations with higher rates of endogamy.

I conclude that most Irish people can take as ballpark figures the calculated numbers of cousins by degree, as long as a realistic birth rate is used.

How many Irish cousins: according to AncestryDNA – Part 3 in Series

My previous blog post was on research by scientists at 23andMe on predicting our number of cousins. I applaud 23andMe for the publication of the research in detail.

AncestryDNA have also researched the topic but as far as I can find, they release the information through the marketing department with big headline numbers and not a lot of detail.

Their most recent release of information was to mark World DNA day in April 2018. Unfortunately some news outlets reported that the Irish have 14,000 cousins while others reported Ancestry as saying that we have 14,000 LIVING cousins up to a distance of 8th cousin. The living distinction is important, but I was more disillusioned on realizing this number is up to 8th degree. I’d like to see the breakdown estimates at each degree, as done by 23andMe, but cannot find the details for Ireland.

AncestryDNA did conduct a detailed study using British birth rates and census data to produce statistics for the average British person; the numbers are shown here. The numbers are lower than in the 23andMe study by Henn et al. The formula used by Henn and Tim Urban to generate the predicted number of cousins is 2^i × (z-1) × z^i, where i is the degree of cousinship and z is the birth rate.

If AncestryDNA used the same formula as 23andMe (Henn) and Tim Urban (as discussed in my blog post here), then figuring out the birth rate they utilized is a matter of solving a quadratic equation. I took a shortcut by using symbolab. Plugging in a result of 5 for first cousins and 28 for second cousins, I calculated that if they’d used the formula then their birth rate would have been 2.3. However, solving the same equation for their figures of 3rd and 4th cousins produced different birth rates.
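For readers who would rather not use an online solver, the birth rate implied by a given cousin count can be found numerically. This sketch assumes the formula is 2^i × (z-1) × z^i, as described in the 23andMe post in this series, and uses bisection instead of solving the equation algebraically. The function names and the search bounds are my own choices for the example.

```python
def expected_cousins(birth_rate, degree):
    """Assumed formula: 2^i * (z - 1) * z^i for degree i and birth rate z."""
    return (2 ** degree) * (birth_rate - 1) * (birth_rate ** degree)


def implied_birth_rate(cousins, degree, low=1.0, high=20.0, tolerance=1e-9):
    """Find the birth rate that yields the given number of cousins.

    Works because expected_cousins rises steadily with the birth rate
    once the birth rate is at least 1.
    """
    if not expected_cousins(low, degree) <= cousins <= expected_cousins(high, degree):
        raise ValueError("cousin count is outside the searchable range")
    while high - low > tolerance:
        mid = (low + high) / 2
        if expected_cousins(mid, degree) < cousins:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Round trip as a sanity check: a rate of 2.5 should be recovered.
print(round(implied_birth_rate(expected_cousins(2.5, 2), 2), 3))  # prints 2.5
```

Calling `implied_birth_rate` with a published count and its degree (1 for first cousins, 2 for second cousins, and so on) shows whether the implied rate stays the same from one degree to the next.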

I’m reluctant to repeat numbers for which I can’t explain the provenance, but for the sake of completeness – here are Ancestry’s estimates for British users:

Whatever birth rate and formula they used, the Irish birth rate was significantly higher than Britain’s prior to 1990 so I’m not particularly interested in the British predictions for the purpose of this post. The problem of course is that Ancestry state that they used census data and other statistics going back 200 years for their calculations, and these records are not generally available for Ireland (no fault of Ancestry here, the sad fact is that Irish records pre-20th century are very patchy).

So it looks like all we have on Ireland from AncestryDNA is their reported calculation of a total of 14,000 living 1st to 8th cousins. How does that compare to the predicted totals from my previous posts – which don’t take into account whether these cousins are living or dead? Hard to say, as I can’t find any details as to how AncestryDNA calculated probability of living. There’s a pattern forming here. Even the ISOGG Wiki, which has a page on cousin statistics, has to cite the Daily Mirror as its source of AncestryDNA information. With all due respect to the Mirror, it’s a tabloid newspaper as opposed to the peer-reviewed journal that published the 23andMe research.

I do hope that AncestryDNA will produce the level of detail as done by 23andMe, but until then, their estimates are a bit of a bust for me. Knowing the assumed birth rate and other assumptions is important to assess whether the figures are realistic for Irish users. In my next post in this series I’ll discuss some of the assumptions and caveats to be considered.

How many Irish cousins: according to 23andMe – Part 2 in Series

Yesterday I wrote a blog post based on an entertaining article from Tim Urban to calculate our number of cousins at various distances using birth rate statistics. Today’s post is based on an academic study by researchers from 23andMe (Henn et al) which covers a wealth of complex analysis that includes a table of “expected number of cousins” at degree of cousinship. This table is reproduced in several websites without the detail of the formula that arrived at the very specific numbers.

It took me a few reads to realize that the formula used by Henn et al is exactly the same as what Tim Urban devised. The formula is kinda buried near the end of the article under a section titled “Calculation of expected number of individuals sharing DNA IBD”. Yep, that translates to “how many cousins do we have.”

Critically, this section specifies that a birth rate of 2.5 is used to produce the cousin numbers in the table. I’ll come to that later, as it may be more appropriate to the United States than to Irish people over a certain age.

I actually found the academic explanation easier to follow than Tim’s post. So now I’ll try to explain the formula as opposed to using it blindly to produce numbers. The formula is expressed using i as the degree of cousinship and z as birth rate: 2^i × (z-1) × z^i

Why 2 to the power of i? For our first cousins, we share two sets of grandparents, therefore two couples. For our second cousins we share four sets of great grandparents, therefore four couples. And so on up the generations, where the number of couples is equal to 2 to the power of the degrees of cousinship.

Why (z-1)? If one set of grandparents has four children, one of those children is our parent, and that person’s children are ourselves and our (on average) three siblings, not our cousins. So we need to remove 1 from the birth rate.

z to the power of i is the total number of non-ancestral cousins that stem from a particular ancestral generation. Thus the number of cousins from one set of ancestors is the product of (z-1) and z^i.  The full total is then achieved by multiplying by the number of ancestral couples for the degree of cousinship.
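To make the three pieces of the formula concrete, here is a short sketch that computes each factor separately. It uses the birth rate of 2.5 mentioned above. The function name is my own.

```python
def cousin_breakdown(birth_rate, degree):
    """Return (ancestral couples, cousins per couple, total cousins).

    couples    = 2^i
    per couple = (z - 1) * z^i
    total      = couples * per couple
    """
    couples = 2 ** degree
    per_couple = (birth_rate - 1) * birth_rate ** degree
    return couples, per_couple, couples * per_couple


# Degrees 1 to 4 at the birth rate of 2.5 used in the study.
for degree in range(1, 5):
    couples, per_couple, total = cousin_breakdown(2.5, degree)
    print(degree, couples, per_couple, total)
```

For first cousins this gives 2 couples with 3.75 cousins each, 7.5 in total.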

The table below has the totals for three different fertility rates in recent Irish history. The fourth line uses the fertility rate chosen by Henn et al for their numbers.

As befits a scholarly study, Henn et al also specify the assumptions that they made to simplify calculations. “Perfect survivorship” jumps out for me i.e. that every offspring at every generation lives to produce the average number of children. Given that the user base of 23andMe is predominantly American and with disposable income for personal DNA testing, this might be a reasonable assumption for their purposes. I do think we Irish need to pay some attention to the impact of our mortality rates up to the 1970s.

They also list other assumptions which I’d like to consider. But I’ll save that for another post in this “how many Irish cousins” series. In my next post I want to look at the quoted numbers from AncestryDNA.

How many Irish cousins: according to Tim Urban – Part 1 in Series

Tim Urban wrote an entertaining blog post in 2014 on calculating a ballpark number of cousins based on your country’s average birth statistics. His formula breaks down the totals by degree of cousin i.e. 1st/2nd/3rd/4th and outward.

Tim calculates numbers for USA, UK, Canada and a few other countries, but not Ireland – so I figured I’d crunch the figures for green.

Tim’s formula is (n-1) × 2^d × n^d, where “n” is the fertility rate and “d” is the degree of cousin.

NationMaster reports 2013 fertility rates of 1.9 for the U.K., 2.01 for Ireland, and 2.06 for the U.S. Here are the totals for the three countries, with Ireland in the middle line.

Figures for 2013

For now, ignore the eye-watering totals at fifth cousin and beyond. I took one look at the predicted total of 4.1 for First Cousins and rechecked my calculation for a mistake. It just looked suspiciously low to me. My reaction wasn’t based solely on my own family. From the life-cycle of weddings, christenings, and funerals, you tend to have a passing familiarity with the family structures of your friends and neighbours.

A moment’s thought reminded me that Ireland’s birth rate has dropped in recent decades from one of the highest in Europe to closer to the average. The nearest published rate I could find for my birth year was 1970’s rate of 3.87 which predicts 22 first cousins using Tim’s formula. Quite the difference! The same publication reported fertility rates of 2.08 for 1989 and about 3.9 for 1950.

Here are the figures for Ireland only for those years:

Figures for Ireland, 1970 and other years

Now you can shift your eyes right, and look at the totals of more distant cousins. Again, I thought I’d got the calculations wrong, this time because those numbers were so big. Thankfully, John Reid of Anglo-Celtic Canada Connections has taken Tim’s formula and applied it across a wide range of fertility rates, so a spot check stopped me from fretting.

I’ve seen other reported projections of numbers of cousins, including one from Ancestry and one from 23andMe. I’ll address them in other blog posts and compare them with Tim’s.

How your DNA kit can fail to process

The number of people submitting DNA samples is rising, with a corresponding rise in the number of people on forums reporting that they have been asked by the test company to submit another sample.
AncestryDNA say in their FAQ that:

During the testing process, each DNA sample is held to a quality standard of at least a 98% call rate. Any results that don’t meet that standard may require a new DNA sample to be collected.

A call rate? I don’t see that term explained elsewhere in their FAQ so I thought it worth a blog post on how the companies determine that a particular kit doesn’t meet their quality standards. As well as “call rate”, I’ll explain a few other terms along the way.
When the test lab receives your DNA sample, they do not analyse the entire collection of DNA in your swab or spit. The technology right now doesn’t scale up to deal with every gene. Instead, your sample is broken down to identify a set of very specific positions within your chromosomes.

We humans are basically the same as each other across 99.9% of our DNA sequence. That means that at many positions your DNA will be identical between you, your sister Sally, and that guy who delivered the kit to your door. Unless “that guy” is also your brother, then most parts of your DNA sequence will not allow the lab to distinguish between your siblings and strangers at the door.

Instead, the testing companies target positions where there is likely to be genetic variation within our species. These variants at a particular position are known as Single-Nucleotide Polymorphisms, or SNPs for short. Don’t worry too much about the mouthful, but be aware that there are only four types of nucleotides and they go by the code names of A, C, G or T.

You’re probably familiar with the DNA helix, so where do these codes fit into it? In this picture of chromosomes, there are two long strings coiled around each other. Each string is a sequence of nucleotides: A, C, G or T, in a particular order.

So let’s take the sequence AGTCAAGTCAAGTC. You and big sis Sally share that sequence. But what about the delivery guy? His sequence is AGTCAAGCCAAGTC. Is it the same?

Let’s line them up to see:
One nucleotide at a particular position is different between you and that guy.
Suppose we looked at the same sequence for the delivery guy’s brother. It would probably have C instead of T too.
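Lining up two sequences and spotting the position that differs can also be done in a few lines of Python. The second sequence below is an assumption made for illustration: it swaps a single T for a C, and the position of the swap was chosen arbitrarily.

```python
def differing_positions(seq_a, seq_b):
    """List (position, base_a, base_b) wherever two equal-length sequences differ.

    Positions are counted from 1, as you would when reading along the string.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length to line up")
    return [
        (index + 1, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


yours = "AGTCAAGTCAAGTC"
delivery_guy = "AGTCAAGCCAAGTC"  # illustrative: one T swapped for a C

print(differing_positions(yours, delivery_guy))  # prints [(8, 'T', 'C')]
```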

But wait! It might not. The delivery guy and his brother could have inherited the DNA at that precise position from different parents. Or they could differ due to the random shuffling in DNA during reproduction. Due to inheritance and random variation, the labs need to test hundreds of thousands of alleles to let statistics take over and make us confident that we can use the average variance across SNPs to determine how much we match our siblings versus our non-relatives.
Identifying the nucleotide code of each SNP is known as “calling” the SNP.

Now, here’s the problem. Sometimes the lab equipment cannot identify the nucleotide code.  This can be due to contamination, or to equipment error, or to the way that DNA is broken down for the analysis.

If an SNP cannot be identified, this is known as a “no-call”.

The overall call rate of your DNA sample is the number of SNPs that were coded divided by the total number of SNPs that the lab tried to code. So if more than 2% fail to be coded then the kit will probably be retested, and if it continues to fail, you will be asked to send a new sample. Yep, that’s the delivery guy back at the door:
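The call rate calculation itself is simple enough to sketch. The 98% threshold comes from the AncestryDNA FAQ quoted above. The `"--"` marker for a no-call and the sample data are assumptions for the example, not a description of any lab’s actual file format.

```python
NO_CALL = "--"  # assumed marker for an SNP that could not be identified
QUALITY_THRESHOLD = 0.98  # the 98% call rate standard quoted in the FAQ


def call_rate(calls):
    """Fraction of attempted SNPs that were successfully called."""
    if not calls:
        raise ValueError("no SNPs were attempted")
    called = sum(1 for call in calls if call != NO_CALL)
    return called / len(calls)


def meets_standard(calls, threshold=QUALITY_THRESHOLD):
    """True when the sample's call rate reaches the threshold."""
    return call_rate(calls) >= threshold


# Made-up sample: 1,000 SNPs attempted, 15 of them no-calls.
sample = ["AG"] * 985 + [NO_CALL] * 15
print(call_rate(sample), meets_standard(sample))  # prints 0.985 True
```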