A brief technical overview of data mining DNA matches on Ancestry

The purpose of this blog is to present some analysis of the available data from several DNA testing companies for one or more DNA kits. This post is a high-level description of how the data is retrieved and analysed from AncestryDNA.

AncestryDNA presents “Match List Pages” with each match as a row displaying the user name and a URL link. There are 50 matches per page. A kit owner may have hundreds of Match List Pages. We navigate through the pages by next/previous buttons at the top and bottom of the page. We can also jump to a specific page number by manipulating the numerals at the end of the Match List Page URL. This allows us to iterate through our pages by incrementing the URL page number by 1 until we come to a page that tells us there are no more matches to be displayed. Note that the page doesn’t throw an error, which is nice for programmatic reasons.

no matches found

A specific match has a number of attributes that are viewable from its “Match Page”, which can be opened by clicking on the match link on the “Match List Page”. The “Match Page” can also be opened directly (when logged in, of course) by entering the address of the specific URL in the browser.

The “Match Page” has three tabs with data of interest (tree, shared matches, geography). It also has attributes available from pop-up windows. The tabbed displays are loaded at run-time, i.e. the list of shared matches is not available when the Match Page opens to the default Tree tab. As a result, viewing all the attributes of interest requires a number of clicks, plus some latency while waiting for the new information to be displayed.

These are some of the attributes of interest:

  • How many centimorgans and segments does the match share?
  • Does the match have a tree?
  • If the match has a tree, is it public?
  • If the match has a tree, is the kit linked to the tree?
  • If the match has a public tree, then the list of direct ancestor names displayed in the first tab becomes an attribute of interest.
  • If there are shared matches, then the “Shared Match” tab can be selected to display the list of shared matches.

To collect the attributes of interest for every match of a DNA kit, we simply have to:

open the “first” Match List Page
REPEAT
    FOR all the rows (match links) on the Match List Page
        open the Match Page
        collect the attributes of interest (needs a few clicks)
    open the “next” Match List Page
UNTIL there are no more Match List Pages
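The loop above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the actual scripts behind this blog: `crawl_matches`, `fetch_match_links` and `collect_attributes` are hypothetical names, and the two callables stand in for whatever browser-automation code opens a Match List Page by number and reads the attributes from a Match Page. The sketch only relies on the behaviour described earlier, that a page number past the end returns an empty list of matches rather than an error.

```python
from typing import Callable, Dict, Iterator, List


def crawl_matches(
    fetch_match_links: Callable[[int], List[str]],
    collect_attributes: Callable[[str], Dict[str, object]],
    first_page: int = 1,
) -> Iterator[Dict[str, object]]:
    """Yield one attribute record per match, walking the list pages in order.

    Both callables are hypothetical stand-ins supplied by the caller:
    fetch_match_links(page_number) returns the match links on that page
    (an empty list when there are no more matches), and
    collect_attributes(link) returns the attributes for one match.
    """
    page = first_page
    while True:
        links = fetch_match_links(page)
        if not links:
            # The "no more matches" page: nothing to catch, just stop.
            break
        for link in links:
            yield collect_attributes(link)
        page += 1


if __name__ == "__main__":
    # Stand-in data so the sketch runs offline: two pages of made-up links.
    fake_pages = {1: ["match/a", "match/b"], 2: ["match/c"]}
    records = crawl_matches(
        fetch_match_links=lambda n: fake_pages.get(n, []),
        collect_attributes=lambda link: {"link": link},
    )
    for record in records:
        print(record)
```

Passing the page-fetching and attribute-collecting steps in as functions keeps the loop itself independent of any particular web library, and makes it easy to try out with made-up data before pointing it at real pages.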

When I first used Ancestry, I started to do this manually for my 4th cousin matches. After a very tedious Sunday of clicks, I looked online for an automated solution. I found several utilities and browser add-ons that will do the job and produce spreadsheets of attributes. They’re very good, but didn’t quite suit my purposes. So I devised my own process to deliver the same basic functionality.

Technically, I use Python scripts and a number of Python libraries that facilitate the interaction with web pages. I load my data to SQL and NoSQL data technologies to facilitate data analysis. I also use R for data analysis and presentation.

The add-ons and utilities would not be needed for Ancestry if the company would provide users with the facility to download their match data to file. They don’t seem to intend to provide this in the foreseeable future.

If any of us were to embark on manually retrieving all our match data from Ancestry, it would take weeks of clicking and copying.  The utilities (mine and others) also take time to complete. The common factor is the number of matches that we have, but I expect most users will need many hours for any utility to complete.

How many Irish cousins: the impact of endogamy – Part 4 in Series

In previous posts I described the calculations used by several sources to predict the number of cousins we have at different degrees of cousinship (first, second etc). The calculations assume that our ancestors are not related to each other i.e. no first-cousin marriages or other types of inbreeding. If we have a significant number of ancestor couples who were in fact related, this has a two-fold impact on our own genealogy research.

First, it reduces the total number of cousins that we have. If our grandparents weren’t related then we gain different cousins from both sides of the relationship.

Second, it increases the genetic similarity between ourselves and our cousins. That skews the predictions of cousinship from DNA testing companies i.e. we would share similar amounts of DNA with our second cousins as other people share with their first cousins.

For Irish people trying to use DNA matching to build out their family tree and solve certain mysteries in their heritage, it’s important to know the likely incidence of genetic relationship amongst our recent and distant ancestors.

It’s worth defining a few academic terms used in research in this area.
“Endogamy” is the practice of marrying within a specific social group. It may be due to geographic isolation, or for religious reasons, or a sociological wish amongst a group to preserve traditions. The smaller the population of the group, the more likely that a proportion of couples will be related.

Consanguinity means being descended from the same ancestor as another person. So “consanguineous marriages” are marriages in which the couple are related.

Panmictic means random mating within a population i.e. free from influence of social, geographic or genetic preference. It is another stated assumption in the 23andMe research discussed in my previous post.

Okay, with those terms out of the way, let’s look at academic research which will determine whether the Irish can use with confidence the calculations and predicted figures of numbers of cousins.

Marriage between close family members is strictly illegal in Ireland. Marriage between first cousins is legal, but to be married in a Catholic church the couple must obtain a dispensation from the clerical authorities. If you see mention of a dispensation in ancestral marriage records, be sure to take a closer look.

J.G. Masterson studied the rate of first-cousin dispensations amongst Catholics in Ireland from 1959-1968. He found a rate of 1 in 720 for the entire island, and 1 in 625 for the Republic. That is less than 0.2%. Masterson also quotes a study from 1883 that simply asked people if they were the children of first cousins. A little less than 0.6% said that they were. I imagine that the falling trend over a hundred years was partly due to decreasing isolation of local populations (i.e. easier to travel).

So we can say in general that first cousin marriages are historically low amongst our Irish ancestors within the last hundred years.

The major exception is amongst a particular section of our population, the Travelling Community. Irish Government figures from 2003 reported that over 20% of marriages within the Travelling Community were between first cousins. This does mean that Travellers will have particular challenges in researching genetic ancestry. There are other endogamous communities taking great interest in similar research who may have useful methods for people from a Travelling background.

In 2017 Irish Travellers accounted for only about 0.6% of the Irish population. Therefore most Irish people researching their family history can assume a low level of first-cousin ancestral couples for well over a century. More distant consanguinity, such as third cousins, is more likely to be present due to limitations of travel. This will lower the numbers generated by the calculated model, but not to the extent experienced by populations with higher rates of endogamy.

I conclude that most Irish people can take as ballpark figures the calculated numbers of cousins by degree, as long as a realistic birth rate is used.

How many Irish cousins: according to AncestryDNA – Part 3 in Series

My previous blog post was on research by scientists at 23andMe on predicting our number of cousins. I applaud 23andMe for the publication of the research in detail.

AncestryDNA have also researched the topic but as far as I can find, they release the information through the marketing department with big headline numbers and not a lot of detail.

Their most recent release of information was to mark World DNA day in April 2018. Unfortunately some news outlets reported that the Irish have 14,000 cousins while others reported Ancestry as saying that we have 14,000 LIVING cousins up to a distance of 8th cousin. The living distinction is important, but I was more disillusioned on realizing this number is up to 8th degree. I’d like to see the breakdown estimates at each degree, as done by 23andMe, but cannot find the details for Ireland.

AncestryDNA did conduct a detailed study using British birth rates and census data to produce statistics for the average British person; the numbers are shown here. The numbers are lower than in the 23andMe study by Henn et al. The formula used by Henn and Tim Urban to generate the predicted number of cousins is 2^i × (z-1) × z^i, where i is the degree of cousinship and z is the birth rate.

If AncestryDNA used the same formula as 23andMe (Henn) and Tim Urban (as discussed in my blog post here), then figuring out the birth rate they utilized is a matter of solving a quadratic equation. I took a shortcut by using Symbolab. Plugging in a result of 5 for first cousins and 28 for second cousins, I calculated that if they’d used the formula then their birth rate would have been 2.3. However, solving the same equation for their figures of 3rd and 4th cousins produced different birth rates.
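For anyone who would rather not solve the equation by hand, the implied birth rate for a single degree of cousinship can be found numerically. The following is a minimal sketch, assuming the 2^i × (z-1) × z^i formula discussed in this series; the function names and the search bounds of 1 to 20 are my own illustrative choices, and nothing here is known to be the method that AncestryDNA used.

```python
def expected_cousins(z: float, i: int) -> float:
    """Predicted number of i-th cousins for a birth rate z (formula from this series)."""
    return (2 ** i) * (z - 1) * (z ** i)


def implied_birth_rate(target: float, i: int,
                       lo: float = 1.0, hi: float = 20.0,
                       tol: float = 1e-9) -> float:
    """Find the birth rate z that would produce `target` i-th cousins.

    Uses bisection, which works because expected_cousins rises steadily
    with z once z is 1 or more. The bounds lo and hi are assumptions.
    """
    if not expected_cousins(lo, i) <= target <= expected_cousins(hi, i):
        raise ValueError("target is outside the range covered by lo..hi")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_cousins(mid, i) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Cousin counts quoted in the text: 5 first cousins, 28 second cousins.
    for degree, count in ((1, 5), (2, 28)):
        print(degree, count, implied_birth_rate(count, degree))
```

Running the solver separately for each degree is a quick way to check whether a published set of cousin numbers is consistent with a single birth rate under this formula.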

I’m reluctant to repeat numbers for which I can’t explain the provenance, but for the sake of completeness – here are Ancestry’s estimates for British users:

Whatever birth rate and formula they used, the Irish birth rate was significantly higher than Britain prior to 1990 so I’m not particularly interested in the British predictions for the purpose of this post. The problem of course is that Ancestry state that they used census data and other statistics going back 200 years for their calculations, and these records are not generally available for Ireland (no fault of Ancestry here, the sad fact is that Irish records pre-20th century are very patchy).

So it looks like all we have on Ireland from AncestryDNA is their reported calculation of a total of 14,000 living 1st to 8th cousins. How does that compare to the predicted totals from my previous posts, which don’t take into account whether these cousins are living or dead? Hard to say, as I can’t find any details as to how AncestryDNA calculated probability of living. There’s a pattern forming here. Even the ISOGG Wiki, which has a page on cousin statistics, has to cite the Daily Mirror as its source of AncestryDNA information. With all due respect to the Mirror, it’s a tabloid newspaper as opposed to the peer-reviewed journal that published the 23andMe research.

I do hope that AncestryDNA will produce the level of detail as done by 23andMe, but until then, their estimates are a bit of a bust for me. Knowing the assumed birth rate and other assumptions is important to assess whether the figures are realistic for Irish users. In my next post in this series I’ll discuss some of the assumptions and caveats to be considered.

How many Irish cousins: according to 23andMe – Part 2 in Series

Yesterday I wrote a blog post based on an entertaining article from Tim Urban to calculate our number of cousins at various distances using birth rate statistics. Today’s post is based on an academic study by researchers from 23andMe (Henn et al) which covers a wealth of complex analysis that includes a table of “expected number of cousins” at degree of cousinship. This table is reproduced in several websites without the detail of the formula that arrived at the very specific numbers.

It took me a few reads to realize that the formula used by Henn et al is exactly the same as what Tim Urban devised. The formula is kinda buried near the end of the article under a section titled “Calculation of expected number of individuals sharing DNA IBD”. Yep, that translates to “how many cousins do we have.”

Critically, this section specifies that a birth rate of 2.5 is used to produce the cousin numbers in the table. I’ll come to that later, as it may be more appropriate to the United States than to Irish people over a certain age.

I actually found the academic explanation easier to follow than Tim’s post. So now I’ll try to explain the formula as opposed to using it blindly to produce numbers. The formula is expressed using i as the degree of cousinship and z as birth rate: 2^i × (z-1) × z^i

Why 2 to the power of i? For our first cousins, we share two sets of grandparents, therefore two couples. For our second cousins we share four sets of great grandparents, therefore four couples. And so on up the generations, where the number of couples is equal to 2 to the power of the degrees of cousinship.

Why (z-1)? If one set of grandparents have four children, one of those children is our parent and that person’s children are ourselves and our average of three siblings so we need to remove 1 from the birth rate.

z to the power of i is the total number of non-ancestral cousins that stem from a particular ancestral generation. Thus the number of cousins from one set of ancestors is the product of (z-1) and z^i.  The full total is then achieved by multiplying by the number of ancestral couples for the degree of cousinship.
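The three pieces described above can be put together in a few lines of Python. This is a minimal sketch of the formula as I have explained it, not code from Henn et al or Tim Urban; the function name is my own, and the rates used are the ones quoted in this series (3.87 for Ireland in 1970, 2.08 for 1989, 2.01 for 2013, and the 2.5 used by Henn et al).

```python
def expected_cousins(z: float, i: int) -> float:
    """Predicted number of i-th cousins for a birth rate of z.

    2 ** i   : number of ancestral couples at that degree of cousinship
    (z - 1)  : the ancestral couple's children, minus our own ancestor
    z ** i   : descendants of each of those children down to our generation
    """
    couples = 2 ** i
    cousins_per_couple = (z - 1) * (z ** i)
    return couples * cousins_per_couple


if __name__ == "__main__":
    # Birth rates quoted in this series; degrees 1 to 4 only.
    for rate in (3.87, 2.08, 2.01, 2.5):
        row = [round(expected_cousins(rate, degree), 1) for degree in range(1, 5)]
        print(rate, row)
```

For example, a birth rate of 2.5 gives 2 × 1.5 × 2.5 = 7.5 first cousins, and 4 × 1.5 × 6.25 = 37.5 second cousins.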

The table below has the totals for three different fertility rates in recent Irish history. The fourth line uses the fertility rate chosen by Henn et al for their numbers.

As befits a scholarly study, Henn et al also specify the assumptions that they made to simplify calculations. “Perfect survivorship” jumps out for me, i.e. that every offspring at every generation lives to produce the average number of children. Given that the user base of 23andMe is predominantly American and with disposable income for personal DNA testing, this might be a reasonable assumption for their purposes. I do think we Irish need to pay some attention to the impact of our mortality rates up to the 1970s.

They also list other assumptions which I’d like to consider. But I’ll save that for another post in this “how many Irish cousins” series. In my next post I want to look at the quoted numbers from AncestryDNA.

How many Irish cousins: according to Tim Urban – Part 1 in Series

Tim Urban wrote an entertaining blog post in 2014 on calculating a ballpark number of cousins based on your country’s average birth statistics. His formula breaks down the totals by degree of cousin i.e. 1st/2nd/3rd/4th and outward.

Tim calculates numbers for USA, UK, Canada and a few other countries, but not Ireland – so I figured I’d crunch the figures for green.

Tim’s formula is (n-1) × 2^d × n^d, where “n” is the fertility rate and “d” is the degree of cousin.

NationMaster reports 2013 fertility rates of 1.9 for the U.K., 2.01 for Ireland, and 2.06 for the U.S. Here are the totals for the three countries, with Ireland in the middle line.

Figures for 2013

For now, ignore the eye-watering totals at fifth cousin and beyond. I took one look at the predicted total of 4.1 for First Cousins and rechecked my calculation for a mistake. It just looked suspiciously low to me. My reaction wasn’t based solely on my own family. From the life-cycle of weddings, christenings, and funerals, you tend to have a passing familiarity of the family structures of your friends and neighbours.

A moment’s thought reminded me that Ireland’s birth rate has dropped in recent decades from one of the highest in Europe to closer to the average. The nearest published rate I could find for my birth year was 1970’s rate of 3.87 which predicts 22 first cousins using Tim’s formula. Quite the difference! The same publication reported fertility rates of 2.08 for 1989 and about 3.9 for 1950.

Here are the figures for Ireland only for those years:

figures Ireland 1970 and other years

Now you can shift your eyes right, and look at the totals of more distant cousins. Again, I thought I’d got the calculations wrong, this time because those numbers were so big. Thankfully, John Reid of Anglo-Celtic Canada Connections has taken Tim’s formula and applied it across a wide range of fertility rates, so a spot check stopped me from fretting.

I’ve seen other reported projections of numbers of cousins, including one from Ancestry and one from 23andMe. I’ll address them in other blog posts and compare them with Tim’s.

How your DNA kit can fail to process

The number of people submitting DNA samples is rising, with a corresponding rise in the number of people on forums reporting that they have been asked by the test company to submit another sample.
AncestryDNA say in their FAQ that:

During the testing process, each DNA sample is held to a quality standard of at least a 98% call rate. Any results that don’t meet that standard may require a new DNA sample to be collected.

A call rate? I don’t see that term explained elsewhere in their FAQ so I thought it worth a blog post on how the companies determine that a particular kit doesn’t meet their quality standards. As well as “call rate”, I’ll explain a few other terms along the way.
When the test lab receives your DNA sample, they do not analyse the entire collection of DNA in your swab or spit. The technology right now doesn’t scale up to deal with every gene. Instead, your sample is broken down to identify a set of very specific positions within your chromosomes.

We humans are basically the same as each other across 99.9% of our DNA sequence. That means that at many positions your DNA will be identical between you, your sister Sally, and that guy who delivered the kit to your door. Unless “that guy” is also your brother, then most parts of your DNA sequence will not allow the lab to distinguish between your siblings and strangers at the door.

Instead, the testing companies target positions where there is likely to be genetic variation within our species. These variants at a particular position are known as Single-Nucleotide Polymorphisms, or SNPs for short. Don’t worry too much about the mouthful, but be aware that there are only four types of nucleotides and they go by the code names of A, C, G or T.

You’re probably familiar with the DNA helix, so where do these codes fit into it? In this picture of chromosomes, there are two long strings coiled around each other. Each string is a sequence of nucleotides: A, C, G or T, in a particular order.

So let’s take the sequence AGTCAAGTCAAGTC. You and big sis Sally share that sequence. But what about the delivery guy? His sequence is AGTCAAGCCAAGTC. Is it the same?

Let’s line them up to see:
One nucleotide at a particular position is different between you and that guy.
Suppose we looked at the same sequence for the delivery guy’s brother. It would probably have C instead of T too.
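Lining up two sequences by eye gets tedious quickly, so here is a minimal Python sketch of the same comparison. The function name is my own, and the two sequences are made-up illustrations rather than real genotype data: the second one simply has a C where the first has a T.

```python
from typing import List


def differing_positions(seq_a: str, seq_b: str) -> List[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length to line them up")
    return [
        position
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


if __name__ == "__main__":
    you = "AGTCAAGTCAAGTC"           # illustrative sequence
    delivery_guy = "AGTCAAGCCAAGTC"  # same, but with C instead of T at one position
    for position in differing_positions(you, delivery_guy):
        print(position, you[position - 1], delivery_guy[position - 1])
```

Each position reported by the function is a place where the two people carry different nucleotides, which is exactly the kind of position the labs are interested in.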

But wait! It might not. The delivery guy and his brother could have inherited the DNA at that precise position from different parents. Or they could differ due to the random shuffling in DNA during reproduction. Due to inheritance and random variation, the labs need to test hundreds of thousands of alleles to let statistics take over and make us confident that we can use the average variance across SNPs to determine how much we match our siblings versus our non-relatives.
Identifying the nucleotide code of each SNP is known as “calling” the SNP.

Now, here’s the problem. Sometimes the lab equipment cannot identify the nucleotide code.  This can be due to contamination, or to equipment error, or to the way that DNA is broken down for the analysis.

If an SNP cannot be identified, this is known as a “no-call”.

The overall call rate of your DNA sample is the number of SNPs that were coded divided by the total number of SNPs that the lab tried to code. So if more than 2% fail to be coded then the kit will probably be retested, and if it continues to fail, you will be asked to send a new sample. Yep, that’s the delivery guy back at the door.
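The call rate arithmetic is simple enough to sketch in Python. This is only an illustration of the definition above, not how any lab’s software actually works: the function names are my own, a no-call is represented here by `None`, and the 98% threshold is the figure quoted from the AncestryDNA FAQ.

```python
from typing import List, Optional


def call_rate(calls: List[Optional[str]]) -> float:
    """Fraction of attempted SNPs that were successfully called.

    Each entry is the called genotype for one SNP, or None for a no-call
    (the None convention is an assumption made for this sketch).
    """
    if not calls:
        raise ValueError("no SNPs were attempted")
    called = sum(1 for call in calls if call is not None)
    return called / len(calls)


def meets_standard(calls: List[Optional[str]], threshold: float = 0.98) -> bool:
    """True if the sample reaches the quality standard (at least 98% by default)."""
    return call_rate(calls) >= threshold


if __name__ == "__main__":
    # 100 attempted SNPs with 3 no-calls: a 97% call rate, below the standard.
    sample = ["A"] * 97 + [None] * 3
    print(call_rate(sample), meets_standard(sample))
```

A real kit has hundreds of thousands of SNPs rather than 100, but the calculation is the same: count the calls, divide by the attempts, and compare against the threshold.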

Your 5th cousin isn’t a DNA match? Don’t jump to conclusions!

In Ireland, the “townland” is a small area of land supporting a number of households. The townland of Creeny (or Creeney) is in County Cavan. In 1901 there were six households in Creeny. This reduced to five in 1911.

Creeny

Two of the households have the surname of Gamble. Thomas and Mary Gamble sold their farm in 1920 and moved to the nearby town of Belturbet. They were my great grandparents, and I’m still trying to establish the precise relationship between the two Gamble families who were on neighbouring farms from at least the 1850s up to 1920.

My aunt told me the families were distant relations, but so far back that she didn’t know the common ancestry. So I built the Gamble line of both families as far back as Irish records allow, which unfortunately gets very sparse before the mid 1850s. I know that in 1855 there were two households headed respectively by John (my known ancestor) and Edward (ancestor of the “other” Gambles). Jump back to 1825 and there is only one Gamble in Creeny according to the tithe records: a Thomas Gamble. My guess was that John and Edward were the sons of Thomas, who subdivided his farm between them. But without paper evidence I’m stuck.

Enter DNA. Or the lack thereof.

I was very interested when I spotted the descendants of the “other” Gambles in a family tree on Ancestry. The owner of the tree told me that her brother had his DNA kit on Ancestry. I was crestfallen to find that he wasn’t in my DNA matches amongst the “5th-8th Cousins”. Does that mean there is no familial connection? That the proximity of two families of the same relatively unusual surname was by chance? No, not at all.

My theory is that her brother and I are descended from my great-great-great-great grandparents, so we would be fifth cousins. It just so happens that the Ancestry White Paper on DNA matching uses the example of 5th cousins to show clearly that some descendants at that distance may have matching segments, but some will not. This is due to the random nature of how DNA shifts around during transmission through the generations. The nicely illustrated example is in the early pages of the document and is worth a read.

The only conclusion that I can draw is that I can draw no conclusion from the absence of a DNA match at this distance. Of course, turning this on its head is more dispiriting. If the brother did have a small match to me, such as 6 cM, then it could also have been by chance.

In the case of the Gambles of Creeny, I personally attach weight to the family lore that the households were distant cousins. In rural areas it was important to keep track of such things. I’ve added the “other” Gambles to my family tree with a note that the connection is still speculation.