
The authors have declared that no competing interests exist.

Analyzed the data: LF TW. Wrote the paper: LF TW. Designed the study: TW LF. Performed calculations and derived theorems: LF FD. Performed computer simulations: LF TW.

The coalescent with recombination is a fundamental model to describe the genealogical history of DNA sequence samples from recombining organisms. Considering recombination as a process which acts along genomes and which creates sequence segments with shared ancestry, we study the influence of single recombination events upon tree characteristics of the coalescent. We focus on properties such as tree height and tree balance and quantify analytically the changes in these quantities incurred by recombination in terms of probability distributions. We find that changes in tree topology are often relatively mild under conditions of neutral evolution, while changes in tree height are on average quite large. Our results add to a quantitative understanding of the spatial coalescent and provide the neutral reference to which the impact by other evolutionary scenarios, for instance tree distortion by selective sweeps, can be compared.

Coalescent theory is a central part of modern population genetics

The classical coalescent is a binary, rooted, unordered tree with a fixed number

A: Tree of size

The spatial coalescent is the sequence

Here, we investigate the impact of single recombination events upon some measures of tree topology and shape. By

A: Non-silent recombination changes tree topology. In the case shown, also

Our goals are to formalize these concepts, to characterize in more detail the effect of single recombination events upon tree shape and to quantify the relative frequencies of drastic, mild and silent events. We explicitly calculate the probabilities of changes in height or root balance induced by a single recombination event. Our results are based on the assumption of a standard neutral model of constant population size. This means that for each coalescent event two lineages are chosen at random to merge. Further, the timing of events is exponentially distributed with a rate which, after re-scaling by population size
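The standard neutral model described above can be illustrated with a minimal simulation of tree height. This is a sketch under stated assumptions, not the procedure used for the simulations in this paper. It assumes the usual coalescent time scaling, in which the waiting time while k lineages are present is exponential with rate k(k-1)/2. The function names `simulate_tree_height` and `expected_tree_height`, the sample size and the number of replicates are chosen here for illustration only.

```python
import random


def simulate_tree_height(n, rng):
    """Height (time to the most recent common ancestor) of one neutral
    coalescent tree of size n, in coalescent time units.

    Assumption: with k lineages present, the next coalescent event occurs
    after an exponential waiting time with rate k * (k - 1) / 2.
    """
    height = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2.0
        height += rng.expovariate(rate)
    return height


def expected_tree_height(n):
    """Sum of the mean waiting times 2 / (k * (k - 1)) for k = 2..n."""
    return 2.0 * (1.0 - 1.0 / n)


rng = random.Random(1)          # fixed seed, for reproducibility only
n = 10                          # illustrative sample size
heights = [simulate_tree_height(n, rng) for _ in range(20000)]
mean_height = sum(heights) / len(heights)
print(mean_height, expected_tree_height(n))
```

Only the waiting times matter for tree height; which two lineages merge at each event affects topology but not height, so the sketch does not track it.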

In Results Section (a), we define a probability density for the trees in the spatial coalescent and we explain the difference between pointwise marginal trees

We remind the reader that the spatial coalescent is a non-Markovian process and is not completely determined by transitions of any finite order. However, it is a homogeneous process. Therefore, first-order transition probabilities are well-defined and independent of the position in the sequence. Here, we compute first-order probabilities for single recombination events from one tree to the next, averaging over all trees of the ARG which are not directly involved in the recombination event considered. Therefore, our results hold for the spatial coalescent as described by the ARG

We consider a sample of

Furthermore, all trees in

Modeling recombination as an ARG

Given a tree

Recombination events may change the shape of the tree. The local tree at position

Sequence

The standard coalescent without recombination is recovered when looking at the tree for a single position

On the other hand, picking a tree from a random sequence

Shown are the height distribution of trees in

One simulation run using

In fact, the two distributions differ by weights which are proportional to the length

The argument leading to eq (3) can be made rigorous under the assumption of infinitely long chromosomes, using the fact that the coalescent with recombination is an ergodic process

Furthermore, any marginal tree obtained from an ARG (conditioned on the number of recombinations in the sequence) by choosing randomly an ancestral lineage for every recombination event is distributed according to

Note that the two distributions,

The right hand side of

The distributions

Recombination can be interpreted as a random prune-and-regraft event on the tree
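The prune-and-regraft operation can be sketched on the level of topology alone. The following is a minimal illustration, not the formalism used in this paper: it ignores branch lengths and event times, and the child-to-parent dictionary representation, the function name `prune_and_regraft` and the node labels are assumptions made for this example.

```python
def prune_and_regraft(parent, pruned, target):
    """Return a new child -> parent map after pruning the branch above
    `pruned` and re-grafting it onto the branch above `target`.

    `parent` maps every node of a rooted binary tree to its parent
    (the root maps to None). Only topology is modelled here.
    """
    old_parent = parent[pruned]
    if old_parent is None:
        raise ValueError("the root has no branch to prune")
    if target == old_parent:
        raise ValueError("the old parent node is removed by pruning")
    # The re-grafting branch must not lie inside the pruned subtree.
    node = target
    while node is not None:
        if node == pruned:
            raise ValueError("target lies inside the pruned subtree")
        node = parent[node]

    new = dict(parent)
    # Pruning removes the old parent: the sibling attaches to the grandparent.
    sibling = next(c for c, p in parent.items()
                   if p == old_parent and c != pruned)
    new[sibling] = parent[old_parent]
    # Re-grafting creates a node on the branch above `target`;
    # the label of the removed parent is re-used for it.
    new[old_parent] = new[target]
    new[target] = old_parent
    new[pruned] = old_parent
    return new


# Tree ((a,b)x,c)r: prune leaf a and re-graft it onto the branch above c.
tree = {"a": "x", "b": "x", "x": "r", "c": "r", "r": None}
print(prune_and_regraft(tree, "a", "c"))
```

Re-grafting onto the branch above the sibling of the pruned node returns the original topology. If `target` is the current root, the re-used node becomes the new root.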

We denote the root node by

The square indicates the new node created by re-grafting. It forms the new root in cases U, D and N. In case S, an existing internal node becomes the new root (empty square overlaid on node

U (‘up’): a prune-and-regraft event on the root branches generates a higher root without changing the topology;

D (‘down’): a prune-and-regraft event on the root branches generates a lower root without changing the topology;

N (‘new’): pruning a branch below the root branches and re-grafting onto the ancestral branch of the root creates a new root, while the old root becomes internal node

S (‘substitute’): pruning a root branch and re-grafting onto a branch in the subtree of

In fact, for the root to change height it must either be shifted (cases U and D) or be replaced (cases N and S). If the root is replaced, it can become an internal node

We denote the probabilities of these events by

Given a coalescent tree of size

Probabilities

This result can also be derived directly by counting ARGs, since

Focusing now on pruning of the root branches, we obtain

In particular, the average number of direct descendants of the root at level

In contrast to

The probabilities

The same is true when the root is only shifted. Thus,

Hence, by subtraction,

Note that the identities (17) and (18), being topological in nature, are also valid for models with variable population size. A related result about the probability that a random recombination event leaves tree height unchanged (

The variation in height

Taking expectations, the average change in height after one of these events is

Let

In this section we calculate the transition probabilities

Let the

Now we calculate the probability conditioned on the value

Taking the ratio of tree counts, we obtain a hypergeometric distribution

Finally, inserting the results (32) and (35) into (31), we obtain

As before, we introduce

Distribution

Now we consider all possible recombination events that change

The contribution for events of type S can be obtained using the symmetry properties of the ARG. In fact, an ARG with a recombination event of type S changing

This result is essentially the transpose of the one shown in

Finally, the transition probability is

This distribution is shown in

Counting ARGs we now determine the fraction of

Hidden recombination events are caused by pruning and re-grafting on the same branch (see

This means that the fraction of hidden recombination events is of the order

The same technique of counting ARGs also yields the fraction of silent recombination events (i.e. events that do not change topology but may change branch lengths). We start by counting events that are silent but not hidden. Given a tree, select a branch for pruning. Then there are exactly two ways of re-grafting: either on the branch immediately above or on the branch immediately below the old parent node of the pruned branch (

Therefore,

Note that the following holds:

An intuitive explanation is the following: for any pruning point, there are two possible ways for re-grafting such that tree topology remains unchanged and there is exactly one way for re-grafting which leads to an increase of tree height. Therefore,

Since the spatial coalescent is a non-Markovian process, it is important to know over which chromosomal distances correlation and statistical dependence among trees persist. Correlation between trees, measured by any well-behaved tree statistic, decreases with distance. An interesting question is how quickly recombination reduces correlation. The answer depends on the particular statistic which is employed to measure correlation. Topology based statistics, such as

The red line is the approximation

We use our above results regarding events of type U, D, N, S and R to give a quantitative answer. The idea is to approximate the correlation length for a statistic by the inverse of the probability of recombination events that have a strong impact on this statistic.
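The approximation can be made concrete with a short sketch. It assumes, as a simplification, that successive recombination events act as independent trials, each having a strong impact on the statistic with the same probability; the number of events up to the first strong one is then geometric with mean equal to the inverse of that probability. The value `p_impact = 0.2` is a hypothetical placeholder, not a result of this paper, and the function names are chosen for illustration.

```python
import random


def correlation_length(p_impact):
    """Approximate correlation length, in number of recombination events,
    as the inverse of the probability of a strongly impacting event."""
    if not 0.0 < p_impact <= 1.0:
        raise ValueError("p_impact must lie in (0, 1]")
    return 1.0 / p_impact


def simulate_run_length(p_impact, rng):
    """Number of recombination events up to and including the first one
    with strong impact, treating events as independent trials."""
    count = 1
    while rng.random() >= p_impact:
        count += 1
    return count


rng = random.Random(2)          # fixed seed, for reproducibility only
p_impact = 0.2                  # hypothetical placeholder value
runs = [simulate_run_length(p_impact, rng) for _ in range(20000)]
mean_run = sum(runs) / len(runs)
print(mean_run, correlation_length(p_impact))
```

Because the spatial coalescent is non-Markovian, the independence assumption is only a rough approximation, in line with the heuristic character of the estimate.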

Events of type U or D change height, but leave the topology unchanged. Events of type R preserve height but alter topology. Events of type N or S may change both height and topology. They also lead to the fastest decay of correlation.

The average number of recombination events before an event of type N or S occurs is the inverse of this probability. This quantity is a rough estimate for the correlation length of tree shape. The numerical values of

To translate this into physical length, we assume that the distance between two consecutive recombination events is exponentially distributed with mean

To estimate the correlation length of

The run-length is longer for more imbalanced trees, but always on the order of a few recombination events (between

We now consider correlation in tree height. Height can change by events U, D, N and S. The average change in height is the same,

Since

For the physical correlation length we have.

For the case

Finally, we briefly comment that linkage disequilibrium and haplotype block size depend strongly on the number and distribution of mutation and recombination events along coalescent trees, i.e. they depend strongly on tree topology and length. Since topology can in practice only be indirectly estimated from polymorphism patterns, not all changes in topology are actually visible for these statistics. The correlation lengths estimated from experimental data will tend to be larger than the theoretical estimates presented here. Assuming that haplotype blocks are mostly delimited by ‘drastic’ recombination events, involving a change of topology, we estimate the size of these haplotype fragments

The average size is then

The class of drastic recombination events that should be considered to determine

We have considered the effect of single recombination events on coalescent tree topology and explicitly determined the probability with which recombination triggers ‘drastic’ changes. We consider a change to be drastic if it leads to a change of tree height or of tree imbalance. These types of events are of practical interest because both have an effect on the pattern of polymorphic sites which are informative for genealogical reconstruction and evolutionary inferences. The primary effect of height change is upon the number of mutations, while a change in tree imbalance primarily affects the mutation site frequency spectrum.

Our results show important qualitative differences for the two types. The average change in height is quite drastic per se (50% of average tree height), while the average change in imbalance is quite mild, with large jumps occurring only very rarely. Our results hold for the standard neutral model, i.e. a model with constant population size and without substructure. As such, our results may serve as the analytical reference case for constructing formal tests of the neutral evolution hypothesis. For instance, the probabilities of height or topology change are markedly altered in the presence of selective sweeps, i.e. the fast fixation of a mutant allele due to positive selection. Recombination close to the sweep site, where tree height is severely reduced

Population substructure is another important case of deviation from the standard neutral model. Restricted gene flow between sub-populations strongly affects the transition probabilities of root imbalance, but less the distribution of height change. A more detailed discussion of the impact of these evolutionary scenarios upon a test statistic of the neutral evolution hypothesis is given in

We have derived a number of further results which shed more light on the details and consequences of recombination. We analysed the correlation length between trees on a recombining chromosome and showed that topological correlation is generally longer-ranging than correlation in tree height. Still, for both types very few recombination events – on the order of ten – are sufficient to unlink the genealogical histories of two genomic fragments, given standard neutral conditions. The calculations also make clear that correlation length (number of recombinations) scales logarithmically in

It is perhaps surprising to see that a considerable fraction of recombination events remains hidden. Even for large sample sizes, about

Analyzing root imbalance in more detail, we found that the distribution of

Some of the quantities studied here involve counting problems of ancestral recombination graphs with a single recombination event. These problems are related to counting problems of phylogenetic networks


We would like to thank Jeff Thorne and two anonymous reviewers for very constructive comments, and A. Klassmann and S. Ramos-Onsins for numerous discussions.