The authors have declared that no competing interests exist.

In this paper we describe how to efficiently record the entire genetic history of a population in forwards-time, individual-based population genetics simulations with arbitrary breeding models, population structure and demography. This approach dramatically reduces the computational burden of tracking individual genomes by allowing us to simulate only those loci that may affect reproduction (those having non-neutral variants). The genetic history of the population is recorded as a succinct tree sequence as introduced in the software package msprime, on which neutral mutations can be quickly placed afterwards. Recording the results of each breeding event requires storage that grows linearly with time, but there is a great deal of redundancy in this information. We solve this storage problem by providing an algorithm to quickly ‘simplify’ a tree sequence by removing this irrelevant history for a given set of genomes. By periodically simplifying the history with respect to the extant population, we show that the total storage space required is modest and overall large efficiency gains can be made over classical forward-time simulations. We implement a general-purpose framework for recording and simplifying genealogical data, which can be used to make simulations of any population model more efficient. We modify two popular forwards-time simulation frameworks to use this new approach and observe efficiency gains in large, whole-genome simulations of one to two orders of magnitude. In addition to speed, our method for recording pedigrees has several advantages: (1) All marginal genealogies of the simulated individuals are recorded, rather than just genotypes. (2) A population of

Sexually reproducing organisms are related to the others in their species by the complex web of parent-offspring relationships that constitute the pedigree. In this paper, we describe a way to record all of these relationships, as well as how genetic material is passed down through the pedigree, during a forwards-time population genetic simulation. To make effective use of this information, we describe both efficient storage methods for this embellished pedigree and a way to remove all information that is irrelevant to the genetic history of a given set of individuals, which dramatically reduces the required amount of storage space. Storing this information allows us to produce whole-genome sequence from simulations of large populations in which we have not explicitly recorded new genomic mutations; we find that such simulations run up to 50 times faster than simulations forced to explicitly carry along that information.

This is a

Since the 1980s, coalescent theory has enabled computer simulation of population genetics models, with results identical to those that would be produced by large, randomly mating populations over long periods of time, without actually requiring simulation of so many generations or meioses. Coalescent theory thus had three transformative effects on population genetics: first, giving researchers better conceptual tools to describe ^{4} diploids in stable populations and populations growing up to around 5 × 10^{5} individuals, using the output to explore the relationship between the genotype/phenotype model and GWAS outcomes.

Modern computing power easily allows simulations of birth, death and reproduction in a population having even hundreds of millions of individuals. However, if our interest lies in the resulting genetic patterns of variation—and often, the point of such simulations is to compare to real data—then such simulations must record each individual’s genome. As samples of most species’ genomes harbor tens or hundreds of millions of variant sites, carrying full genotypes for even modest numbers of individuals through a simulation can quickly become prohibitive. To make matters worse, a population of size

However, it is thought that most genetic variation is selectively neutral (or nearly so). By definition, neutral alleles carried by individuals in a population do not affect the population process. For this reason, if one records the entire genealogical history of a population over the course of a simulation, simply laying down neutral mutations on top of that history afterwards is equivalent to having generated them during the simulation: it does not matter if we generate each generation’s mutations during the simulation, or afterwards. To add mutations after the fact, we need to know the genealogical trees relating all sampled individuals at each position along the genome. Combined with ancestral genotypes and the origins of new mutations, these trees completely specify the genomic sequence of any individual in the population at any time. To obtain this information, we record from forward simulation the

The idea of storing genealogical information to speed up simulations is not new. It was implemented in AnA-FiTS [

In this paper, we describe a storage method for

The strategy described above is only of interest if it is computationally feasible. Therefore, we begin by benchmarking the performance improvement achieved by this method, implemented using the forwards-time simulation library

To measure the performance gains from recording the pedigree we ran simulations both with and without recording. (Although we record more than just the parent–offspring relationships of the pedigree, for brevity we refer to the method as “pedigree recording”). All simulations used

Deleterious mutations were introduced at rate

Pedigree tracking dramatically reduced runtimes, as shown in

Line color represents different diploid population sizes (^{4} timed out for region sizes larger than 10^{3}.

Each line shows the ratio of total run times of standard simulations to those of simulations with pedigree recording. Data points are taken from

In our implementation, simulations with pedigree recording used substantially more RAM than simple forward simulations (see

We now explain what we actually did to achieve this 50× speedup. The “pedigree recording” simulations above recorded information about each new individual in a collection of tables that together define a

The leftmost panels show the tree sequence pictorially in two different ways: (top) a sequence of tree topologies; the first tree extends from genomic position 0 to 5, and the second from 5 to 10; and (bottom) the edges that define these topologies, displayed over their corresponding genomic segment (for instance, the edge from node 2 to node 4 is present only on the interval from 0 to 5). The remaining panels show the specific encoding of this tree sequence in the four tables (nodes, edges, sites and mutations).

The

The

Recovering the sequence of trees from this information is straightforward: each point along the genome at which the tree topology changes is accompanied by the end of some edges and the beginning of others. Since each edge records the genomic interval over which a given node inherits from a particular ancestor, to construct the tree at a certain point in the genome we need only retrieve all edges overlapping that point and construct the corresponding tree. To modify the tree to reflect the genealogy at a nearby location, we simply remove those edges whose intervals do not overlap that location, and add those new edges whose intervals do. Incidentally, this property that edges naturally encode
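The retrieval step described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the edge table `EDGES` is hypothetical, edges are plain `(left, right, parent, child)` tuples, and intervals are assumed to be half-open. The function names `tree_at` and `shift_tree` are introduced here for illustration only.

```python
# Hypothetical edge table: (left, right, parent, child), half-open intervals.
EDGES = [
    (0, 10, 3, 0),
    (0, 10, 3, 1),
    (0, 5, 4, 2),
    (5, 10, 5, 2),
    (0, 5, 4, 3),
    (5, 10, 5, 3),
]


def tree_at(edges, position):
    """Build the marginal tree at `position` as a {child: parent} mapping
    by retrieving every edge whose interval overlaps that position."""
    parent_of = {}
    for left, right, parent, child in edges:
        if left <= position < right:
            parent_of[child] = parent
    return parent_of


def shift_tree(parent_of, edges, position):
    """Update an existing tree in place so that it reflects `position`:
    drop edges that no longer overlap it, then add those that do."""
    for left, right, parent, child in edges:
        if not (left <= position < right) and parent_of.get(child) == parent:
            del parent_of[child]
    for left, right, parent, child in edges:
        if left <= position < right:
            parent_of[child] = parent
    return parent_of


first = tree_at(EDGES, 2)             # tree covering positions 0 to 5
second = shift_tree(dict(first), EDGES, 7)  # tree covering positions 5 to 10
```

This sketch scans the whole table on every call; an efficient implementation would instead index edges by their left and right endpoints so that only the edges that change need to be touched.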

Given the topology defined by the nodes and edges,

This encoding of a sequence of trees and accompanying mutational information is very concise. To illustrate this, we used _{e} = 10^{4} and per-base mutation and recombination rates of 10^{−8} per generation. This resulted in about 1 million distinct marginal trees and 1.1 million infinite-sites mutations. The HDF5 file encoding the node, edge, site and mutation tables (as described above) for this simulation consumed 157MiB of storage space. Using the

Given a set of node and edge tables as described above, there are only two requirements that ensure the tables describe a valid tree sequence. These are:

Offspring must be born after their parents (and hence, no loops).

The set of intervals on which each individual is a child must be disjoint.

A pair of node and edge tables that satisfy these two requirements is guaranteed to uniquely describe at each point on the genome a collection of directed, acyclic graphs—in other words, a forest of trees. For some applications it is necessary to check that at every point there is only a
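The two requirements listed above can be checked directly on the tables. The following is a minimal sketch under stated assumptions: node times are stored as "time ago" (so parents have strictly larger times than their offspring), edges are `(left, right, parent, child)` tuples with half-open intervals, and the function name `check_tables` is hypothetical.

```python
from collections import defaultdict


def check_tables(node_times, edges):
    """Return a list of human-readable violations of the two requirements.

    node_times[u] is the birth time of node u, measured as time ago
    (an assumption of this sketch). An empty list means both hold.
    """
    problems = []
    intervals = defaultdict(list)
    for left, right, parent, child in edges:
        # Requirement 1: offspring must be born after their parents.
        if not node_times[parent] > node_times[child]:
            problems.append(f"node {child} is not younger than parent {parent}")
        intervals[child].append((left, right))
    # Requirement 2: the intervals on which a node is a child are disjoint.
    for child, spans in intervals.items():
        spans.sort()
        for (_, prev_right), (next_left, _) in zip(spans, spans[1:]):
            if next_left < prev_right:
                problems.append(f"node {child} has overlapping child intervals")
    return problems


times = [0.0, 0.0, 1.0, 2.0]
good = [(0, 10, 2, 0), (0, 5, 2, 1), (5, 10, 3, 1), (0, 10, 3, 2)]
bad_order = [(0, 10, 0, 2)]                  # parent younger than child
bad_overlap = [(0, 6, 2, 1), (5, 10, 3, 1)]  # node 1 has two parents on [5, 6)
```

Sorting each node's intervals makes the disjointness test a comparison of neighbouring pairs only, so the check costs little more than one pass over the edge table.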

The facilities for working with succinct tree sequences are implemented as part of the

The Tables API is primarily designed to facilitate efficient interchange of data between programs or between different modules of the same program. We adopted a ‘columnar’ design, where all the values for a particular column are stored in adjacent memory locations. There are many advantages to columnar storage—for example, since adjacent values in memory are from the same column, they tend to compress well, and suitable encodings can be chosen on a per-column basis [

The

To demonstrate the flexibility provided by the Tables API and provide an implementation that decouples forward simulation internals from transfer of data to

To record the genealogical history of a forwards time simulation, we need to record two things for each new chromosome: the birth time; and the endpoints and parental IDs of each distinctly inherited segment. These are naturally stored as the

We use

Simulates a randomly mating population of

[Initialisation.] Set

[Generation loop head: new node.] Set

[Choose parents.] Set

[Record edges.] Call

[Individual loop.] Set

[Simplify.] Call

[Generation loop.] Set

We begin in W1 by creating new node and edge tables, and setting our population ^{th} individual (with node ID _{j}) by creating a new node with birth time _{a} and _{b}, and choose a chromosomal breakpoint _{a}, and another recording that the parent of _{b}. Step W5 then iterates these steps for each of the
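The recording steps can be illustrated with a small, self-contained Python sketch. It is not Algorithm W itself: it omits the simplification step, uses plain lists in place of the node and edge tables, and every name in it (`record_wright_fisher`, `node_times`, `edges`) is introduced here for illustration. Birth times are stored as time ago, each new chromosome picks two parents uniformly at random, and a single breakpoint is chosen uniformly along the genome.

```python
import random


def record_wright_fisher(n, generations, length=1.0, seed=1):
    """Record the genealogy of n haploid genomes over a number of generations.

    Returns (node_times, edges, alive): node_times[u] is the birth time of
    node u as time ago, edges holds (left, right, parent, child) rows, and
    alive lists the node IDs of the final generation.
    """
    rng = random.Random(seed)
    node_times = []
    edges = []
    # Founders are born `generations` time units ago.
    alive = list(range(n))
    node_times.extend([generations] * n)
    for t in range(generations - 1, -1, -1):
        offspring = []
        for _ in range(n):
            child = len(node_times)
            node_times.append(t)              # new row in the node table
            a, b = rng.choice(alive), rng.choice(alive)
            x = rng.uniform(0.0, length)      # recombination breakpoint
            if x > 0.0:
                edges.append((0.0, x, a, child))     # left of x comes from a
            if x < length:
                edges.append((x, length, b, child))  # right of x comes from b
            offspring.append(child)
        alive = offspring
    return node_times, edges, alive


node_times, edges, alive = record_wright_fisher(n=5, generations=4, seed=42)
```

Because rows are only ever appended, the recording code needs no knowledge of earlier generations beyond the IDs of the current parents. When both parents happen to be the same node, this sketch emits two adjacent edges with the same parent; that is valid but redundant, and is the kind of information a simplification pass would squash.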

This algorithm records only topological information about the simulated genealogies, but it is straightforward to add mutational information. Mutations that occur during the simulation can be recorded by simply storing the node in which they first occur, the derived state, and (if not already present) the genomic position of the site at which they occur. This allows selected mutations, which the forwards-time simulation must generate, to be recorded in the tree sequence. Neutral mutations can be generated after the simulation has completed, thus avoiding the cost of generating the many mutations that are lost from the population. This is straightforward to do because we have access to the marginal genealogies.
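One way to place neutral mutations after the fact is to work edge by edge, as in the sketch below. It assumes an infinite-sites model with a constant rate per unit of genome length per unit of time, node times stored as time ago, and edges as `(left, right, parent, child)` tuples. The helper names `poisson` and `add_neutral_mutations` are hypothetical, and the Poisson sampler is a simple textbook method that is only suitable for small means.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate by multiplying uniforms (small means only)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_neutral_mutations(node_times, edges, rate, seed=1):
    """Return a sorted list of (position, node) pairs for new mutations.

    On each edge the expected number of mutations is
    rate * (genomic span) * (branch length in time); each mutation is
    placed uniformly on the edge's interval and assigned to the child
    node, the first node to carry it.
    """
    rng = random.Random(seed)
    mutations = []
    for left, right, parent, child in edges:
        branch = node_times[parent] - node_times[child]
        mean = rate * (right - left) * branch
        for _ in range(poisson(rng, mean)):
            mutations.append((rng.uniform(left, right), child))
    mutations.sort()
    return mutations


times = [0, 0, 1, 2]
edges = [(0.0, 1.0, 2, 0), (0.0, 0.5, 2, 1), (0.5, 1.0, 3, 1), (0.0, 1.0, 3, 2)]
mutations = add_neutral_mutations(times, edges, rate=5.0, seed=3)
```

Only branches that survive in the recorded history receive mutations, which is why this step never spends effort on variants that were lost during the simulation.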

Since a tree sequence can record the history of genetic inheritance in any situation (requiring only unambiguous inheritance and no time travel), any individual-based population genetics simulator can maintain a tree sequence with only a little bookkeeping. We have furthermore provided several tools to minimize this bookkeeping, only requiring one-way output at the birth of each new individual. Concretely, to record a tree sequence, including mutations, a simulator must record for each new genome (so, twice for each new diploid individual):

the birth time of the genome in the Node Table,

the segments the genome inherits from its parental genomes in the Edge Table,

the locations of any new mutations in the Site Table,

and the derived state of these new mutations in the Mutation Table (as well as the identity of this genome the mutations appeared in).

Each of these can be simply appended to the ends of the respective tables. Besides this, simplification should be run every once in a while (e.g., every 100 generations). Before simplification, time in the Node Table must be translated to “time ago” if it is not already. (This was avoided in Algorithm W since there

In other words, to record a tree sequence, a simulation needs only to know (a) which genomes recombined to produce each new genome, and how; (b) the locations and results of any new mutations on each genome; and (c) the identities of every currently alive individual at each time simplification occurs.

Applications may also want to store more information not fitting into an existing column of the tables, such as the selection coefficient of a mutation, or the sex of an individual. This (and, indeed, arbitrary information) can be stored in the

It is desirable for many reasons to remove redundant information from a tree sequence. To formalize this: suppose that we are only interested in a subset of the nodes of a tree sequence (which we refer to as our ‘samples’), and wish to reduce this input tree sequence to the smallest one that still completely describes the history of the specified samples, having the following properties:

All marginal trees must match the subtree of the corresponding tree in the input tree sequence that is induced by the samples.

Within the marginal trees, all non-sample vertices must have at least two children (i.e., unary tree vertices are removed).

Any nodes and edges not ancestral to any of the sampled nodes are removed.

There are no adjacent redundant edges, i.e., pairs of edges (

Simplification is not only essential for keeping the information recorded by forwards simulation manageable, but is also useful for extracting subsets of a tree sequence representing a very large dataset.
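The first three properties can be illustrated on a single marginal tree. The sketch below is deliberately limited: it handles one tree given as a `{child: parent}` mapping, not a whole tree sequence, and it does not address merging of adjacent edges across intervals. The function name `simplify_tree` and the example tree are hypothetical.

```python
def simplify_tree(parent_of, samples):
    """Reduce one tree, given as {child: parent}, to the history of `samples`.

    Nodes not ancestral to any sample are dropped, and non-sample nodes
    left with a single child (unary nodes) are skipped over.
    """
    samples = set(samples)
    # Record, for every ancestor of a sample, which of its children lie
    # on a path leading down to some sample.
    kept_children = {}
    for sample in samples:
        node = sample
        while node in parent_of:
            kept_children.setdefault(parent_of[node], set()).add(node)
            node = parent_of[node]

    def retained(node):
        # Samples always stay; other nodes stay only if they coalesce
        # at least two lineages.
        return node in samples or len(kept_children.get(node, ())) >= 2

    simplified = {}
    for node in samples | set(kept_children):
        if not retained(node):
            continue
        # Walk up to the nearest retained ancestor, if there is one.
        parent = parent_of.get(node)
        while parent is not None and not retained(parent):
            parent = parent_of.get(parent)
        if parent is not None:
            simplified[node] = parent
    return simplified


# Hypothetical tree: 4 is the parent of 0 and 1, 5 of 4 and 2, 6 of 5 and 3.
TREE = {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6}
```

With samples 0 and 2, node 4 becomes unary and is removed, so both samples attach directly to node 5; nodes 3 and 6 are dropped because nothing below the most recent common ancestor of the samples depends on them.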

We implement simplification by starting at the end of the simulation, and moving back up through history, recording in the new tree sequence only that information necessary to construct the tree sequence of the specified individuals. This process of tracing ancestry back through time in a pedigree was the motivation for Hudson’s coalescent simulation algorithm [

Conceptually, this works by (a) beginning by painting the chromosome in each sample a distinct color; (b) moving back through history, copying the colors of each chromosome to the portions of its parental chromosomes from which it was inherited; (c) each time we would paint two colors in the same spot (a coalescence), record that information as an edge and instead paint a brand-new color; and (d) once all colors have coalesced on a given segment, stop propagating it. This “paint pot” description misses some details—for instance, we must ensure that all coalescing segments in a given individual are assigned the

Following the “paint pot” description in the text, we begin by coloring J and K’s genomes in red and blue respectively, then trace how these colors were inherited back up through the pedigree until they coalesce. To aid in this, the smaller colored chromosomes on either side of each solid arrow show the bits inherited from each of the two parental chromosomes, with genomic position 0.0 on the bottom and 1.0 at the top. Each time a red and a blue segment overlap, a coalescence occurs, two edges are output, and we stop propagating that segment. For instance, both J and K inherit from H between 0.5 and 0.9, which resulted in the first two edges of the simplified table of

More concretely, the algorithm works by moving back through time, processing each parent in the input tree sequence in chronological order. The main state of the algorithm at each point in time is a set of ancestral lineages, and each lineage is a linked list of ancestral segments. An ancestral segment (

Any simulation scheme that records data into tables, as Algorithm W does, has its genealogical history available at any time as a tree sequence. This has two additional advantages: First, simplification can be run periodically through the simulation, if we take the set of samples to be the entire currently alive population. This is important in practice as it keeps memory usage from growing linearly (and quickly) with time. Second, the simulation can be ^{3} generations. Thus, there is a memory-versus-speed tradeoff—simplifying more often would keep fewer extinct nodes and edges in memory.

Figs

First: how much memory do simplified tree sequences require? Consider a simulation of a Wright–Fisher population of

If ^{2}) space to store the complete history of the simulation, but only

What if

What about mutations? Forwards-time generation of infinite-sites mutations with total mutation rate per generation ^{4} and

Since each mutation is stored only as a single row in the mutation table, and at most one row in the site table, the space required for

How does the computation

In this paper, we have shown that storing pedigrees and associated recombination events in a forwards-time simulation not only results in having available a great deal more information about the simulated population, but also can speed up the simulation by orders of magnitude. To make this feasible, we have described how to efficiently store this information in numerical tables, and have described a fundamental algorithm for simplification of tree sequences. Conceptually, recording of genealogical and recombination events can happen independently of the details of simulation; for this reason, we provide a well-defined and well-tested API in Python for use in other code bases (a C API is also planned).

The tree sequences produced by default by this method are very compact, storing genotype

Another attractive feature of this set of tools is that it makes it easy to incorporate

The methods described here for efficiently storing tree sequences may prove useful in other fields. We have focused on the interpretation of tree sequences as the outcome of the process of recombination, but in principle, we can efficiently encode any sequence of trees which differ by subtree-prune-and-regraft operations. Since each such operation requires a constant amount of space to encode, the total space required is

In this article, we applied our methods for storing trees to the problem of pedigree recording in a forward-time simulation. However, the method applies to any simulation scheme generating nodes and edges. For example, one could use the methods described here to generate succinct tree sequences under coalescent processes not currently implemented in

Another application of our methods would be the case of simulating coalescent histories conditional on known pedigrees. The standard description of the Wright-Fisher coalescent averages over pedigrees. However, conditional on a realized pedigree, the distribution of coalescent times in the recent past differs from that of the unconditional coalescent [

In preparing this manuscript, we debated a number of possible terms for the embellished pedigree, i.e., the “pedigree with ancestral recombination information”, the object through which each tree of a tree sequence is threaded. Etymological consensus [

We implemented simulations and the connection to

Code for all simulations and figures is available at

The supplementary text contains (A) benchmarking of run time and memory usage on simulations without selection; (B) benchmarking of memory usage with selection; (C) an analysis of the effect of simplification interval on run times; (D) details for the


Thanks to Gil McVean, Jared Galloway, Brad Shaffer, and Evan McCartney–Melstad for useful discussions.