AH, OFC, and MHS conceived and designed the study. AH and MHS analyzed the data. AH developed coal-HMM methodology and software. OFC performed simulation studies. OFC and MHS contributed to coal-HMM methodology. TM wrote simulation software. TM and MHS prepared alignments for data analysis. AH, OFC, and MHS wrote the paper.

The authors have declared that no competing interests exist.

The genealogical relationship of human, chimpanzee, and gorilla varies along the genome. We develop a hidden Markov model (HMM) that incorporates this variation and relate the model parameters to population genetics quantities such as speciation times and ancestral population sizes. Our HMM is an analytically tractable approximation to the coalescent process with recombination, and in simulations we see no apparent bias in the HMM estimates. We apply the HMM to four autosomal contiguous human–chimp–gorilla–orangutan alignments comprising a total of 1.9 million base pairs. We find a very recent speciation time of human–chimp (4.1 ± 0.4 million years), and fairly large ancestral effective population sizes (65,000 ± 30,000 for the human–chimp ancestor and 45,000 ± 10,000 for the human–chimp–gorilla ancestor). Furthermore, around 50% of the human genome coalesces with chimpanzee after speciation with gorilla. We also consider 250,000 base pairs of X-chromosome alignments and find an effective population size much smaller than 75% of the autosomal effective population sizes. Finally, we find that the rate of transitions between different genealogies correlates well with the region-wide present-day human recombination rate, but does not correlate with the fine-scale recombination rates and recombination hot spots, suggesting that the latter are evolutionarily transient.

The recent evolutionary history of the human species can be investigated by comparative approaches using the genomes of the great apes: chimpanzee, gorilla, and orangutan [

Comparative analyses of multiple alignments of small fragments of human, chimpanzee, gorilla, and orangutan sequence have revealed that the human genome is more similar to the gorilla genome than to the chimpanzee genome for a considerable fraction of single genes [ N_{HC} of 2–10 times the human present-day effective population size N_{H} = 10,000 [

Top: Genealogical relationship of human, chimpanzee, gorilla, and orangutan. Speciation times are denoted τ_{1}, τ_{1} + τ_{2}, and τ_{1} + τ_{2} + τ_{3}. Population sizes of human, chimpanzee, and gorilla are denoted N_{H}, N_{C}, and N_{G}, while the HC and HCG ancestral population sizes are denoted N_{HC} and N_{HCG}.

Bottom: Each of the four hidden states in the coal-HMM corresponds to a particular phylogenetic tree. In state HC1, human and chimpanzee coalesce before speciation of human, chimpanzee, and gorilla, i.e., before τ_{1} + τ_{2}. In states HC2, HG, and CG, human, chimpanzee, and gorilla coalesce after speciation of the three species, i.e., after τ_{1} + τ_{2}. In HC2, the human and chimpanzee lineages coalesce first, and then the HC lineage coalesces with gorilla. In state HG, human and gorilla coalesce first, and in state CG, chimpanzee and gorilla coalesce first. The hidden phylogenetic states cannot be observed from present-day sequence data, but they can be decoded using the coal-HMM methodology.
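The decoding step mentioned in the caption can be illustrated with the standard forward-backward recursion for HMMs. The sketch below is a generic illustration, not the coal-HMM implementation: the transition, start, and emission probabilities are made-up toy values, and the four site patterns ("-", "hc", "hg", "cg") are a hypothetical simplification of the phylogenetic emission model, which is not reproduced here.

```python
STATES = ["HC1", "HC2", "HG", "CG"]

# Hypothetical start and transition probabilities (each row sums to 1).
START = [0.70, 0.10, 0.10, 0.10]
TRANS = [
    [0.97, 0.01, 0.01, 0.01],
    [0.05, 0.91, 0.02, 0.02],
    [0.05, 0.02, 0.91, 0.02],
    [0.05, 0.02, 0.02, 0.91],
]

# Toy emission probabilities P(site pattern | state), in STATES order.
# "-" is an uninformative site; "hc", "hg", "cg" mark a derived base
# shared by that pair of species. These numbers are assumptions.
EMIT = {
    "-":  [0.90, 0.90, 0.90, 0.90],
    "hc": [0.08, 0.06, 0.02, 0.02],
    "hg": [0.01, 0.02, 0.06, 0.02],
    "cg": [0.01, 0.02, 0.02, 0.06],
}


def posterior_decode(sites):
    """Return P(state | all sites) for every site via forward-backward."""
    n, k = len(sites), len(STATES)

    # Forward pass, rescaled at every site to avoid numerical underflow.
    fwd = []
    for t, obs in enumerate(sites):
        row = []
        for j in range(k):
            if t == 0:
                prior = START[j]
            else:
                prior = sum(fwd[t - 1][i] * TRANS[i][j] for i in range(k))
            row.append(prior * EMIT[obs][j])
        total = sum(row)
        fwd.append([x / total for x in row])

    # Backward pass, also rescaled.
    bwd = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        nxt = sites[t + 1]
        row = [
            sum(TRANS[i][j] * EMIT[nxt][j] * bwd[t + 1][j] for j in range(k))
            for i in range(k)
        ]
        total = sum(row)
        bwd[t] = [x / total for x in row]

    # Posterior is proportional to forward * backward; normalise per site.
    posteriors = []
    for t in range(n):
        row = [fwd[t][j] * bwd[t][j] for j in range(k)]
        total = sum(row)
        posteriors.append([x / total for x in row])
    return posteriors


# A run of five adjacent "hg" sites inside otherwise uninformative sequence.
sites = ["-"] * 30 + ["hg"] * 5 + ["-"] * 30
post = posterior_decode(sites)
print({s: round(p, 3) for s, p in zip(STATES, post[32])})
```

With these toy values, a cluster of adjacent sites supporting the human-gorilla grouping pulls the posterior toward state HG inside the cluster, while isolated sites far from it are dominated by the prior preference for HC1.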

Whole genome sequences of gorilla and orangutan will soon supplement the already available whole genome sequences of human and chimpanzee [

In this paper we describe a hidden Markov model (HMM) that allows the presence of different genealogies along large multiple alignments. The hidden states are different possible genealogies (labeled HC1, HC2, HG, and CG in N_{HC} and N_{HCG}, and speciation times τ_{1} and τ_{2} (see

The coal-HMM divides a multiple alignment into four types of segments corresponding to the four phylogenetic states HC1, HC2, HG, and CG. The probability of making a transition from state HC1 to any of the other states is

A visual impression of the preferred phylogenetic state along the alignment is obtained by dividing the alignment sites according to how they partition the species (

Divergent Sites Provide Information about Genealogy

(From bottom to top) Analysis of the first 100 kb from Target 1.

(Bottom) Site information without outgroup: sites shared by HC in red, by HG in blue, and by CG in green (compare first, third, and last columns in

(Second from bottom) Site information with outgroup: sites strongly supporting states HC1 or HC2 in red, HG in blue, and CG in green (compare second, third, and last columns in

(Third from bottom) Posterior probabilities: Coloring as in second from bottom, except that state HC2 is dark red.

(Top) Fine-scale recombination rate estimates (log scale). The vertical lines mark subdivisions of the multiple alignment caused by deletions of more than 50 base pairs in one species (see “Data” in

The posterior probabilities of the phylogenetic states are shown in the panel second from the top in

Parameter estimates with standard errors for each target are shown in _{HCGO}

Estimates with associated standard errors of the HMM and population genetics parameters for the five targets. Top left plot shows the HMM transition rates, top right plot the genetic divergence times in million years (assuming orangutan divergence 18 Myr ago), bottom left plot the speciation times in million years, and bottom right plot the ancestral effective population sizes, again assuming orangutan divergence of 18 Myr and a generation time of 25 y for all species throughout the HCGO divergence.

The divergence times in

Target 1 was also analyzed after filtering out all putative CpG mutations. This reduces the number of polymorphic sites by 17% and removes relatively more of the sites supporting HG and CG groupings than sites supporting HC grouping (states HC1 and HC2), as expected if some of these sites are hypermutable. However, the removal of putative CpG sites does not change the estimated time in alternative states or effective population sizes and only slightly decreases the estimated HC speciation time. The time spent in the alternative states HC2, HG, and CG is also only slightly affected (

While estimates of the effective population sizes and speciation times do not differ significantly between the four targets, there are large differences in the average number of base pairs and the percentage of the alignment in state HC1 (

Mean Fragment Lengths for Basic State HC1 Correlate with Pedigree Recombination Rate

A coal-HMM analysis of more than 250 kb of X-chromosome sequence data used by [ N_{HCG} for the X chromosome data is close to the expected (see

We observe several adjacent sites that support alternative state HG (blue lines), corresponding to coalescence of human and gorilla before coalescence with chimpanzee.

Studying the genealogical relationship of human, chimpanzee, and gorilla along their genomes makes it possible to assign genealogical relationships to segments of the genome with high posterior probabilities (

Important insights can already be gained from analyzing 1.9 million base pairs from four autosomal segments of the genome, and 0.25 million base pairs from the X chromosome. We consistently find that the speciation time of human and chimpanzee is close to the minimum of the range previously predicted (4 Myr) if we assume a human–orangutan divergence of 18 Myr. If the effective population size N_{HCGO} were large, there would also be large variation in orangutan divergence. However, at a depth of 18 Myr, ancestral segments are very short (a few base pairs), so the variation in divergence times over kilobase-long fragments as studied here is expected to be small. The HCG speciation is estimated to have occurred approximately 2 Myr earlier than the HC speciation (i.e., 6 Myr ago). However, the divergences along the genome of human and chimpanzee are generally much older than 4 Myr, varying between 4 and 9 Myr. This can be explained solely by ancestral population sizes on the order of 50,000 in the HC and HCG ancestors, and one does not need to invoke a gradual speciation process with continued gene flow (introgression) to explain the autosomal data, as also noted by Innan and Watanabe [

Our molecular dating estimates are generally in agreement with a large number of studies using different calibration points; Kumar et al. [

The present implementation of the coal-HMM assumes that, conditional on the genealogy, sites are independent and mutations can be described by a continuous time Markov chain. This assumption is violated for CpG dinucleotides, which are more prone to mutation due to methylation. The assumption is also violated if the mutation rate varies along the alignment, resulting in regions with highly variable sites that are subject to recurrent mutations. Recurrent mutations can be detected by adding a further outgroup [

The 250,000 base pairs aligned on the X chromosome have a larger than expected fraction of base pairs in state HC1 (around 80%) if the long-term effective population size of the X chromosome is three-quarters that of the autosomes. There is strong evidence that a small segment on the X chromosome supports an alternative genealogy (state HG). However, the effective population size is reduced by much more than the expected 25% assuming instant speciation, equal contribution of sexes, and no selection. Non-equal contribution of sexes can at most reduce the effective population size to 50% that of autosomes (if females have much larger variance in reproductive success than males). Prolonged speciation and selection may both explain the discrepancy. The X chromosome, with its hemizygosity in males, is more exposed to selection, and this can be an argument for a different introgression history (in a prolonged speciation process) or for selection generally affecting the X chromosome more in the HC ancestor, thus reducing the effective population size more than on the autosomes. The observed fraction of 80% in state HC1 is expected when the effective population size of the X chromosome is approximately 35% that of the autosomes. Explicit modeling of introgression processes and the natural selection on the X chromosome, together with an extended coal-HMM (that explicitly models the length distribution of segments supporting different genealogies), may provide the means to test among these alternative explanations when more data become available.
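The link between effective population size and the fraction of sites in state HC1 can be checked with a back-of-envelope calculation. The sketch below assumes that two lineages coalesce within the HC ancestor with probability 1 - exp(-t / (2 N g)), where t is the interval between the HC and HCG speciations, N the effective population size, and g the generation time; the 2 N g time scale is the standard diploid coalescent assumption, applied here to the X chromosome simply by lowering N. The inputs are the round figures quoted in the text (t of about 2 Myr, g = 25 y, N = 65,000), not the fitted parameter values, so the output only approximately reproduces the reported numbers.

```python
import math

GENERATION_TIME = 25.0   # years per generation, as assumed in the text
SPLIT_INTERVAL = 2.0e6   # years between HC and HCG speciation (round figure)
N_HC_AUTOSOME = 65_000   # autosomal point estimate for the HC ancestor


def fraction_hc1(n_e, interval=SPLIT_INTERVAL, gen=GENERATION_TIME):
    """Probability that two lineages coalesce inside the HC ancestor."""
    return 1.0 - math.exp(-interval / (2.0 * n_e * gen))


def n_e_for_fraction(frac, interval=SPLIT_INTERVAL, gen=GENERATION_TIME):
    """Effective population size that would give the stated HC1 fraction."""
    return -interval / (2.0 * gen * math.log(1.0 - frac))


for label, ratio in [("autosome", 1.00), ("X at 75%", 0.75), ("X at 35%", 0.35)]:
    frac = fraction_hc1(ratio * N_HC_AUTOSOME)
    print(f"{label}: fraction in HC1 = {frac:.2f}")

print(f"N giving 80% in HC1: {n_e_for_fraction(0.80):.0f}")
```

Under these assumptions the autosomal fraction in HC1 comes out at about 0.46, a 75% X-to-autosome ratio gives about 0.56, and a 35% ratio gives about 0.83. This agrees with the statements that roughly half of the genome coalesces after the speciation with gorilla and that a fraction near 80% requires an X-chromosome effective population size far below three-quarters of the autosomal value.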

Another feature of the human genome that can be explored using the coal-HMM is the evolutionary scale of variation in recombination rate. We observe a correlation between the region-wide recombination rate estimated from pedigrees and the average length of segments in state HC1, consistent with evolutionary conservation of these region-wide recombination rates. However, we see no clear correlation between the fine-scale recombination rate estimated from human polymorphism data and transitions to and from state HC1. We interpret this as additional evidence that recombination hot spots are very transient, since we are analyzing an even shorter time scale (at least half) than that used when comparing human and chimpanzee, where recombination hot spots do not appear to be shared [

When more data become available, the coal-HMM can be extended to investigate more specific hypotheses. Our simulations show that the coal-HMM provides a reasonable approximation to the much more complex coalescent-with-recombination process. Furthermore, we found that adding more parameters in the transition probability matrix did not improve the fit significantly with 1.9 million base pairs. With a 1,000-fold increase in data soon to arrive, a natural extension is to introduce more hidden states in the HMM to provide a more detailed approximation of the different coalescent times. When this is done we expect a better fit of segment sizes to the geometric distribution assumed by the coal-HMM. Importantly, having more data will make it possible to investigate changes in ancestral effective population sizes along the genome, thus making it possible to infer cases of ancestral selection in the HC ancestor and in the HCG ancestor, and opening up a promising field of ancestral species population genetics that can complement analyses based on _{HC} and _{HCG} have changed through time.

Coal-HMMs provide the framework for analyses of genome alignments of human, gorilla, chimpanzee, and orangutan sequences. Coal-HMMs are similar to phylogenetic HMMs [

The transitions between the hidden states are modeled using a Markov chain with transition probability matrix

The stationary distribution of the Markov chain is
_{1}, _{2}, and _{3}, where _{1} is the probability of making a transition between HC2 and HG, _{2} the transition probability between HC2 and CG, and _{3} the transition probability between HG and CG. However, such extended models did not improve the fit significantly with the present amount of available data.
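To make the structure of such a transition matrix concrete, the following sketch builds a four-state matrix and computes its stationary distribution numerically. The parameter names s, u, and v and their values are placeholders assumed here for illustration: s is the probability of moving from HC1 to each alternative state, u the probability of returning from an alternative state to HC1, and v the probability of moving between any two alternative states (the simpler model with a single shared value rather than three separate ones).

```python
STATES = ["HC1", "HC2", "HG", "CG"]


def transition_matrix(s, u, v):
    """Build a 4x4 matrix under the assumed (hypothetical) parameterisation."""
    P = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            if i == 0:
                P[i][j] = s      # HC1 -> alternative state
            elif j == 0:
                P[i][j] = u      # alternative state -> HC1
            else:
                P[i][j] = v      # alternative -> other alternative
        P[i][i] = 1.0 - sum(P[i])  # stay probability makes the row sum to 1
    return P


def stationary(P, iterations=5000):
    """Approximate the stationary distribution by repeated multiplication."""
    dist = [0.25] * 4
    for _ in range(iterations):
        dist = [sum(dist[i] * P[i][j] for i in range(4)) for j in range(4)]
    return dist


P = transition_matrix(s=0.002, u=0.010, v=0.003)
pi = stationary(P)
for name, p in zip(STATES, pi):
    print(f"{name}: {p:.4f}")
```

Under this symmetric structure, balancing the flow out of HC1 (3s per site) against the flow back into it (u per site) gives a stationary share of u / (u + 3s) for HC1, with the remainder split equally among HC2, HG, and CG. The numerical result can be compared against that closed form.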

Let _{1},…,_{L}^{e}_{i}_{i}_{i}_{i}_{A}_{G},π_{C},π_{T}), where _{A} _{T} and _{G} _{C}. We calibrate the rate matrix such that branch length corresponds to expected substitutions per site. The branch lengths are

Left: Coal-HMM and coalescent parameters in state HC1. Right: Coal-HMM and coalescent parameters in state HC2. In both states we assume a molecular clock.

Let _{i}_{-1}, _{i}_{η}_{i}_{-1}, _{i}_{η}_{1},…,_{L}

The likelihood

The derivation of the relation between parameters in the coal-HMM and the coalescent process with recombination is carried out in two steps, corresponding to the left and right illustrations in

First, consider the situation in the left half of _{HC} of two given lineages in the HC ancestor, and the coalescence time _{HCG} of two lineages in the HCG ancestor. The two coalescence times _{HC} and _{HCG} are independent and exponentially distributed with means _{HC} _{HC}_{HCG} _{HCG}_{HC} and _{HCG} are the effective population sizes in the HC and HCG ancestral populations and _{2}/θ_{HC}. From this observation we also obtain

In state HC1, we therefore obtain the following relations between the coal-HMM parameters (_{1},_{2},_{HC},_{HCG}):

Second, we consider states HC2, HG, and CG. The situation is depicted in the right part of T_{HC} > τ_{2}, the coalescence of any two of the human, chimpanzee, and gorilla sequences is equally likely, and therefore we obtain similar equations as below for states HG and CG. Let T_{HCG,3} be the time to coalescence of any two given lineages when three lineages are present in the HCG ancestor. Standard coalescent theory says that T_{HCG,3} is exponentially distributed with mean θ_{HCG}/3. We now obtain the following two equations

Note that there are five parameters _{1},_{2},_{HC},_{HCG}) in equations 1–5. Thus, there is a constraint on the parameters in the coal-HMM. Subtracting equations 4 and 5 we get

To identify the parameters we solve the system of equations 1–3 and 5. From equation 1 we obtain

When reporting the parameters, we scale (

We assume the branch lengths fulfil the molecular clock constraint

In order to validate the coal-HMM approximation to the coalescent process with recombination, we conducted a simulation study. The parameters in the simulated coalescent process with recombination are _{HC} = _{HCG} _{H} = _{C} _{G} _{1} of HC is 4 Myr, speciation time _{2} of HCG is 5.5 Myr, the time to the ancestor of all four sequences is 18 Myr, the generation time of individuals is 25 y, the length of the sequence is 500,000 bp, and the recombination rate is

N_{HC} is estimated with much larger variance than N_{HCG}, in agreement with analyses of the real data.

Summary of Simulation Study

One of the assumptions of the coal-HMM is that the distribution of fragment lengths for each hidden state is geometric. In simulations from the coalescent-with-recombination process, the fragments for the coal-HMM are aggregations of adjacent fragments whose branch lengths correspond to the same state of the coal-HMM.
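The geometric assumption follows directly from the Markov property: if a state is kept with probability p at each site, a segment has length k with probability p^(k-1) (1 - p) and mean length 1 / (1 - p). The sketch below checks this by simulation; the stay probability of 0.99 and the sample size are arbitrary illustrative choices, not estimates from the data.

```python
import random


def geometric_pmf(length, stay):
    """P(segment length == length) when the state is kept with prob. `stay`."""
    return stay ** (length - 1) * (1.0 - stay)


def mean_segment_length(stay):
    """Expected segment length implied by the stay probability."""
    return 1.0 / (1.0 - stay)


def simulate_lengths(stay, n_segments, seed=0):
    """Draw segment lengths by extending each segment site by site."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(n_segments):
        length = 1
        while rng.random() < stay:
            length += 1
        lengths.append(length)
    return lengths


stay = 0.99  # hypothetical per-site probability of remaining in the state
lengths = simulate_lengths(stay, n_segments=5000)
empirical_mean = sum(lengths) / len(lengths)
print(f"expected mean length: {mean_segment_length(stay):.1f}")
print(f"simulated mean length: {empirical_mean:.1f}")
```

Aggregated fragments from a coalescent simulation need not follow this distribution exactly, which is why the fit is checked empirically rather than taken for granted.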

The distribution of fragment lengths is reasonably well approximated by the geometric distribution (blue line). This property is a basic assumption of the coal-HMM.

(Bottom) Site information without outgroup: sites shared by HC in red, by HG in blue, and by CG in green (see first, third, and last columns in

(Second from bottom) True genealogical state in simulations: state HC1 in red, HC2 in dark red, HG in blue, and CG in green.

(Third from bottom) Site information with outgroup: sites supporting state HC1 or HC2 in red, HG in blue, and CG in green (see second, third, and last columns in

(Top) Posterior probabilities for the four states. Coloring as in second from bottom.

Chimpanzee–gorilla–orangutan sequence data from Targets 1 (Chromosome 7), 106 (Chromosome 20), 121 (Chromosome 2), and 122 (Chromosome 20) were obtained from the NIH Intramural Sequencing Center Web site (

We are grateful to Thomas Bataillon, Frank G. Jørgensen, Jeff Thorne, and Marcy Uyenoyama for fruitful discussions at various stages of this project and valuable comments on the manuscript. Peter K. A. Jensen contributed valuable knowledge on the state of the fossil evidence. We would also like to thank associate editor Nick Patterson, Jun Liu, and two anonymous reviewers for various comments and suggestions that helped improve the manuscript.

CG: chimp–gorilla

coal-HMM: coalescent hidden Markov model

HC: human–chimp

HCG: human–chimp–gorilla

HCGO: human–chimp–gorilla–orangutan

HG: human–gorilla

HMM: hidden Markov model

Myr: million years