
Performed the experiments: YM SR CW BG MR RS. Analyzed the data: NE LP NB. Wrote the paper: NE LP NB.

The authors have declared that no competing interests exist.

The diversity of virus populations within single infected hosts presents a major difficulty for the natural immune response as well as for vaccine design and antiviral drug therapy. Recently developed pyrophosphate-based sequencing technologies (pyrosequencing) can be used for quantifying this diversity by ultra-deep sequencing of virus samples. We present computational methods for the analysis of such sequence data and apply these techniques to pyrosequencing data obtained from HIV populations within patients harboring drug-resistant virus strains. Our main result is the estimation of the population structure of the sample from the pyrosequencing reads. This inference is based on a statistical approach to error correction, followed by a combinatorial algorithm for constructing a minimal set of haplotypes that explain the data. Using this set of explaining haplotypes, we apply a statistical model to infer the frequencies of the haplotypes in the population via an expectation–maximization (EM) algorithm. We demonstrate that pyrosequencing reads allow for effective population reconstruction by extensive simulations and by comparison to 165 sequences obtained directly from clonal sequencing of four independent, diverse HIV populations. Thus, pyrosequencing can be used for cost-effective estimation of the structure of virus populations, promising new insights into viral evolutionary dynamics and disease control strategies.

The genetic diversity of viral populations is important for biomedical problems such as disease progression, vaccine design, and drug resistance, yet it is not generally well understood. In this paper, we use pyrosequencing, a novel DNA sequencing technique, to reconstruct viral populations. Pyrosequencing produces DNA sequences, called reads, in numbers much greater than standard DNA sequencing techniques. However, these reads are substantially shorter and more error-prone than those obtained from standard sequencing. Therefore, pyrosequencing data require new methods of analysis. Here, we develop mathematical and statistical tools for reconstructing viral populations from pyrosequencing reads. To this end, we show how to correct errors in the reads and how to assemble them into the different viral strains present in the population. We apply these methods to HIV-1 populations from patients harboring drug-resistant virus and show that our techniques produce results quite close to accepted techniques, at a lower cost and potentially higher resolution.

Pyrosequencing is a novel experimental technique for determining the sequence of DNA bases in a genome

In this paper we address computational issues that arise in applying this technology to the sequencing of an RNA virus sample. Within-host RNA virus populations consist of different haplotypes (or strains) that are evolutionarily related. The population can exhibit a high degree of genetic diversity and is often referred to as a quasispecies, a concept that originally described a mutation-selection balance

Pyrosequencing of a virus population produces many reads, each of which originates from exactly one—but unknown—haplotype in the population. Thus, the central problem is to reconstruct from the read data the set of possible haplotypes that is consistent with the observed reads and to infer the structure of the population, i.e., the relative frequency of each haplotype.

Here we present a computational four-step procedure for inferring the virus population from a set of pyrosequencing reads.

Sequence reads are first aligned to a reference strain, then corrected for errors, and assembled into haplotype candidates. Finally, the relative frequencies of the reconstructed haplotypes are estimated in a maximum-likelihood (ML) fashion. These estimates constitute the inferred virus population.

The alignment step of the proposed procedure is straightforward for the data analyzed here and has been discussed elsewhere

These datasets consist of approximately 5000 to 8000 reads of average length 105 bp sequenced from a 1 kb region of the

Estimating the viral population structure from a set of reads is, in general, an extremely hard computational problem because of the huge number of possible haplotypes. The decoupling of error correction, haplotype reconstruction, and haplotype frequency estimation breaks this problem into three smaller and more manageable tasks, each of which is also of interest in its own right. The presented methods are not restricted to RNA virus populations, but apply whenever a reference genome is available for aligning the reads, the read coverage is sufficient, and the genetic distance between haplotypes is large enough. Clonal data indicates that the typical variation in the HIV

The

The problem of estimating the population structure from sequence reads is similar to assembly of a highly repetitive genome

More generally, the problem of estimating diversity in a population from genome sequence samples has been studied extensively for microbial populations. For example, the spectrum of contig lengths has been used to estimate diversity from shotgun sequencing data

We have developed a computational and statistical procedure for inferring the structure of a diverse virus population from pyrosequencing reads. Our approach comprises four consecutive steps.

Given the high error rate of pyrosequencing, error correction is a necessary starting point for inferring the virus population. The errors in pyrosequencing reads typically take the form of one-base indels along with substitutions and ambiguous bases and occur most often in homopolymeric regions. The reads come with quality scores for each base quantifying the probability that the base is correct.

Error rates with the Roche GS20 system have been estimated as approximately 5 to 10 errors per kb

Due to our assumption that there are no haplotypes with insertions in the population, the insertions in the reads can all be simply corrected through alignment with the reference genome. We do not deal with the problem of alignment here; it is straightforward because our assumption of the existence of a reference genome implies that only pair-wise alignment is necessary. In the remainder of the paper, we assume that a correct alignment is given, leaving about 1 error per kb to correct (or 3 errors per kb if the low-quality reads are not aggressively pruned).

Our approach for error correction resembles the method of

Fixed-width windows (shown as the dashed box) over the aligned reads are considered. Two different types of reads are depicted (light versus dark lines), indicating their origin from two different haplotypes. Genetic differences (indicated by circles and squares) provide the basis for clustering reads into groups representing the respective haplotypes. After clustering, errors (marked as crosses) can be corrected.

The statistical testing procedure consists of two steps. First, every column in the window is tested for over-representation of a mutation using a binomial test. Second, every pair of columns is tested to see if a pair of mutations occurs together more often than would be expected by chance.

Any significant over-representation of a mutation or a pair in a window is regarded as evidence for the reads originating from more than one haplotype. The testing procedure produces an estimate for the number of haplotypes in the window as follows. First, all single mutations are tested for significance; each significant mutation gives evidence for another haplotype in the window. Next, all pairs of mutations are tested; any significant pair is evidence for another haplotype. However, this process can over-count the number of haplotypes in the window in certain cases, namely if two mutations are significant both by themselves and as a pair. In this case, we correct for the over-count.
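To illustrate, the pair test can be sketched as a one-sided Fisher's exact test computed from the hypergeometric distribution. This is our own minimal sketch, not the authors' implementation; the 2×2 table layout (reads counted by presence or absence of each of the two mutations) is an assumption.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]],
    where a = reads carrying both mutations, b and c = reads carrying
    only the first or only the second, d = reads carrying neither.
    Returns P(co-occurrence count >= a) under independence."""
    row1, col1, n = a + b, a + c, a + b + c + d
    # Hypergeometric upper tail: math.comb returns 0 when k > n,
    # so impossible configurations contribute nothing to the sum.
    return sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1)) / comb(n, row1)
```

A perfect co-occurrence of two rare mutations (say, ten reads carrying both out of a hundred) yields a vanishingly small p-value, while the same marginal counts scattered independently over the reads do not.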

We then separate the reads into

Input: A window of aligned reads

Output: The error-corrected reads in the window

Procedure:

1. Find all candidate mutations and pairs of mutations and test for over-representation.

2. Count the number of non-redundant mutations and pairs that are significant. This is the estimated number of haplotypes in the window.

3. Cluster the reads in the window into this number of groups.

4. Output the corrected reads.

Applying a parsimony principle, the algorithm finds the smallest number of haplotypes consistent with the observed mutations in the window.

The error correction procedure can lead to uncorrected errors or miscorrections via false positives and false negatives (leading to over- or underestimation of the number of haplotypes in a window) or via misclustering.

False positives arise when an error is mistaken for a significant variant; they result from setting the assumed error rate too low or the significance level too high, or from highly correlated errors. Misclustering can happen when errors occur frequently enough on a single read to make that read appear closer to an incorrect haplotype; this becomes more likely as the window size grows and more reads overlap the window only partially.

An analysis of the false negative rate gives an idea of the theoretical resolution of pyrosequencing. False negatives arise when a true variant tests as non-significant and thus is erased. If the input data were error-free, this would be the only source of mistaken corrections and would happen by eliminating rare variants. Given an error rate of 2.5 errors per kb, the calculation in the

Our approach to haplotype reconstruction rests on two basic principles. First, the haplotypes in the population should not exhibit characteristics that are not present in the set of reads; that is, every haplotype in the population should be realizable as an overlapping series of reads. Second, the population should explain as many reads as possible with as few haplotypes as possible.

We assume a set R of error-corrected reads aligned to a reference genome; let R_{irred} denote the subset of irredundant reads. From R_{irred} we construct a directed graph, the read graph G_{R}. There is an edge from an irredundant read r_{1} to an irredundant read r_{2} if

r_{1} starts before r_{2} in the genome,

r_{1} and r_{2} agree on their (non-empty) overlap, and

there would not be a path in G_{R} from r_{1} to r_{2} without this edge.

Finally, edges are added from the source to every read without a predecessor, and from every read without a successor to the sink.

A path in the read graph from the source to the sink corresponds to a haplotype that is completely consistent with the read set R.
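As an illustration of this construction, the following sketch (ours, with reads represented simply as (start, sequence) pairs and redundancy removal assumed already done) builds the edges of a read graph; the transitive-reduction condition is approximated by discarding edges implied by two-step paths.

```python
def build_read_graph(reads):
    """reads: list of (start, sequence) pairs, assumed irredundant and
    sorted by start position. Returns edges (i, j) of the read graph:
    read i starts first, the two reads agree on a non-empty overlap,
    and the edge is not implied by a two-step path."""
    def compatible(i, j):
        s1, q1 = reads[i]
        s2, q2 = reads[j]
        if not (s1 < s2 < s1 + len(q1)):   # i must start first and overlap j
            return False
        k = s1 + len(q1) - s2              # length of the overlap
        return q1[-k:] == q2[:k]           # overlap must agree exactly
    n = len(reads)
    direct = [[compatible(i, j) for j in range(n)] for i in range(n)]
    return [(i, j) for i in range(n) for j in range(n)
            if direct[i][j]
            and not any(direct[i][k] and direct[k][j] for k in range(n))]
```

For instance, three reads tiling a genome with agreeing one-base overlaps produce the chain of edges 0→1→2, while a read pair whose overlap disagrees produces no edge.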

For example, in

A simplified genome of length

Displayed are three read graphs of 1000 reads each derived from populations of 5 haplotypes at 3% (A), 5% (B), and 7% diversity (C). The bottom five lines in the graph correspond to reads which match the five haplotypes uniquely; the top line in subfigures (A) and (B) contains reads which match several haplotypes. In each subfigure, the reads are colored according to a single chain decomposition.

We say that a set of haplotypes is an explaining set for the reads R if every read in R is consistent with at least one haplotype in the set.

The set of haplotypes completely consistent with a set of reads is an explaining set for these reads if, and only if, every vertex of the read graph lies on a directed path from the source to the sink.

The Lander–Waterman model of sequencing is based on the assumptions that reads are random (uniformly distributed on a genome) and independent. Under this model, the probability that a genome of length n is completely covered by reads at coverage depth c is approximately (1 − e^{−c})^{n}, and the probability that a haplotype of relative frequency ρ in the population is completely covered is approximately (1 − e^{−cρ})^{n}.

If the condition of the proposition is violated, we can remove the violating set of reads to obtain a new set satisfying the condition. This amounts to discarding reads that either contain mistakes from the error correction or come from haplotypes that are at too low a frequency in the population to be fully sequenced. Thus the resolution is inherently a function of the number of reads.
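For intuition, such Lander–Waterman estimates can be computed directly. The sketch below (our illustration; the function names are ours) evaluates the full-coverage probability and inverts it to obtain the smallest haplotype frequency expected to be fully covered, a quantity in the spirit of the ρ_{99} reported for the patient samples.

```python
from math import exp, log

def coverage_prob(n_reads, read_len, genome_len, freq):
    """Lander-Waterman-style estimate of the probability that a
    haplotype at relative frequency `freq` is completely covered."""
    c = n_reads * read_len / genome_len        # total coverage depth
    return (1 - exp(-c * freq)) ** genome_len

def min_detectable_freq(n_reads, read_len, genome_len, conf=0.99):
    """Smallest haplotype frequency whose full coverage is expected
    with probability `conf` (conf=0.99 mirrors the rho_99 idea):
    solves (1 - exp(-c*rho))**genome_len = conf for rho."""
    c = n_reads * read_len / genome_len
    return -log(1 - conf ** (1.0 / genome_len)) / c
```

With 10,000 reads of length 100 on a 1 kb region this gives a threshold of roughly one percent, consistent with the resolution discussed later in the text.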

We are now left with finding a minimal explaining set of completely consistent haplotypes. Restricting to this subset of haplotypes reduces the computational demand of the problem significantly. The proposition implies that an explaining set of completely consistent haplotypes is precisely a set of paths in the read graph from the source to the sink, such that all vertices of the read graph are covered by at least one path. We call such a set of paths a cover of the read graph.

Every minimal cover of the read graph has the same cardinality, namely the size of the largest set of pairwise incomparable reads (an antichain in the reachability order of the read graph).

A minimal cover of the read graph can be computed by solving a maximum matching problem in an associated bipartite graph. This matching problem can be solved in time at worst cubic in the number of irredundant reads.
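A minimal sketch of this reduction (ours; vertices are labeled 0..n−1 and the source and sink are left implicit): the minimum number of covering paths equals n minus the size of a maximum matching in the bipartite graph built from the reachability relation.

```python
def min_path_cover_size(n, edges):
    """Minimum number of paths covering all n vertices of a DAG:
    n minus a maximum matching in the bipartite graph whose edges
    are the DAG's reachability (transitive closure) relation."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
    # Transitive closure by depth-first search from every vertex.
    reach = [set() for _ in range(n)]
    def dfs(start, u):
        for v in adj[u]:
            if v not in reach[start]:
                reach[start].add(v)
                dfs(start, v)
    for s in range(n):
        dfs(s, s)
    # Maximum bipartite matching by repeated augmenting-path search.
    match = [-1] * n            # match[v] = predecessor chosen for v
    def augment(u, seen):
        for v in reach[u]:
            if v not in seen:
                seen.add(v)
                if match[v] == -1 or augment(match[v], seen):
                    match[v] = u
                    return True
        return False
    matched = sum(augment(u, set()) for u in range(n))
    return n - matched
```

A single chain of three reads needs one path, while adding an isolated read forces a second path; this is the sense in which the cover size reflects the number of haplotypes needed.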

The minimal path cover obtained from the maximum matching algorithm is in general not unique. First, it provides a minimal chain decomposition of the graph. A chain is a set of vertices that lie on a common directed path; a chain decomposition partitions the vertices into chains.

Input: A set R of error-corrected, aligned reads

Output: A minimal set of explaining haplotypes for R

Procedure:

1. Construct the read graph G_{R}.

2. Compute a minimal chain decomposition of the read graph.

3. Extend the chains in the graph to paths from the source to the sink in G_{R}.

4. Output the set of haplotypes corresponding to the paths found in step 3.

The algorithm can easily be modified to produce a non-minimal set by constructing multiple chain decompositions and by choosing multiple ways to extend a chain to a path. We note that the set of all paths in the graph is generally much too large to be useful. For example, the HIV datasets give rise to up to 10^{9} paths, and in simulations we often found over 10^{12} different paths in the graph. Generating paths from minimal explaining sets is a reasonable way of sampling paths, as we will see below when discussing the simulation results.

Finally, if the conditions of the proposition are not satisfied, i.e., if the coverage is too low and the set of completely consistent haplotypes does not contain an explaining set, then that condition can be relaxed. This corresponds to modifying the read graph by adding edges between all non-overlapping reads. Algorithm 2 will then again find a minimal set of explaining haplotypes.

A virus population is a probability distribution on a set of haplotypes. We want to estimate this distribution from a set of observed reads.

Our inference is based on a statistical model for the generation of sequence reads from a virus population. Similar models have been used for haplotype frequency estimation. In this model, a haplotype h is first drawn according to the haplotype frequencies (p_{h})_{h∈H}. Second, a read

The virus population is represented by five genomes (top) of two different haplotypes (light versus dark lines). The probability distribution is

Let

We estimate

We have simulated HIV populations of different diversities and then generated reads from these populations by simulating the pyrosequencing procedure with various error rates and coverage depths. The first 1 kb of the HIV

The simulations show that Algorithm 1 reduces the error rate by a factor of 30. This performance is largely independent of the number of haplotypes in the population.

In order to assess the ability of Algorithm 2 to reconstruct 10 haplotypes from 10,000 error-free reads (yielding about 1500 irredundant reads) we generate increasing numbers of candidate haplotypes. This is achieved by repeatedly finding a minimal set of explaining haplotypes until either we reach the desired number of haplotypes or we are unable to find more haplotypes that are part of a minimal explaining set.

Even with more than 10^{10} total paths, the strategy of repeatedly finding minimal sets of explaining haplotypes is very efficient for haplotype reconstruction.

Up to 300 candidate haplotypes were generated using Algorithm 1 from 10,000 error free reads drawn from populations of size 10 at varying diversity levels. Displayed are two measures of the efficiency of haplotype reconstruction: the percent of the original haplotypes with exact matches among the reconstructed haplotypes (A), and the average Hamming distance (in amino acids) between an original haplotype and its closest match among the reconstructed haplotypes (B).

The performance of the EM algorithm for haplotype frequency estimation described above is measured as the Kullback–Leibler (KL) divergence between the original population

Haplotype frequencies were inferred using both the EM algorithm (circles) and a simple heuristic algorithm (triangles); the resulting distance from the correct frequencies is measured using KL divergence. Error bars give the interquartile range over 50 trials. The populations consisted of 10 haplotypes at equal frequency and 5% diversity. The input to the algorithms was a set of reads simulated from the population and the original 10 haplotypes.

In order to test the combined performance of the haplotype reconstruction and frequency estimation, our basic measure of performance is the proportion of the original population that is reconstructed within 10 amino acid differences. This measure, which we call φ_{10}, is defined as follows.
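One way to formalize such a measure (our reading; the paper's exact definition of φ_{10} may differ in detail) is the fraction of the original population's probability mass lying within a given Hamming distance of some reconstructed haplotype:

```python
def phi(orig, orig_freqs, recon, d=10):
    """Fraction of the original population's probability mass that
    lies within Hamming distance d of some reconstructed haplotype
    (one possible reading of the phi_10 measure)."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return sum(f for h, f in zip(orig, orig_freqs)
               if min(hamming(h, r) for r in recon) <= d)
```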

For the first simulation of combined performance, we consider error-free reads from populations consisting of between 5 and 100 haplotypes, each with equal frequency, at diversities between 3 and 8%. We simulated 10,000 error-free reads of average length 100 from these populations and ran haplotype reconstruction and frequency estimation.

The proportion of the population reconstructed within 10 amino acid differences (φ_{10}, “proportion close”) is shown. Here 10,000 error-free reads were sampled from populations with diversity between 3 and 8% and with between 5 and 100 haplotypes of equal frequency. Haplotypes were reconstructed and then frequencies were estimated.

For the second combined test, we tested all three steps: error correction, haplotype reconstruction, and frequency estimation. In order to model the miscorrection of errors, we ran ReadSim

Proportion of the population reconstructed within 10 amino acid differences (φ_{10}, “proportion close”) using haplotype reconstruction and frequency estimation. The original populations had 10 haplotypes of equal frequency at varying levels of diversity. Errors were randomly introduced into the simulated reads to mimic various levels of error correction.

Our second evaluation of population reconstruction is based on ultra-deep sequencing of drug-resistant HIV populations from four infected patients. Performance is measured by φ_{1}, which indicates the percent of the inferred population that matches a clonal sequence within one amino acid difference; this is used instead of φ_{10} in order to provide a more sensitive performance measure. In all samples, at least 51.8% of the inferred populations were within one amino acid difference of a clonal haplotype. Based on the present data, we cannot decide whether the additional inferred haplotypes went undetected by the Sanger sequencing, or whether they are false positives of the reconstruction method.

The column groups are: pyrosequencing data (Reads to Gaps/kb), haplotype reconstruction (Err/kb to Diversity), and comparison to clonal sequencing (Clones to φ_{1}).

Sample | Reads | Irred. | ρ_{99} | Gaps/kb | Err/kb | Min. cov. | Diversity | Clones | Avg. dist. | φ_{1}

V11909 | 5177 | 641 | 2.2 | 3.3 | 1.10 | 22 | 15.8 | 65 | 1.81 | 51.8

V54660 | 7777 | 228 | 1.5 | 2.3 | 1.67 | 4 | 1.0 | 32 | 0.34 | 99.6

V3852 | 4854 | 227 | 2.4 | 3.4 | 1.33 | 7 | 1.4 | 42 | 0.29 | 100.0

V2173 | 6304 | 354 | 1.8 | 2.3 | 1.31 | 4 | 2.3 | 26 | 0.81 | 86.6

The first four columns describe the pyrosequencing data: the number of reads, the number of irredundant reads, the expected frequency (in percent) of the least frequent haplotype we can expect to cover with 99% confidence, and the number of gaps per kb in the aligned data. The next three columns describe the reconstruction algorithm: the number of non-gap characters changed in error correction, the size of a minimal explaining set of haplotypes, and the diversity, measured as the expected number of amino acid differences among the estimated population. After error correction, reads were translated into amino acids. The last three columns describe the validation using (translated) clonal sequences: the number of clones sequenced, the average distance between the estimated population and the closest Sanger haplotype, and φ_{1}, the percentage of the estimated population that was close (up to 1 amino acid difference) to a clone.

We found many additional haplotypes in our analysis of the most complex sample, V11909.

Sanger freq. (%) | Pyro freq. (%) | Mutations

52.3 | 19.3 | M46I, I54V, G73I, I84V, L90M |

12.3 | 19.0 | M46I, I54V, G73S, I84V, L90M |

9.2 | 9.4 | M46I |

6.2 | 5.6 | |

4.6 | 7.1 | M46I, I54V, G73S, L90M |

4.6 | 1.9 | M46I, I54V, G73I, L90M |

3.1 | 5.8 | M46I, G73I, I84V, L90M |

3.1 | 0.0 | L33F, M46I, I54V, G73S, I84V, L90M |

1.5 | 1.9 | M46I, L90M |

1.5 | 0.0 | L33F, M46I, I54V, G73I, I84V, L90M |

1.5 | 0.0 | M46I, I54V, G73N, I84V, L90M |

0.0 | 4.9 | M46I, G73I |

0.0 | 4.7 | M46I, I84V |

0.0 | 4.0 | M46I, G73S, I84V, L90M |

0.0 | 3.1 | M46I, I54V, G73S, V82I, I84V, L90M |

0.0 | 2.9 | M46I, I54V, G73I, I84V |

0.0 | 2.9 | M46I, I50V, I54V, G73I, I84V, L90M |

0.0 | 2.0 | I84V |

0.0 | 1.4 | I54V, G73I, I84V, L90M |

0.0 | 1.2 | M46I, I50V, I54V, G73S, L90M |

0.0 | 1.1 | M46I, I54V |

0.0 | 1.0 | M46I, I50V, I84V, L90M |

0.0 | 0.5 | M46I, I54V, G73I |

0.0 | 0.3 | G73S, I84V, L90M |

Displayed are the patterns of resistance mutations for the 65 Sanger sequences and the estimated population for sample V11909. Mutation patterns were restricted to 15 positions in the protease (amino acids 23, 24, 30, 32, 33, 46, 48, 50, 53, 54, 73, 82, 84, 88, and 90) associated with PI resistance.

Pyrosequencing constitutes a promising approach to estimating the genetic diversity of a community. However, sequencing errors and short read lengths impose considerable challenges on inferring the population structure from a set of pyrosequencing reads. We have approached this task by identifying and consecutively solving three computational problems: error correction, assembly of candidate haplotypes, and estimation of haplotype frequencies. Our methods focus on the situation where a reference genome is available for the alignment of reads, as is the case for many important pathogens, including bacteria and viruses.

After read alignment, the procedure consists of three steps. First, error correction is performed locally. We take windows of fixed width over the aligned reads and cluster reads within the windows in order to resolve the local haplotype structure. This approach is based on previous methods

Haplotype reconstruction is based on two assumptions: consistency and parsimony. We require that each haplotype be constructible from a sequence of overlapping reads and that the set of explaining haplotypes be as small as possible. The Lander–Waterman model of sequencing implies lower bounds on the number of reads necessary to meet the first requirement. The fundamental object for haplotype reconstruction is the read graph. A minimal set of explaining haplotypes corresponds to a minimal path cover in the read graph, and this path cover can be found efficiently using combinatorial optimization. Moreover, the cardinality of the minimal path cover is an important invariant of the haplotype reconstruction problem related to the genetic diversity of the population.

We believe that these methods are also applicable to many metagenomic projects. In such projects, estimation of the diversity of a population is a fundamental question. The size of a minimal cover of a fragment assembly graph provides an intuitive and computable measure of this diversity.

We have validated our methods by extensive simulations of the pyrosequencing process, as well as by comparing haplotypes inferred from pyrosequencing data to sequences obtained from direct clonal sequencing of the same samples. Our results show that pyrosequencing is an effective technology for quantitatively assessing the diversity of a population of RNA viruses, such as HIV.

Resistance to most antiretroviral drugs is caused by specific mutational patterns comprising several mutations, rather than one single mutation. Thus, an important question that can be addressed efficiently by pyrosequencing is which of the resistance mutations actually occur on the same haplotype in the population. Since our methods avoid costly clonal sequencing of the HIV populations for determining the co-occurrence of HIV resistance mutations

The sample size of approximately 10,000 reads we have considered provides the opportunity to detect variants present in only 1% of the population. Pyrosequencing can produce 200,000 reads, and thus twenty populations could be sequenced to good resolution using a process less labor-intensive than limiting-dilution clonal sequencing of a single population to a similar resolution.

The simulations suggest that the method works best with populations that are suitably diverse. Intuitively, the information linking two reads together on the same haplotype decays rapidly in sections of the genome where there are few identifying features of that haplotype (as in a region of low diversity). In particular, repeats of sufficient length in the reference genome can completely destroy linkage information. At some point, however, the benefits of increased diversity are partially offset by the increased difficulty of the alignment problem: with more diverse populations or true indels, alignment to a single reference genome becomes less accurate.

The HIV

Since our computational procedure produces an estimate of the entire virus population, it allows the study of fundamental questions about the evolution of viral populations in general

In addition to the promising biological applications, there are many interesting theoretical questions about reconstructing populations from pyrosequencing data. The errors in pyrosequencing reads tend to be highly correlated, as they occur predominantly in homopolymeric regions. While this can make correction more difficult (a fact which can be counteracted by the use of quality scores), we believe that it can make haplotype reconstruction more accurate than if the errors were uniform. If errors are isolated to a few sites in the genome, fewer additional explaining haplotypes are needed than if the errors were distributed throughout. The exact relationship between the error process of pyrosequencing, error correction, and haplotype reconstruction is worthy of further study.

As pyrosequencing datasets can contain 200,000 reads, it is worthwhile to investigate how our methods scale to such large datasets. Haplotype reconstruction is the only step that is not immediately practical on such a large number of reads, since it is at worst cubic in the number of irredundant reads. However, problems of this size are approachable with our methods as follows.

The theoretical resolution of the algorithms depends on two factors: first, the ability to differentiate between errors and rare variants; and second, whether there are enough reads so that we can assemble all haplotypes. We have seen that the number of reads necessary for assembly scales with the inverse of the desired resolution: if

Number of reads | 1,000 | 5,000 | 10,000 | 50,000 | 100,000 | 200,000

Error correction resolution (%) | 3.00 | 1.20 | 0.90 | | |

Haplotype reconstruction resolution (%) | | | | 0.23 | 0.12 | 0.06

Displayed are the resolution of error correction (binomial test only) and of haplotype reconstruction as a function of the number of reads. The resolution of the error correction is defined as the smallest frequency of a mutation that will be visible over the background error rate; it is calculated with error rate ε = 0.0025 and significance level α = 0.001. The resolution of the haplotype reconstruction (derived from the Lander–Waterman model) is the smallest haplotype frequency expected to be entirely covered by reads. For small numbers of reads, haplotype reconstruction is the limiting factor, but beyond approximately 35,000 reads, error correction is the limiting factor.

The limited resolution of error correction combined with the elimination of redundant reads makes haplotype reconstruction feasible for large datasets. For example, error correction on 200,000 reads with ε = 0.0025 and α = 0.001 will erase all variants with frequency below 0.365%.

Current and future improvements to pyrosequencing technology will lead to longer reads (250 bp), more reads, and lower errors. However, in order for huge numbers of reads to be of great help in the ultra-deep sequencing of a population, the error rates must also decrease. The performance of our methods as read length varies is an important question, given the availability of sequencing technologies with different read lengths (e.g., Solexa sequencing with 30 to 50 bp reads) and the desire to assemble haplotypes of greater size (e.g., the entire 10 kb HIV genome).

Notice that haplotype reconstruction seems to be quite good locally.

We use two statistical tests for locally detecting distinct haplotypes. The first test analyzes each column of the multiple alignment window. We write

Next, we test pairs of mutations in two different alignment columns u and v.

The procedure tests all columns in the window using the binomial test and then all pairs of columns using Fisher's exact test. This can lead to over-counting of the number of haplotypes as follows. Suppose that in columns 1 and 2 the consensus base is A, but that there is a mutation C in some of the reads in each column. If both mutations are significant by themselves, this is evidence of three haplotypes in the window. If they are also significant together, this would be evidence of four haplotypes. However, there could be only two true haplotypes at these two positions: AA and CC. To correct for this, we subtract two from the count whenever two significant mutations are significant together and always occur on exactly the same set of reads.

We do not explicitly address the multiple comparisons problem associated with this testing procedure here and regard the significance levels of the tests as parameters of Algorithm 1. We account for the quality scores associated with each base by using (rounded) weighted counts in the test statistics. Gaps are treated as unknown bases and represented by a special character with quality score zero. We found that an error rate of 0.0025, a p-value of 0.001, and a window size of 24 provided the best error correction. These parameters can be tuned as follows. First, the window size should be chosen to best help the clustering. A large window provides more power since there are more identifying mutations, but also can be more difficult to cluster since many reads will only partially overlap the window.

Next, the p-value for the tests and the error rate should be adjusted to prevent false positives and negatives. The number of mutations required in a column before the mutation is considered significant can be calculated from the binomial distribution above. For example, with 10,000 reads of length 100 in a genome of length 1000, there will be approximately
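The threshold computation described here can be sketched as follows (a stdlib-only illustration; the linear search over counts is our own):

```python
from math import comb

def min_significant_count(coverage, err_rate=0.0025, alpha=0.001):
    """Smallest number of mutated bases in a column of `coverage`
    aligned reads that the binomial test declares significant at
    level `alpha`, assuming per-base error rate `err_rate`."""
    def tail(k):   # P(X >= k) for X ~ Binomial(coverage, err_rate)
        return sum(comb(coverage, i) * err_rate**i
                   * (1 - err_rate)**(coverage - i)
                   for i in range(k, coverage + 1))
    k = 0
    while tail(k) >= alpha:
        k += 1
    return k
```

With the parameters above, 100× coverage requires 4 mutated bases before a column is called significant, i.e., a resolution of roughly 4%; the required fraction shrinks as coverage grows.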

Notice that the power of the error correction grows very slowly.

Suppose the read graph G_{R}

The vertex set of the read graph G_{R} is the set of irredundant reads R_{irred}. Part (1) is then a direct application of Dilworth's theorem. For the matching construction, we build a bipartite graph with two copies of R_{irred}; there is an edge between a read in the first copy and a read in the second copy if there is a directed path between them in G_{R}. A maximum matching in this bipartite graph then yields a minimal path cover of G_{R}.

For the time complexity, notice that building the read graph G_{R} takes time O(n^{2}) in the number n of irredundant reads. Building the associated bipartite graph is equivalent to finding the transitive closure of the read graph and thus takes time O(n^{3}); the matching algorithm takes time O(n^{5/2}). Depending on the structure of the graph, either the transitive closure or the matching problem can dominate, but both are of complexity O(n^{3}).

We use an EM algorithm to compute the maximum-likelihood estimates of the haplotype frequencies.
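A minimal sketch of such an EM iteration (ours; it assumes the read-given-haplotype likelihoods are precomputed, e.g. as 0/1 consistency indicators, rather than following the paper's exact notation):

```python
def em_haplotype_freqs(L, iters=200):
    """EM estimation of haplotype frequencies.
    L[r][h] = likelihood of read r under haplotype h (e.g. 1.0 if
    the read is consistent with the haplotype, 0.0 otherwise)."""
    n_hap = len(L[0])
    p = [1.0 / n_hap] * n_hap                 # uniform starting point
    for _ in range(iters):
        counts = [0.0] * n_hap
        for row in L:                         # E-step: responsibilities
            w = [p[h] * row[h] for h in range(n_hap)]
            tot = sum(w)
            if tot > 0:                       # skip unexplained reads
                for h in range(n_hap):
                    counts[h] += w[h] / tot
        total = sum(counts)
        if total == 0:
            break
        p = [c / total for c in counts]       # M-step: renormalize
    return p
```

Reads consistent with several haplotypes are split fractionally between them in the E-step, which is what distinguishes the estimator from simple read counting.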

Our starting point is the first 1 kb of the wild type sequence of the HIV

We report the expected value of the Hamming distance between two haplotypes drawn from a population as our basic measure of the diversity of a population. This statistic, which we call simply “diversity,” can be thought of as a version of the Simpson measure
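This diversity statistic is straightforward to compute; a small sketch (ours):

```python
def diversity(haplotypes, freqs):
    """Expected Hamming distance between two haplotypes drawn
    independently from the population (the text's 'diversity')."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    return sum(freqs[i] * freqs[j] * hamming(h, g)
               for i, h in enumerate(haplotypes)
               for j, g in enumerate(haplotypes))
```

A two-haplotype population at equal frequency differing in one position has diversity 0.5, and a monomorphic population has diversity 0.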

We use ReadSim

For the simulations of haplotype reconstruction, we generate pyrosequencing reads using the model described above. Estimated haplotype frequencies p_{h} below 10^{−6} were set to zero.

We also test a simple alternative to the EM algorithm, in which the frequency of each haplotype h is estimated as proportional to the number of reads consistent with h. The two estimators are compared by their KL divergence d_{KL} to the true frequencies.

Virus populations derived from four treatment-experienced patients between 2000 and 2005 were sequenced using both pyrosequencing and limiting dilution Sanger sequencing. The plasma HIV-1 RNA levels in the four plasma samples were each greater than 100,000 copies/ml as determined using the VERSANT HIV-1 RNA assay

Briefly, ultra-deep pyrosequencing was performed on four RT-PCR products from RNA extracted from cryopreserved plasma samples. The median number of cDNA copies prior to sequencing was 100, with an interquartile range of 75 to 180. The resulting datasets consisted of between 4854 and 7777 reads of average length 105 bp. Reads were error-corrected (Algorithm 1) and then translated into amino acids; thus, the haplotype reconstruction and frequency estimation steps operate at the amino acid level. For haplotype reconstruction, Algorithm 2 was run repeatedly until all, or at most 10,000, candidate haplotypes were found.

For the sample with the greatest diversity (V11909), the unamplified cDNA product was serially diluted prior to PCR amplification. Bidirectional sequencing was performed directly on 37 amplicons derived from the 1/30 cDNA dilutions and 31 amplicons derived from the 1/100 cDNA dilutions. Three sequences were discarded because of incomplete coverage. For the other three samples, we used the Sanger method to sequence a total of 32, 42, and 26 plasmid subclones per sample. Some of the sequences obtained from limiting dilutions contained mixtures of several clones. In this case, in order to measure the Hamming distance between an inferred haplotype and a clonal haplotype with ambiguous bases, we used the minimum distance over all possible translations of the ambiguous haplotype.

Displayed are two different chain decompositions of the read graph for 1000 reads from a population of 5 haplotypes at 3% diversity. The bottom five lines correspond to reads matching a haplotype uniquely; the top line corresponds to reads matching several haplotypes. One decomposition gets one haplotype entirely correct (top, black); the other gets two different haplotypes essentially correct (bottom, green and yellow). In this way, taking multiple chain decompositions allows us to reconstruct all haplotypes.


Up to 1000 candidate haplotypes were generated using Algorithm 2 from 10,000 error free reads drawn from populations of size 10, 20, and 50 (subfigures (A), (B), and (C), respectively) at varying diversity levels. Displayed is a measure of the efficiency of haplotype reconstruction: the average Hamming distance (in amino acids) between an original haplotype and its closest match among the reconstructed haplotypes.


Shown is the resulting error after error correction on populations with 4% diversity. Populations with up to 50 haplotypes of equal frequency were created. The program ReadSim was used to simulate pyrosequencing with an error rate of 3 to 6 errors per kb (after alignment). Error correction successfully reduced the error rate by a factor of approximately 30.


Displayed is the computed lower bound on the population size from simulations with varying error rates and numbers of reads. Population diversity ranged from 3 to 7%. The lower bound is computed as the minimal size of a cover of the read graph. Error bars give interquartile ranges over 100 trials at different diversity levels. This estimated lower bound is quite accurate for error free reads; it seems to increase linearly with the number of reads if errors are introduced.


Hopcroft JE, Karp RM (1973) An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM J Comput 2: 225–231.