The authors have declared that no competing interests exist.

Conceived and designed the experiments: EMV SDWF. Performed the experiments: EMV SDWF. Analyzed the data: EMV SDWF. Contributed reagents/materials/analysis tools: EMV SDWF. Wrote the paper: EMV SDWF.

Identifying the source of transmission using pathogen genetic data is complicated by numerous biological, immunological, and behavioral factors. A large source of error arises when there is incomplete or sparse sampling of cases. Unsampled cases may act as either a common source of infection or as an intermediary in a transmission chain for hosts infected with genetically similar pathogens. The probability of common source or intermediate transmission events is difficult to quantify, which has hindered the development of statistical tests to confirm or reject putative transmission pairs with genetic data. We present a method to incorporate additional information about an infectious disease epidemic, such as incidence and prevalence of infection over time, to inform estimates of the probability that one sampled host is the direct source of infection of another host in a pathogen gene genealogy. These methods enable forensic applications, such as source-case attribution, for infectious disease epidemics with incomplete sampling, which is usually the case for high-morbidity community-acquired pathogens like HIV, influenza, and dengue virus. These methods also enable epidemiological applications such as the identification of factors that increase the risk of transmission. We demonstrate these methods in the context of the HIV epidemic in Detroit, Michigan, and we evaluate the suitability of current sequence databases for forensic and epidemiological investigations. We find that currently available sequences collected for drug resistance testing of HIV are unlikely to be useful in most forensic investigations, but are useful for identifying transmission risk factors.

Molecular data from pathogens may be useful for identifying the source of infection and identifying pairs of individuals such that one host transmitted to the other. Inference of who acquired infection from whom is confounded by incomplete sampling, and given genetic data only, it is not possible to infer the direction of transmission in a transmission pair. Given additional information about an infectious disease epidemic, such as incidence of infection over time, and the proportion of hosts sampled, it is possible to correct for biases stemming from incomplete sampling of the infected host population. It may even be possible to infer the direction of transmission within a transmission pair if additional clinical, behavioral, and demographic covariates of the infected hosts are available. We consider the problem of identifying the source of infection using HIV sequence data collected for clinical purposes. We find that it is rarely possible to infer transmission pairs with high credibility, but such data may nevertheless be useful for epidemiological investigations and identifying risk factors for transmission.

Phylogenetic trees reconstructed from sequences of pathogens contain information on the past transmission dynamics that would be difficult, if not impossible, to obtain through other means. Over the past two decades, a number of approaches have been proposed to extract epidemiologically relevant information from viral phylogenies, particularly from highly variable RNA viruses such as HIV-1, hepatitis C virus, and influenza A virus

Although many studies have focused on the ‘phylodynamics’

Recently, there has been rapid development of methods to identify transmission sources under the assumption of complete sampling, i.e. under the assumption that every infected individual is represented in the phylogeny. These methods have yielded many valuable insights into the spread of nosocomial infections

If the host

Due to the problems involved in incomplete sampling, relatively little work has been performed to identify potential sources of infection - i.e. understanding transmission at an individual level - using population-level datasets collected for clinical or surveillance purposes. A notable exception is a study of HIV-positive men who have sex with men (MSM) in Brighton, UK

In the case of incomplete sampling, calculating the probability that a putative transmission pair is real is equivalent to calculating the probability that there are zero unsampled intermediaries between the pair in the viral phylogeny. Calculating this probability is complex, but possible, provided a realistic model of the epidemic process and given good data about incidence and prevalence of infection. This paper is concerned with calculating the probability, henceforth called the

To demonstrate the utility of infector probabilities to the analysis of real epidemic data, we have simulated a dataset based on the real HIV epidemic among MSM in Detroit, Michigan. Through a simulation-based analysis, we use our solution of the infector probabilities to address the following questions:

Is it possible to infer transmission events from HIV phylogenetic data with high accuracy?

Are widely available HIV sequence data collected for drug resistance testing useful for forensic investigations of who infected whom?

Are estimated infector probabilities useful for epidemiological investigations? Can our methods detect increased transmission rates during early/acute HIV infection (EHI) or other variables that determine heterogeneity in transmission rates?

This section is focused on the derivation of a

Our solutions employ a population genetic model that assumes that the population size is large, so the model may be biased for very small epidemics or outbreaks. In reality, all of the inputs into our solution of

Our approach makes use of coalescent theory, which is based on the retrospective modelling of gene tree structure

To give intuition for the method, we first illustrate a simple example of an epidemic within a homogeneous population. The variable

We calculate the probability that host

What is the probability that

When an unsampled individual

To demonstrate how sampling plays a central role in determining the extent to which cherries represent direct transmissions, we will consider a large sample size, such that we can model the number of cherries as well as the number of cherries that correspond to direct transmissions as ordinary differential equations. Previously

To determine the number of cherries that represent direct transmission,

Some analytical insights into how different parameters affect the proportion of cherries that are associated with direct transmission can be obtained under the assumption that the number of infected hosts,

This can be solved using separation of variables, with the constant of integration determined by the initial condition

Substituting this solution of

Solving the above for

The solution for

Similarly, at equilibrium we can substitute

These results demonstrate that at equilibrium, the fraction of sequences in cherries is independent of sampling fraction (

The very simple expressions in

The solutions described below are applicable to a large class of infectious disease process models which describe the incidence and prevalence of infection over time. The host population is not assumed to be homogeneous, but can have arbitrary discrete structure. Each infected host can occupy any of

The discrete states that a host may occupy will be indexed by variables

The process model will be denoted by the tuple

In

The coalescent model described here is complex, so a visual aid is provided in

At some point in the past, every sampled host

Derivations of

Consider the node in

Suppose that at

The function

Suppose that at retrospective time

In matrix notation, the derivative of the vector

Suppose that at a time _{i}

Suppose that at retrospective time _{k}

This motivates the following equation for the derivative of _{i}

At an ancestral node of _{i}

Software for calculating _{ij}

We simulated HIV gene genealogies using an individual-based stochastic simulation based on the epidemic model presented in

The HIV model is illustrated in

Left: Simulated number of infections over time. Infections are aggregated by stage of infection (top) and by diagnosis status (bottom). Right: Flow diagram showing the progression of infected individuals through 5 stages of infection, diagnosis, and death. The color of each compartment corresponds to diagnosis status in the prevalence figures on the left. The color of each outline corresponds to stage of infection in the prevalence figures on the left. The per-capita rate of each state transition is shown over the corresponding arrow.

An essential aspect of this model is how incidence

In the discrete individual-based simulations, the time to the next transmission event is exponentially distributed with rate

Note that the simulation may be put in the canonical form

To reconstruct a gene genealogy from the simulation, we iteratively build a binary tree by adding a new branch at each transmission. The logic underlying tree reconstruction is given in
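The iterative construction described above can be illustrated with a minimal Python sketch. It assumes the simulation emits a time-ordered list of `(time, donor, recipient)` transmission events. The names `Node`, `build_genealogy`, and `cherries` are hypothetical and are not taken from the authors' software. At each transmission, the donor's current tip becomes an internal node with two new tips, one for the donor and one for the recipient.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    time: float                      # time at which this node was created or last branched
    host: Optional[str] = None       # host label for tips, None for internal nodes
    children: list = field(default_factory=list)


def build_genealogy(index_case, events):
    """Build a binary transmission genealogy.

    events: iterable of (time, donor, recipient), sorted by increasing time.
    Returns the root node and a dict mapping each host to its current tip.
    """
    root = Node(time=0.0, host=index_case)
    tip = {index_case: root}
    for t, donor, recipient in events:
        parent = tip[donor]
        # The donor's tip becomes an internal node dated at the transmission.
        parent.time = t
        parent.host = None
        donor_tip = Node(time=t, host=donor)
        recipient_tip = Node(time=t, host=recipient)
        parent.children = [donor_tip, recipient_tip]
        tip[donor] = donor_tip
        tip[recipient] = recipient_tip
    return root, tip


def cherries(root):
    """Return sorted host pairs whose tips share an immediate parent."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) == 2 and all(not c.children for c in node.children):
            pairs.append(tuple(sorted(c.host for c in node.children)))
        stack.extend(node.children)
    return sorted(pairs)


events = [(1.0, "A", "B"), (2.0, "A", "C"), (3.0, "B", "D")]
root, tip = build_genealogy("A", events)
print(cherries(root))
```

In this complete genealogy every cherry corresponds to a direct transmission. Pruning unsampled tips from such a tree is what produces cherries whose members are connected only through unsampled hosts.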

In reality, we do not observe the complete transmission genealogy, but rather a small subsample. To model sampling, we randomly sampled

As noted above, the calculation of

To simulate genetic sequence alignments corresponding to the simulated genealogical relationships described in the previous section, we used the program Seq-Gen v.1.3.3

For each sequence alignment, we used relaxed-clock Bayesian methods

Results for the HIV model presented below that are based on a true transmission genealogy use 20 independent simulations. Results that use simulated sequences are based on only a single simulation, but use 50 trees sampled from the posterior distribution.

Define

Code for all simulation experiments can be found at

Synthetic HIV datasets were generated to match the data described previously in

The number of HIV infections over time is shown in

Infector probabilities for the HIV simulations are shown in

On the left, infector probabilities are calculated for the true transmission genealogy in 20 independent simulated HIV epidemics and samples of 662 individuals. On the right, infector probabilities are based on simulated sequence data for a single simulation and a sample of 662 individuals. Data are pooled from 50 trees sampled from the Bayesian phylogenetic posterior distribution. Middle: The estimated infector probabilities (x-axis) versus whether a transmission actually occurred (hash marks) for all pairs of sampled individuals in the HIV simulation. The red line shows a local average of the frequency of transmission events. The green line shows a linear regression of true transmission events (coded zero or one) on the estimated infector probability. Histograms show the frequency of estimated infector probabilities when transmissions happen (top) and when they do not (bottom).

Histograms in

Left: Estimated infector probabilities based on the true transmission genealogy versus infector probabilities based on a sample of trees from the Bayesian phylogenetic posterior distribution. The red line shows

Comparing aggregated infector probabilities can be used to detect systematic differences in transmission rates between categories of infected individuals. Relative values of infector probabilities are not equivalent to relative transmission rates, and these statistics should not be interpreted as estimates of relative transmission rates. However, we do expect relative infector probabilities to trend in the same direction as transmission rates.
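One way to carry out such a comparison is sketched below in Python. The sketch assumes a matrix `w` in which `w[i][j]` is the estimated probability that sampled host `i` infected sampled host `j`, and a list `category` giving a label for each host. Both names, and the choice of the median as the summary statistic, are illustrative assumptions rather than the authors' implementation. Each row sum is the expected number of within-sample transmissions attributed to that host.

```python
from statistics import median


def expected_transmissions(w):
    """Row sums of the infector-probability matrix.

    w[i][j] is the (assumed) probability that host i infected host j.
    """
    return [sum(row) for row in w]


def median_by_category(w, category):
    """Median expected number of within-sample transmissions per category."""
    grouped = {}
    for total, label in zip(expected_transmissions(w), category):
        grouped.setdefault(label, []).append(total)
    return {label: median(values) for label, values in grouped.items()}


# Toy values for illustration only; these are not results from the paper.
w = [
    [0.0, 0.2, 0.1],
    [0.05, 0.0, 0.0],
    [0.0, 0.01, 0.0],
]
category = ["EHI", "chronic", "chronic"]
print(median_by_category(w, category))
```

The ratio of two such medians indicates the direction of a difference between categories. As noted above, it should not be read as an estimate of the ratio of transmission rates.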

In the simulations, EHI transmit at a greater rate than chronic infections by a factor of 12.4. If we compare medians of

To demonstrate the feasibility of detecting covariates that impact transmission rates that are not explicitly included in the calculation of

The high risk category transmits at a rate 10× that of the low risk category. Right: A quantile-quantile comparison of the distributions of log infector probabilities. A quantile-quantile comparison for undiagnosed and diagnosed is shown at bottom right.

To validate the numerical accuracy of our derivation of

In all, 1158 SIRS epidemics were simulated and 59194 potential transmission pairs were evaluated, yielding 3168 within-sample transmission pairs. A sample of 5% of infections was taken at peak prevalence and endemic equilibrium. Bias was not detected for either sampling time (t-test

We have presented a method for calculating the probability that one host infected another (the infector probability) in a pathogen phylogeny. This method makes use of extra epidemiological information, such as the incidence and prevalence of infection over time. The method thereby accounts for the possibility that unsampled infected individuals act as either intermediaries or as a common source of infection for a putative donor and recipient of infection. Any infectious disease model that is used to estimate incidence and prevalence of infection implies a relationship between pathogen gene genealogies and infector probabilities. This is the first method which makes the connection between infector probabilities, infectious disease models, and pathogen genealogies explicit. The practical importance of this method is that it enables the estimation of infector probabilities in situations where there is incomplete sampling, which is more often than not the case for high-prevalence community-acquired pathogens like HIV.

Once

We have also demonstrated the method using a simulated HIV dataset in which we know who actually infected whom. The dataset was designed to mimic a real HIV dataset, both in terms of how patients are sampled and in the epidemiology of infection in the simulated community; phylogenies were estimated from simulated sequences in order to realistically reproduce phylogenetic error. The method is subject to bias due to finite population size and violation of model assumptions. Nevertheless, we have not detected substantial bias in realistic simulation experiments, which suggests that bias will be quite small for applications provided an appropriate epidemiological model is used.

It is also important to note that this simulation study assumed perfect knowledge of incidence and prevalence of infection over time as well as perfect knowledge of the stage of infection at the time each infected host is sampled. In reality, there will be substantial uncertainty regarding both, which would add error to estimated infector probabilities. Even though there is very high variance in the infector probabilities based on estimated phylogenies, the infector probability averaged across estimated phylogenies performs similarly as a classification statistic, as measured by the area under the ROC curve (AUC).
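The classification comparison can be sketched in Python as follows. The sketch assumes one list of infector probabilities per posterior tree, aligned over the same candidate pairs, and a 0/1 label recording whether each pair is a true transmission. The function names are hypothetical. The AUC is computed from its rank interpretation: the probability that a randomly chosen true pair scores higher than a randomly chosen false pair, with ties counted as one half.

```python
def mean_over_trees(per_tree_probs):
    """Average infector probabilities across trees, pair by pair.

    per_tree_probs: list of lists, one inner list per tree, all the same length.
    """
    n_trees = len(per_tree_probs)
    return [sum(column) / n_trees for column in zip(*per_tree_probs)]


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one true and one false pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties contribute one half
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only; these are not results from the paper.
per_tree = [
    [0.9, 0.1, 0.2, 0.3],
    [0.7, 0.3, 0.6, 0.1],
]
truth = [1, 0, 1, 0]
averaged = mean_over_trees(per_tree)
print(auc(averaged, truth))
```

The pairwise loop is quadratic in the number of pairs. For large samples a rank-based computation would be preferable, but the quadratic form makes the definition explicit.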

There has been controversy

Our simulation experiments have demonstrated how infector probabilities are sensitive to many factors in addition to the structure of the phylogeny, such as details about who is sampled, when they are sampled, and the state of infected individuals at the time of sampling. Details of the epidemic process such as incidence and prevalence over time also influence infector probabilities. Most clustering methods employ a threshold genetic or evolutionary distance, but, as shown in

Even though transmission events could not be inferred with high confidence, the application of infector probabilities to epidemiological investigations of HIV seems promising in light of the results in

Our models have additional utility beyond the calculation of infector probabilities. Similar methods could be used to calculate the distribution of the number of unsampled infected individuals in a transmission chain between two sample units. For example, this has relevance for studies of the evolution of virulence of HIV

This method for calculating infector probabilities is based on a population genetic model that makes assumptions about the epidemiological and immunological process. The model does not account for the potential for superinfection, recombination, or complex within-host evolutionary dynamics which could confuse phylogenetic inference and decrease confidence in putative transmission links. Furthermore, the model does not account for multiple- or serial-sampling of a single infected host. Future research is needed on methods for relaxing these assumptions as well as for quantifying error that may arise from violation of model assumptions in realistic settings.

Simulated number of infections through time for the SIRS model.

(PNG)

Regression of true transmission events (ticks on axis coded zero or one) on calculated infector probabilities

(PNG)

Top: Model structure and population size over time for a model with three states. Blue arrows represent birth within and between states. Red arrows represent migration between states. Bottom: Regression of true transmission events (ticks on axis coded zero or one) on calculated infector probabilities. At right is shown the ROC curve if infector probabilities are used for classification of the event that a putative transmission pair is real.

(PNG)

The log of the expected number of transmissions to at least one other sample unit is shown in aggregated form for different stages of infection and diagnosis status (top). Each stage is represented twice in this figure because an infected individual may be undiagnosed or diagnosed (labels prefixed with ‘D.’). A sampled lineage from an undiagnosed individual corresponds to a situation in which a pathogen is sequenced at the same time that the patient is diagnosed. A quantile-quantile comparison of the distributions of log infector probabilities for EHI and chronic stages is shown at bottom left. A quantile-quantile comparison for undiagnosed and diagnosed is shown at bottom right.

(PDF)

The cophenetic distance between each pair of tips in the HIV gene genealogy is shown versus the calculated infector probabilities.

(PNG)

Additional simulation experiments.

(PDF)