Conceived and designed the experiments: CC. Performed the experiments: CC. Analyzed the data: AA CDS. Wrote the paper: AA CDS. Developed the theoretical methodology and responsible for the computational aspects: AA. Developed the theoretical methodology and responsible for overall supervision: CDS.

The authors have declared that no competing interests exist.

Retroviral vectors are widely used in gene therapy to introduce therapeutic genes into patients' cells, since, once delivered to the nucleus, the genes of interest are stably inserted (integrated) into the target cell genome. There is now compelling evidence that integration of retroviral vectors follows non-random patterns in mammalian genomes, with a preference for active genes and regulatory regions. In particular, Moloney Leukemia Virus (MLV)–derived vectors show a tendency to integrate in the proximity of the transcription start site (TSS) of genes, occasionally resulting in the deregulation of gene expression and, where proto-oncogenes are targeted, in tumor initiation. This has drawn the attention of the scientific community to the molecular determinants of the retroviral integration process as well as to statistical methods to evaluate the genome-wide distribution of integration sites. In recent approaches, the observed distribution of MLV integration distances (IDs) from the TSS of the nearest gene is assumed to be non-random by empirical comparison with a random distribution generated by computational simulation procedures. To provide a statistical procedure to test the randomness of the retroviral insertion pattern, we propose a probability model (Beta distribution) based on IDs between two consecutive genes. We apply the procedure to a set of 595 unique MLV insertion sites retrieved from human hematopoietic stem/progenitor cells. The goodness-of-fit test shows that this distribution is suitable for the observed data. Our statistical analysis confirms the preference of MLV-based vectors to integrate in promoter-proximal regions.

Understanding how retroviral vectors (such as Moloney Leukemia Virus–based vectors) integrate in the human genome has become a major safety issue in the field of gene therapy, since a concrete risk of developing tumors associated with the integration process was assessed in the clinical setting. Moloney Leukemia Virus–based vectors are apparently characterized by a non-random integration pattern, with a preference for the vicinity of active gene transcription start sites. We approach the problem of non-random retroviral integration from a probabilistic point of view. We model a normalized integration distance from the transcription start site of the nearest upstream or downstream gene. From this model, we derive a simple and straightforward testing procedure to assess whether the transcription start site of a given gene attracts integration events. Our approach overcomes the issues of differing gene length, gene orientation, and gene density, which are often critical in analyzing integration distances from transcription start sites. The approach is tested on real experimental data retrieved from human hematopoietic stem/progenitor cells.

The transfer of a therapeutic gene into somatic cells (gene therapy) is a promising medical approach for the management of many inherited and acquired diseases. Among several systems developed for gene delivery, replication-defective viral vectors derived from retroviruses are the most widely used. In fact, after infecting a target cell, retroviral vectors deliver the therapeutic gene directly to the cell nucleus and stably insert it into the host cell genome; the process is commonly referred to as “integration”.

It has been observed that retroviral vectors integrating in the proximity of the transcription start site (TSS) of host genes may enhance or disrupt normal transcription.

Understanding the location preferences of retroviruses is crucial for evaluating both the safety profile of a therapeutic vector and the integration process.

Just a few years ago, retrovirus integration was believed to be random, and the chance of accidentally activating a gene was considered remote. Recent studies based on cellular and animal models (reviewed in

The empirical comparison between simulated (dotted line) and observed distribution leads the authors to conclude in favour of non-randomness of retroviral integration.

In this paper, we first show that a bell-shaped distribution is not necessarily evidence of non-randomness. Then we introduce a new distance measure based on a normalization of the conventional ID. This new variable is assumed to follow a Beta distribution, thus allowing us to build a direct testing procedure for the non-random integration hypothesis. Applied to real experimental data, the estimated parameters provide a statistical measure confirming retroviral integration preferences for the proximity of TSSs.

Each retroviral integration is defined by its nucleotide position on the chromosome (UCSC Genome Browser, human genome assembly March 2006, hg18 release,

nearest gene: nearest 3′ or 5′ end of a gene

nearest upstream TSS

nearest downstream TSS

These definitions are applied to integrations landing within transcriptional units (intragenic) as well as to insertions mapping between two genes (intergenic). Integration distances from the nearest gene TSS and from the nearest 5′ and 3′ TSSs are then computed. IDs assume positive or negative values when the insertion nucleotide is located downstream or upstream of the TSS, respectively.
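The distance definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes TSS coordinates are given as a sorted list on a single chromosome, treats "upstream" as the lower chromosome coordinate, and ignores gene strand and gene length. The function names and the toy coordinates are hypothetical.

```python
from bisect import bisect_right

def flanking_tss(position, sorted_tss):
    """Return (nearest upstream TSS, nearest downstream TSS) for a position.

    Assumption: 'upstream' means lower chromosome coordinate.
    Either element is None when the position lies beyond the first/last TSS.
    """
    i = bisect_right(sorted_tss, position)
    upstream = sorted_tss[i - 1] if i > 0 else None
    downstream = sorted_tss[i] if i < len(sorted_tss) else None
    return upstream, downstream

def signed_distance(position, tss):
    # Positive when the insertion lies downstream of the TSS, negative when upstream.
    return position - tss

def nearest_tss_distance(position, sorted_tss):
    """Signed distance from the closer of the two flanking TSSs."""
    flanks = [t for t in flanking_tss(position, sorted_tss) if t is not None]
    return min((signed_distance(position, t) for t in flanks), key=abs)

# Hypothetical TSS coordinates (bp) on one chromosome.
tss_positions = [1_000, 5_000, 12_000]
print(flanking_tss(6_500, tss_positions))
print(nearest_tss_distance(6_500, tss_positions))
print(nearest_tss_distance(10_000, tss_positions))
```

A strand-aware version would flip the sign for genes on the reverse strand; that refinement is left out here to keep the sketch short.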

Notice that in this particular case the transcription start site (TSS) of the nearest gene coincides with the nearest downstream (3′) TSS.

Let H_{0}: integration is random, and H_{1}: integration is non-random.

Starting from a common annotation criteria _{j}_{(X)} represents the TSS position of the nearest annotated gene _{k}

Let us now suppose random integration, that is ^{9}

The solid line is the kernel density estimate, plotted within a ±30 kb window for better visualization of the “bell-shaped” curve.

In this picture, six hypothetical genes with different lengths and orientations (blue arrows) are scattered along a chromosome (x-axis). The purple piecewise linear function represents the distance from the TSS of the nearest gene. This function has discontinuities exactly in the middle of the intervals between two consecutive genes. Even assuming a series of random integrations in this setting, we obtain a distribution of distances from TSSs (projected on the y-axis, gray plot) which is a mixture of Uniform distributions. As a consequence, the bell-shaped curve is observed. Notice that the ID distribution is asymmetric around zero, since gene orientations and gene lengths determine which TSS is considered in computing the distances (a symmetric distribution would be observed by plotting the distance from the nearest TSS instead of the nearest gene TSS, data not shown).
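The mixture-of-Uniforms argument can be checked with a small simulation. The sketch below is an illustration under simplifying assumptions, not a reproduction of the figure: TSS coordinates are drawn at random on a hypothetical 100 Mb chromosome rather than taken from an annotation, and the distance is measured from the nearest TSS (the symmetric case), ignoring gene length and orientation. All names and parameter values are hypothetical.

```python
import random
from bisect import bisect_right

def simulate_random_ids(n_tss=2_000, n_integrations=20_000,
                        chrom_length=100_000_000, seed=1):
    """Signed distances to the nearest TSS under uniform (random) integration."""
    rng = random.Random(seed)
    # Hypothetical TSS coordinates; a real analysis would use genome annotation.
    tss = sorted(rng.sample(range(chrom_length), n_tss))
    distances = []
    for _ in range(n_integrations):
        x = rng.randrange(chrom_length)        # uniform integration position
        i = bisect_right(tss, x)
        flanking = tss[max(i - 1, 0):i + 1]    # one or two neighbouring TSSs
        distances.append(min((x - t for t in flanking), key=abs))
    return distances

def count_in_band(distances, low, high):
    """Number of integrations with low <= |distance| < high."""
    return sum(low <= abs(d) < high for d in distances)

ids = simulate_random_ids()
near = count_in_band(ids, 0, 10_000)
far = count_in_band(ids, 40_000, 50_000)
print(f"|ID| < 10 kb: {near}   40 kb <= |ID| < 50 kb: {far}")
```

Although every position is equally likely, short distances are more frequent than long ones, because every inter-TSS interval contributes to the bins near zero while only the long intervals contribute to the distant bins.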

We next build a new testing procedure for non-randomness. We start by normalizing the r.v. _{D}_{U}

Let ^{*} be a new r.v. given by:^{*} now becomes independent of ^{*}≤1. In statistical terms, we assume as a convenient distribution for ^{*} the Beta distribution, which is one of the most widely used in clinical, biological, and genetic settings (Bayesian frameworks ^{*}≤1 and 0 otherwise,
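For reference, the standard two-parameter Beta density can be written as follows. The notation is assumed here for illustration: $D^{*}$ denotes the normalized distance, $d^{*}$ a value it takes, and $\alpha$, $\beta$ the two shape parameters.

```latex
f(d^{*};\alpha,\beta) =
\begin{cases}
\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,
(d^{*})^{\alpha-1}\,(1-d^{*})^{\beta-1}, & 0 \le d^{*} \le 1,\\[6pt]
0, & \text{otherwise,}
\end{cases}
\qquad \alpha > 0,\ \beta > 0 .
```

Setting $\alpha = \beta = 1$ gives $f(d^{*}) = 1$ on $[0,1]$, the Uniform density, while $\alpha < 1$ and $\beta < 1$ give a U-shaped density with mass concentrated near 0 and 1.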

The main aim of the modelling is the estimation of the parameters ^{*} is uniformly distributed in [0,1]”, that is equivalent to a Beta distribution with both ^{*} indicates that integrations land close to a TSS with higher probability (TSS ^{*} distribution (

Solid black line represents the case of Uniform distribution (_{1}:
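The shapes described in this caption can be reproduced numerically. The sketch below evaluates the standard Beta density on a small grid using only the Python standard library; the parameter values are illustrative choices (the U-shaped case uses values of roughly the magnitude reported in the results table), not the authors' plotted curves.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density at x, for 0 < x < 1 and shape parameters a, b > 0."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

grid = [0.05, 0.25, 0.5, 0.75, 0.95]
cases = [("uniform", 1.0, 1.0),       # flat: random integration
         ("U-shaped", 0.6, 0.6),      # mass near 0 and 1: close to a TSS
         ("bell-shaped", 2.0, 2.0)]   # mass in the middle: far from both TSSs
for label, a, b in cases:
    values = ", ".join(f"{beta_pdf(x, a, b):.2f}" for x in grid)
    print(f"{label:12s} a={a} b={b}: {values}")
```

Both parameters below 1 push probability toward the ends of the interval, which is the pattern expected when integrations land preferentially near a TSS.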

In summary, we can now redefine the null hypothesis of random distribution of IDs in terms of values of the parameters (

Method-of-Moments Estimates (MMEs) are also provided since it is well known that MMEs can be quickly and easily calculated (see
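A minimal sketch of the method-of-moments estimator for a Beta distribution, using the textbook formulas based on the sample mean and variance. The analyses in the paper were run in R; this Python version is only an illustration, and the function name and simulated sample are hypothetical.

```python
import random
from statistics import fmean, variance

def beta_mme(sample):
    """Method-of-moments estimates (alpha, beta) for data in (0, 1)."""
    m = fmean(sample)
    v = variance(sample)                  # unbiased sample variance
    if not 0 < v < m * (1 - m):
        raise ValueError("sample moments are incompatible with a Beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

# Simulated normalized distances with known parameters (illustrative only).
rng = random.Random(0)
sample = [rng.betavariate(0.6, 0.6) for _ in range(20_000)]
alpha_hat, beta_hat = beta_mme(sample)
print(round(alpha_hat, 2), round(beta_hat, 2))
```

Because the estimator is in closed form, it needs no iterative optimization, which is what makes it cheap to recompute inside a resampling loop.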

Goodness of fit is assessed by comparing the observed and fitted ID distributions with the Kolmogorov-Smirnov test. Confidence intervals (95%) are built from 50,000 bootstrap replications.
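A bootstrap confidence interval for a method-of-moments estimate might be sketched as follows. Several choices here are assumptions: the percentile method is used (the text does not say which bootstrap interval was computed), the number of replications is reduced to 2,000 to keep the example fast, and the 595 values are simulated rather than observed.

```python
import random

def alpha_mme(sample):
    """Method-of-moments estimate of the first Beta shape parameter."""
    n = len(sample)
    m = sum(sample) / n
    v = sum((x - m) ** 2 for x in sample) / (n - 1)
    return m * (m * (1 - m) / v - 1)

def bootstrap_ci(sample, estimator, n_boot=2_000, level=0.95, seed=0):
    """Percentile bootstrap interval for a scalar estimator (assumed method)."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and re-estimate on each replicate.
    stats = sorted(estimator(rng.choices(sample, k=n)) for _ in range(n_boot))
    lo_idx = round((1 - level) / 2 * n_boot)
    hi_idx = round((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

# Simulated stand-in for 595 normalized distances (not the real dataset).
rng = random.Random(42)
data = [rng.betavariate(0.57, 0.55) for _ in range(595)]
estimate = alpha_mme(data)
lower, upper = bootstrap_ci(data, alpha_mme)
print(f"alpha MME = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval lying entirely below 1 would be consistent with a U-shaped density, in line with the interpretation of the parameters given earlier.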

Statistical analyses were performed with the R statistical software (version 2.6.1).

We apply the testing procedure described in Equation 5 to a real experimental dataset. This includes 595 integrations retrieved from human hematopoietic stem/progenitor cells (CD34^{+} population) isolated from umbilical cord blood and infected in vitro with MLV-based retroviral vectors (RV and SIN-RV datasets in

In ^{*} in favour of the hypothesis that the TSS “attracts” integrations.

Goodness of fit was assessed by Kolmogorov Smirnov test (MME

MME (95% CI) | MLE | p-value
0.568 (0.502–0.646) | 0.599 | <0.0001
0.551 (0.488–0.623) | 0.592 | <0.0001

Tumorigenesis induced by slow-transforming retroviruses occurs by insertional activation or deregulation of cellular proto-oncogenes by viral LTRs. Recent observations from gene therapy trials and pre-clinical models have shown that MLV-derived retroviral vectors still retain this transforming ability, albeit to a lesser extent. This genotoxic risk is increased by the tendency of MLV to integrate near the TSS of host genes, where LTR transactivation can be more effective. For safety reasons, it is therefore crucial to understand the basis of retroviral integration site selection.

The goal of this paper is to provide a simple statistical tool to test whether integration data are distributed randomly over a mammalian genome, in particular with respect to the transcription start sites of genes surrounding integration events.

Our starting point is that integration distances generated in silico from a Uniform distribution show a bell-like shape as a consequence of different gene lengths and intergenic distances over the genome. Thus, when such a shape is observed, it cannot automatically be interpreted as evidence of

We propose a new method based on modelling the probability distribution function of IDs between two consecutive start sites. The normalized distance is assumed to follow a Beta distribution, both for statistical tractability and for suitability to the biomedical framework. This method differs from the commonly used simulation techniques in that it models the ID distribution fully parametrically, with no need for a computationally demanding procedure. A major advantage of the proposed approach over simulation procedures derives from the natural interpretation of the Beta parameters. As seen in

Estimation results derived from real experimental data show a U shape of the Beta distribution with a higher probability assigned to values in proximity of the TSS. Our statistical analysis confirms (also in human hematopoietic stem/progenitor cells) the preference of MLV-derived vectors to integrate in promoter-proximal regions, suggesting that the viral integrating machinery interacts preferentially with factors bound in the proximity of gene TSSs.

We gratefully acknowledge Fulvio Mavilio, Eugenio Montini, Alessandro Nonis, and Barbara Cassani for their help in understanding the subject matter of this paper.