Conceived and designed the experiments: UA AD MS CvK HG. Performed the experiments: AN RG CCB. Analyzed the data: UA AD AN RG CCB. Contributed reagents/materials/analysis tools: UA. Wrote the paper: UA AD MS HG CvK.
¶ These authors also contributed equally to this work.
The authors have declared that no competing interests exist.
Vectors based on γ-retroviruses or lentiviruses have been shown to stably express therapeutic transgenes and effectively cure different hematological diseases. Molecular follow-up of the insertional repertoire of gene-corrected cells in patients and preclinical animal models revealed different integration preferences in the host genome, including clusters of integrations in small genomic areas (CIS; common integration sites). The majority of these CIS were found in or near genes, with the potential to influence the clonal fate of the affected cell. To determine whether the observed degree of clustering is statistically compatible with an assumed standard model of the spatial distribution of integrants, we have developed various methods and computer programs for the analysis of γ-retroviral and lentiviral integration site distributions. In particular, we have devised and implemented mathematical and statistical approaches for comparing two experimental samples with different numbers of integration sites with respect to their propensity to form CIS, as well as for the analysis of coincidences of integration sites obtained from different blood compartments. The programs and statistical tools described here are available as workspaces in R code and allow the fast detection of excessive clustering of integration sites in any retrovirally transduced sample, and thus contribute to the assessment of potential treatment-related risks in preclinical and clinical retroviral gene therapy studies.
Various clinical gene therapy trials have been carried out demonstrating a clear benefit for many of the treated patients
Here, we describe methods and computer programs for the statistical analysis of the number of CIS as well as the number of IS involved in CIS. All computer programs referred to below were written in R code (cran.r-project.org). Technical details are provided in the Supporting Documents.
The following terminology will be used: A CIS of order n is defined as an n-tuple of IS such that the maximum distance between the elements is no greater than a fixed bound d_{n}, the window size used for defining the CIS. While in our examples with relatively small sample sizes we chose the window sizes for CIS definition (d_{2} = 30 kb, d_{3} = 50 kb, d_{4} = 100 kb, and d_{n} = 200 kb, for n>4) to be identical to those used in earlier investigations
Notation used in the sequel:
is: number of observed IS in the part of the genome under study
cis_{n}: number of CIS of order n
iscis_{n}: number of IS involved in CIS of order n
E(X): expected value of the random variable X
g: length of the genome or the part of the genome under study
TSS: transcriptional start sites
n_{TSS}: number of TSS in the particular part of the genome under study
I_{TSS}: interval(s) around a TSS possibly affected by preferential insertion of γ-retroviral vectors (the interval is assumed to be symmetric around the TSS)
w: half-width of the interval(s) I_{TSS}
p_{TSS}: proportion of IS allocated to the I_{TSS}
p_{pref}: proportion of the TSS affected by the preference
G, H: gene coding region and its complement (resp.) in the particular part of the genome under study
q_{G}, q_{H}: proportion of IS assumed to insert into gene coding regions and the complement (resp.)
This paper is concerned primarily with the
(1) Mathematical formulae for the expected value, along with assumptions regarding the distributional form of f_{cis,n}. We assumed a Poisson distribution, which is an approximation to (and a limiting distribution of) the binomial distribution in the case of rare events. Thus, the Poisson assumption does not concern the IS themselves but the number of CIS of order n. The approximation may be used if the probability that a random IS is part of a CIS is small (<5%).
(2) A more general and comprehensive approach relying on computer simulations of f_{IS}. In contrast to approach (1), this makes it possible to take into account the spatial structure of the genes or the TSS. With computer simulations, no parametric model (such as the Poisson distribution) for the distribution of the CIS is required.
Some explanations are in order to understand the scope of the analyses. If a fixed distribution f_{IS} is assumed, then the analysis will merely yield a conclusion about whether or not the observed number of CIS of order n, cis_{n}, is compatible with this assumption (compatibility being measured by the p-value). A small p-value indicates that the degree of clustering is stronger than implied by the model.
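To make the role of the p-value under the Poisson assumption concrete, the following minimal sketch computes the upper-tail probability P(X ≥ cis_{n}) for a Poisson variable with a given expected value. It is an illustration in Python, not the authors' R code; the function name and the example numbers in the usage note are hypothetical.

```python
import math


def poisson_upper_tail(k_obs: int, lam: float) -> float:
    """Return P(X >= k_obs) for X ~ Poisson(lam).

    Computed as 1 - P(X <= k_obs - 1) by summing the probability mass
    function recursively. Very small tail probabilities lose precision.
    """
    if k_obs <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for k in range(1, k_obs):
        term *= lam / k  # P(X = k) obtained from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    # Hypothetical numbers: 33 CIS observed, 20 expected under the assumed model.
    print(poisson_upper_tail(33, 20.0))
```

A small returned value would suggest more clustering than the assumed model implies; the expected value itself has to come from a formula or from simulations.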
As an alternative, a
Finally, in some of the methods developed for comparing the number of CIS observed in two studies with different numbers of IS, the distributional assumption for the IS is not directly used for calculating expected values, but is rather treated as a nuisance parameter which determines the (necessary) adjustment of the results of the comparison.
As for p-values, in many of our computer programs the user can select the direction of the statistical tests, namely, one-sided (upper tail or lower tail) or two-sided testing. Whenever H_{0} stipulates a uniform distribution, however, only one-sided testing is appropriate.
Whenever a p-value, p_{sim}, is based on computer simulations, it is only an estimate of the true p-value p (which is a probability). If, e.g., the test statistic is given by the number of observed CIS of order n, the (one-sided) p_{sim} is defined as the ratio of the number of simulation runs resulting in at least cis_{n} CIS of order n (i.e., the number observed in the experimental sample) to the total number of simulation runs, nsim. As pointed out by Li et al.
Another aspect is the multiplicity of tests. Most analyses generate more than one p-value, because different orders of CIS are analyzed and/or different distributional assumptions (corresponding to different null hypotheses) are made. Hence, in some situations issues of multiple testing arise. There are numerous methodological strategies for dealing with multiple testing, see, e.g., Hsu
While it is known that γ-retroviruses do not show a uniform integration pattern, analyses of this type may be of interest when it comes to lentiviruses, see below.
In Abel et al.
The resulting formula for CIS of order n = 5 is given in the Supporting Document
The formulae were implemented in elementary programs
Generally, the approximations involved in the formulae are excellent. However, while the formula-based approach allows a very quick, rough orientation, in many situations computer simulations will be more satisfactory. First, the formulae may be dubious if g is not considerably larger than the window size d_{n}. Second, no formulae have been derived for orders >5. Third, no formulae are available for the number of IS involved in overlapping CIS. It is only when the CIS of order n can be assumed to be extremely sparse, so that overlaps can be neglected, that this number is approximately equal to cis_{n}*n.
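The simulation alternative mentioned here can be sketched as follows for CIS of order 2 under a uniform distribution on a single interval of length g. This is a minimal Python illustration, not the authors' R programs; the function names are hypothetical, and it assumes that every pair of IS within the window size counts as one CIS of order 2.

```python
import random


def count_cis2(positions, d):
    """Count pairs of IS whose distance is at most d (CIS of order 2)."""
    xs = sorted(positions)
    pairs, left = 0, 0
    for right in range(len(xs)):
        while xs[right] - xs[left] > d:
            left += 1
        pairs += right - left  # all earlier IS within d of xs[right]
    return pairs


def simulate_uniform(n_is, g, d, nsim, cis_obs, seed=0):
    """Estimate E(cis_2) and the one-sided (upper-tail) p_sim under uniformity."""
    rng = random.Random(seed)
    counts = []
    for _ in range(nsim):
        sample = [rng.uniform(0, g) for _ in range(n_is)]
        counts.append(count_cis2(sample, d))
    mean = sum(counts) / nsim
    # p_sim: share of simulation runs with at least cis_obs CIS of order 2
    p_sim = sum(c >= cis_obs for c in counts) / nsim
    return mean, p_sim


if __name__ == "__main__":
    # Hypothetical input: 100 IS on 1 Mb, window 1 kb, 12 CIS observed.
    print(simulate_uniform(100, 1_000_000, 1_000, 2000, 12))
```

Because p_sim is a ratio of simulation counts, it is only an estimate and its resolution is limited by nsim.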
The term γ-retroviral distribution will be used to designate a distribution of the insertions which assumes that insertions occur preferentially in the vicinity of the TSS, but are uniformly distributed in the remainder of the genome
Mathematically, the γ-retroviral distribution is a parametric class of distributions, the parameters being the half-width w of the intervals I_{TSS}, the proportion p_{TSS}, and the proportion p_{pref} (see above). As is easy to see, the uniform distribution is a special case of this class.
To obtain mathematical formulae, it must be assumed that the preferential allocation of IS expressed by p_{TSS} and p_{pref} is independent of the particular location of the TSS.
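One way to picture this parametric class is a sampler that draws IS from it. The Python sketch below rests on several assumptions that are not taken from the text: each IS is allocated to the I_{TSS} independently with probability p_{TSS}, the TSS affected by the preference are a random subset of proportion p_{pref}, IS are uniform within an I_{TSS}, and the "remainder" is everything outside the preferred I_{TSS}. All names are hypothetical.

```python
import random


def sample_gamma_retroviral(n_is, g, tss, w, p_tss, p_pref, seed=0):
    """Draw n_is insertion sites on [0, g] with a preference for TSS intervals."""
    rng = random.Random(seed)
    # Random subset of TSS affected by the preference (independent of location)
    n_pref = round(p_pref * len(tss))
    preferred = rng.sample(tss, n_pref)

    def in_preferred(x):
        return any(abs(x - t) <= w for t in preferred)

    positions = []
    for _ in range(n_is):
        if preferred and rng.random() < p_tss:
            t = rng.choice(preferred)
            # Uniform within the interval I_TSS, clipped to the genome ends
            x = rng.uniform(max(0.0, t - w), min(g, t + w))
        else:
            # Uniform in the remainder; rejection sampling assumes that the
            # preferred intervals do not cover the whole genome
            x = rng.uniform(0, g)
            while in_preferred(x):
                x = rng.uniform(0, g)
        positions.append(x)
    return sorted(positions)


if __name__ == "__main__":
    demo = sample_gamma_retroviral(10, 10_000.0, [1000.0, 5000.0, 9000.0],
                                   100.0, 0.5, 1.0)
    print(demo)
```

With no preferred TSS (p_pref = 0) the sampler reduces to a uniform distribution on [0, g].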
In Abel et al.
Again, this approach (made available in the program
For the human genome, Wu et al.
However, the assumption underlying the formulae, namely that CIS arising from IS located in two different (e.g. overlapping) I_{TSS} are negligible, may be problematic. As can be easily calculated, this approximation is, indeed, justified (with w = 5 kb) if the TSS can be assumed to be uniformly distributed. In reality, however, the distribution of the TSS in the genome is far from uniform, but rather shows a marked clustering, which then, by virtue of the preferential allocation of IS, may increase the expected number of CIS beyond the values implied by the formulas if a high percentage of IS are located in the I_{TSS}.
This observation is highlighted by the positions of the first 15 TSS on chromosome 1:
In other words, in order to perform a well-founded analysis for γ-retroviral insertions, computer simulations are needed that take into account the
Lentiviruses are known to insert preferentially into the gene coding regions
If this assumption holds true, then statistical analyses that do not take into account the exact position and length of every single gene can be carried out by applying the methods developed for uniform distributions separately to the gene coding regions and their complement.
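The idea of applying the uniform methods separately to the two regions can be sketched as follows for CIS of order 2. This Python illustration treats G and H each as one connected interval and uses the exact probability that two independent uniform points on an interval of length L lie within d of each other, 1 − (1 − d/L)^2; it is not the authors' formula, which is given in the cited work, and all names and numbers are hypothetical.

```python
from math import comb


def expected_cis2_uniform(n_is, length, d):
    """Expected number of IS pairs within distance d for uniform IS on one interval."""
    if length <= 0 or n_is < 2:
        return 0.0
    p_pair = 1.0 - (1.0 - min(d, length) / length) ** 2
    return comb(n_is, 2) * p_pair


def expected_cis2_lentiviral(is_g, is_h, len_g, len_h, d):
    """Sum of the uniform expectations for the gene coding regions (G) and their
    complement (H); CIS formed by one IS from G and one from H are ignored."""
    return (expected_cis2_uniform(is_g, len_g, d)
            + expected_cis2_uniform(is_h, len_h, d))


if __name__ == "__main__":
    # Hypothetical scenario: 80 of 100 IS in G (30% of a 1 Gb genome), d = 30 kb.
    print(expected_cis2_lentiviral(80, 20, 3e8, 7e8, 3e4))
    print(expected_cis2_uniform(100, 1e9, 3e4))
```

Concentrating the IS in a smaller region raises the expected number of CIS relative to a uniform spread over the whole genome, which is why the choice of model matters.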
This approach was implemented in the program
Again, this formula-based approach is mainly meant for quick hypothetical model-based calculations (“scenarios”).
We consider a data set of lentiviral IS in dividing mouse cells (SC1 mouse fibroblasts and hematopoietic progenitor cells), analyzed by our group
The integration site analysis yielded 611 IS, forming a total of 33 CIS of order 2. Using the program
A caveat similar to that made for the γ-retroviral analysis also applies to lentiviruses: the formula-based analysis, which treats the gene-coding regions and their complement as connected intervals, may be questionable if the number of IS is so high that many CIS are formed by combinations of IS from G and H. A more appropriate analysis, taking into account the exact structure of the regions, is provided by the programs described in the next paragraph.
The basic methods and programs described above are cornerstones for more comprehensive analyses of IS data. Given a data set of IS locations, the analysis of CIS comprises at least the following steps:
1. Determine the number of CIS of order 2, 3, … (In our programs, the maximum order analyzed was n = 30.)
2. Determine the location and number of IS involved in CIS of order 2, 3, …
3. Compare these numbers with the expected values under a uniform distribution, the γ-retroviral distribution with preference for I_{TSS}, or a lentiviral distribution, as described above. That is, these distributions constitute the null hypotheses H_{0} to which the p-values refer.
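The first two steps can be sketched for a single chromosome as follows. This is a Python illustration rather than the authors' R code; it assumes that every n-tuple of IS whose span is at most d_{n} counts as one CIS of order n, and it uses the window sizes quoted earlier in the text. The function names are hypothetical.

```python
from math import comb


def window_size(n):
    """Window sizes in bp as quoted in the text: 30, 50, 100 kb, then 200 kb."""
    return {2: 30_000, 3: 50_000, 4: 100_000}.get(n, 200_000)


def count_cis(positions, n, d):
    """Number of n-tuples of IS whose maximum pairwise distance is at most d."""
    xs = sorted(positions)
    total, right = 0, 0
    for left in range(len(xs)):
        right = max(right, left)
        while right + 1 < len(xs) and xs[right + 1] - xs[left] <= d:
            right += 1
        # xs[left] is the leftmost member; choose the remaining n - 1 members
        # among the IS lying within d to its right
        total += comb(right - left, n - 1)
    return total


def is_in_cis(positions, n, d):
    """Sorted positions of the IS that belong to at least one CIS of order n."""
    xs = sorted(positions)
    involved = [False] * len(xs)
    for i in range(len(xs) - n + 1):
        if xs[i + n - 1] - xs[i] <= d:  # n consecutive IS fit into one window
            for j in range(i, i + n):
                involved[j] = True
    return [x for x, flag in zip(xs, involved) if flag]


if __name__ == "__main__":
    sites = [1_000, 12_000, 28_000, 500_000, 515_000, 9_000_000]
    for order in (2, 3):
        d = window_size(order)
        print(order, count_cis(sites, order, d), is_in_cis(sites, order, d))
```

A genome-wide count would repeat this per chromosome and sum the results, since IS on different chromosomes cannot form a CIS.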
All steps are performed both for each chromosome separately and genome-wide. For each distribution, two separate methods (denoted by the suffixes c and u, resp.) were implemented, representing an analysis conditioned on the observed numbers of IS on the chromosomes and the observed values of the model parameters, and an analysis without this conditioning, respectively. Additional technical remarks can be found in Supporting Document
The unconditional versions were mainly intended to test different hypothetical models. Therefore, the assumed model parameters (e.g., in case of lentiviral distributions: the proportion q_{G} of IS inserting in gene coding regions; in case of γ-retroviral distributions: the parameters p_{TSS}, p_{pref}) have to be furnished as program input. For chromosomes and those model features which are observable (this is not the case for p_{pref}, for which no straightforward method of estimation is available), the unconditional version of the programs then yields p-values of the chi-squared goodness-of-fit test for the IS. E.g., in case of the uniform distribution, it is tested whether the observed numbers of IS on the chromosomes differ from those expected under H_{0} (which are proportional to the lengths of the chromosomes). In case of γ-retroviral and lentiviral models, additional goodness-of-fit tests are carried out regarding the assumed values of the model parameters p_{TSS} and q_{G}, respectively.
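For the uniform case, the goodness-of-fit statistic described here can be sketched as follows. The Python snippet computes only the chi-squared statistic and its degrees of freedom; the p-value would then be read from the chi-squared distribution (for instance with pchisq in R). Names and numbers are hypothetical, and this is not the authors' implementation.

```python
def chisq_gof(observed, chrom_lengths):
    """Chi-squared goodness-of-fit statistic for IS counts per chromosome.

    Under H0 (uniform distribution) the expected counts are proportional
    to the chromosome lengths.
    """
    total_is = sum(observed)
    total_len = sum(chrom_lengths)
    expected = [total_is * length / total_len for length in chrom_lengths]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, expected


if __name__ == "__main__":
    # Hypothetical counts on three chromosomes of 200, 150 and 100 Mb.
    print(chisq_gof([50, 30, 10], [200e6, 150e6, 100e6]))
```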
The programs providing a conditional analysis are conditional both on the observed number of IS on the chromosomes and on the observable model parameters (p_{TSS} in case of γ-retroviruses and q_{G} in case of lentiviruses).
Thus, in all, the package comprises 6 programs carrying out steps 1 to 3:
All analyses require the specification of the species under investigation (rat, mouse, human). This determines the number and length of the chromosomes used in the analysis.
The γretroviral analysis (
For simplicity, the retroviral analysis assumes a uniform distribution of the IS within the I_{TSS}. As has been mentioned above, this special choice will hardly affect the number of CIS, given that the distribution inside the I_{TSS} plays a role only for CIS arising from overlapping I_{TSS}. Also, as before, it is assumed that the preferential allocation of IS is independent of the particular location of the TSS, so that random samples of the TSS can be drawn when modeling H_{0}. The structure of the programs
The analysis of lentiviruses requires the exact positions of all genes on the chromosomes (stored as a global matrix in the R workspace).
No separate counting of CIS is done for the union of the I_{TSS} in case of γ-retroviruses and for gene-coding regions or their complement in case of lentiviruses, because, as mentioned above, these regions are highly disconnected and composed of subintervals many of which are smaller than the defining window sizes for CIS.
The programs CISRETROc and CISRETROu give the expected numbers and p-values of CIS and of IS involved in CIS based on a γ-retroviral IS distribution, using Monte Carlo methods. Seven subprograms (two of them in a conditional and an unconditional variant) work together to produce the results: fp calculates p-values based on the simulated distribution of results; fvis generates uniformly distributed IS locations; ftssc and ftssu generate randomly distributed IS in the I_{TSS}; feval carries out the statistical analysis; compress compresses highly disconnected genomic regions produced when discarding the I_{TSS}; ciscount counts the CIS; Subsim_c and Subsim_u carry out the simulations and count the CIS for each simulation run.
As mentioned above, because of the overlap of the I_{TSS} (as is the case on human chromosome 1), the formula-based approach may be unsatisfactory when dealing with a large number of IS which are heavily concentrated in the I_{TSS}. To support this claim, we consider an example of 319 IS on chromosome 1 (a value found in one of our studies) and assume the extreme case that p_{TSS} = 1. If p_{pref} = 1, the mean value of CIS of order 2 obtained in 10,000 simulation runs, taking into account the length of the first chromosome (249,250,621 bp) and the exact location of the 2,135 TSS on this chromosome, was 50.4, compared to a formula-based expected value of 23.7.
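The order of magnitude of the formula-based value can be checked with a short calculation. The Python sketch below is one simple reading, not the authors' formula: it assumes that each IS falls into the I_{TSS} of a randomly chosen TSS, that two IS in the same I_{TSS} always form a CIS of order 2 (the interval width 2w = 10 kb, with w = 5 kb as mentioned earlier, is below d_{2} = 30 kb), and that CIS across different I_{TSS} are neglected.

```python
from math import comb


def expected_cis2_same_interval(n_is, n_tss):
    """Expected number of IS pairs sharing one I_TSS when all IS land in
    equally likely, non-overlapping TSS intervals."""
    return comb(n_is, 2) / n_tss


if __name__ == "__main__":
    # 319 IS and 2,135 TSS on human chromosome 1, as in the example above
    print(round(expected_cis2_same_interval(319, 2135), 2))
```

Under these assumptions the result is about 23.76, close to the quoted formula-based value; the gap to the simulated mean of 50.4 then reflects the CIS that arise across neighbouring or overlapping I_{TSS}.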
We applied the program
In many experiments it is necessary to compare the results (locations of vector integrations) from two vector integration studies, e.g., when the IS profiles of two different vectors used in clinical trials have to be determined. One aspect of interest is the inherent propensity of the IS of these vectors to form CIS. Often the patient material that can be used for integration site analysis is limited, so that it is not possible to obtain comparable amounts of DNA. Usually this implies that the numbers is_{1}, is_{2} of observed IS in the two samples will be different. The difficulty with such an unbalanced comparison is that the number of IS itself affects the expected number of CIS. Even with random uniform allocation this dependency is strong. Thus the question arises of how to eliminate the influence of the sample sizes of the IS on the comparison of the CIS.
We have taken two different approaches to this challenge. The first applies to the number of CIS only. It has a firmer theoretical foundation but depends explicitly on some assumptions regarding the distribution of the IS. The method exploits the general fact that if X_{1} and X_{2} are independent and have Poisson distributions with parameters (i.e., expected values) λ_{1} and λ_{2}, respectively, then the difference X_{1}–X_{2} follows a Skellam distribution with parameters λ_{1}, λ_{2}. (The Skellam distribution is available as a CRAN package in R.) In applications, the true expected values λ_{1}, λ_{2} are unknown. However, they can be calculated (either from a formula or from simulations) if a particular model for the distribution of the IS is assumed.
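The Skellam tail probability underlying this comparison can be sketched by direct convolution of the two Poisson distributions. The Python code below is an illustration, not the CRAN package mentioned above; the truncation rule for the sums is an assumption, and the expected values in the usage example are hypothetical.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), lam > 0, evaluated on the log scale."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def skellam_upper_tail(k, lam1, lam2):
    """P(X1 - X2 >= k) for independent X1 ~ Poisson(lam1), X2 ~ Poisson(lam2)."""
    lam_max = max(lam1, lam2)
    # Truncate both sums far beyond the bulk of the distributions
    n_max = int(lam_max + 12 * math.sqrt(lam_max) + 30)
    pmf1 = [poisson_pmf(i, lam1) for i in range(n_max + 1)]
    pmf2 = [poisson_pmf(j, lam2) for j in range(n_max + 1)]
    # tail1[i] = P(X1 >= i), up to the truncation point
    tail1 = [0.0] * (n_max + 2)
    for i in range(n_max, -1, -1):
        tail1[i] = tail1[i + 1] + pmf1[i]
    total = 0.0
    for j in range(n_max + 1):
        i_min = k + j  # X1 must reach at least k + j when X2 = j
        if i_min <= 0:
            total += pmf2[j] * tail1[0]
        elif i_min <= n_max:
            total += pmf2[j] * tail1[i_min]
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical: expected CIS numbers 40 and 25, observed difference 30.
    print(skellam_upper_tail(30, 40.0, 25.0))
```

One way to use it is to compare the observed difference cis_{1} − cis_{2} with the Skellam distribution whose parameters are the expected values under the assumed IS model.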
For the γretroviral model proposed by Wu et al.
The second approach (programs
The method has been implemented for the number/proportion of IS involved in CIS (for which no Poisson distribution can be assumed). Briefly, it proceeds as follows: Let IS_{1} and IS_{2} be the samples of is_{1} and is_{2} integration sites, respectively, and assume first that is_{1}≫is_{2}. Random samples of size is_{2} are drawn repeatedly (say, nsamp times) without replacement from IS_{1}, and for each of these samples the numbers of IS in CIS of different orders are counted. This yields simulated distributions of these numbers, with which the observed numbers of CIS in IS_{2} are then compared to obtain empirical pvalues.
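The subsampling procedure just described can be sketched for one chromosome and one CIS order as follows. This Python illustration is not the authors' program; the function names are hypothetical, both one-sided empirical p-values are returned, and no adjustment of the significance level is included.

```python
import random


def n_is_in_cis(positions, n, d):
    """Number of IS that belong to at least one CIS of order n (window d)."""
    xs = sorted(positions)
    involved = [False] * len(xs)
    for i in range(len(xs) - n + 1):
        if xs[i + n - 1] - xs[i] <= d:
            for j in range(i, i + n):
                involved[j] = True
    return sum(involved)


def subsample_pvalues(is1, is2, n, d, nsamp, seed=0):
    """Compare IS_2 with nsamp subsamples of size len(is2) drawn from IS_1.

    Subsamples are drawn without replacement; requires len(is1) >= len(is2).
    """
    rng = random.Random(seed)
    observed = n_is_in_cis(is2, n, d)
    sims = [n_is_in_cis(rng.sample(is1, len(is2)), n, d) for _ in range(nsamp)]
    p_lower = sum(s <= observed for s in sims) / nsamp  # IS_2 clusters less
    p_upper = sum(s >= observed for s in sims) / nsamp  # IS_2 clusters more
    return observed, p_lower, p_upper


if __name__ == "__main__":
    dense = list(range(0, 1000, 10))             # hypothetical clustered sample
    sparse = [i * 1_000_000 for i in range(20)]  # hypothetical spread-out sample
    print(subsample_pvalues(dense, sparse, 2, 30, 1000))
```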
If is_{1}≈is_{2} this method is unfeasible, however, because all random samples will become highly similar. A variant of the method can then be tried using nsamp random subsamples of identical size ≪min(is_{1}, is_{2}) from
We emphasize that  exceptionally  drawing
At first glance, since the samples of IS which are the basis for the calculation of pvalues are of identical size and only the sample distribution is used, the comparisons involved in this method appear to be neither affected by the differences in the sample sizes is_{1} and is_{2}, nor to depend on distributional assumptions for the IS. However, as extensive simulation studies have shown, this is not true. There is a dependence on various parameters conveyed by an inflation of the type I error, which, incidentally, is generally much higher in case of variant 2 than variant 1. This inflation is due to the fact that drawing (without replacement) from the samples of IS is not the same as drawing repeatedly from the theoretical parent distribution of the IS.
The inflation of the type I error means that for every concrete data analysis a simulation study must be carried out in order to determine how the nominal α-level needs to be adjusted.
To illustrate the application of the method and the α-adjustment, consider two real samples of 2,289 vs 1,152 γ-retroviral IS [Deichmann et al., unpublished results]. The samples contained 2,078 vs 161 CIS of order 2, which comprised 823 vs 236 (i.e., 35.9% vs 20.5%) of the IS. The empirical p-value for IS in CIS of order 2 produced by variant 1 (10,000 runs) was p = 0.0038, whereas variant 2 (10 pairs of samples of size is_{2}/2 = 576, 1,000 repetitions) yielded p = 0.0009. The simulation study analyzing the type I error for this situation and assuming a uniform distribution of the IS resulted in estimated real α levels of 10.2% and 21.0% for variants 1 and 2, respectively. Also, it was found that the nominal significance level would have needed to be lowered to 2.2% and 0.39%, respectively, to result in a real type I error of α = 5%. Note that the results of the comparison remained highly significant even after the adjustment.
In a recently carried out hematopoietic stem cell gene therapy of ALD in two patients
The statistical inference (expected values E(coinc) of the number of coincidences and p-values p for the observed number of coincidences) is carried out under the null hypothesis H_{0} that, if no contamination occurs, the IS locations in the two cell lines are represented by independent variables with lentiviral distributions as described above. Two situations were considered:
No contamination.
Contaminations do occur. It is assumed that the proportion of contaminated cells is the same for both cell types. The analysis takes the robust (worst case) stance that every IS in the contaminated part of the analyzed cells leads to a coincidence.
For the mathematical formulae and some technical details see the Supporting Document
Given that nearly all leading gene therapy studies use integrating viral vectors, there is a need for mathematical and statistical tools tailored for the analysis of viral integration sites. In this paper, we focus on methods and computer programs for the analysis of common integration sites (CIS), with applications both to γ-retroviruses and lentiviruses (which show different integration patterns).
Our methods and programs focus on the analysis of the
Starr et al.
We have devised formula-based approaches useful for a quick analysis, as well as simulation-based methods, which are appropriate for samples showing intensive clustering in specific regions and which take the exact genome-wide localization of the TSS (in case of γ-retroviruses) or of genes (in case of lentiviruses) into account.
An overview of the program package is given in
Program  Objective 
expected numbers of CIS and p-values for the observed numbers of CIS, assuming a uniform distribution of the IS
ditto, γ-retroviral IS distribution (CIS of order 2 only)
ditto, lentiviral IS distribution
coincidences of IS in two cell types without contaminations
coincidences of IS in two cell types with contaminations
comparison of the numbers of CIS from two experiments with different numbers of IS (with expected numbers given)
ditto for γ-retroviral IS distribution and unknown expected numbers (only CIS of order 2)

counting of CIS in a given set of IS locations
counting of IS involved in CIS
enumeration of the locations of IS involved in CIS
generation of uniformly distributed IS locations
statistical analysis of the results of simulation studies
subroutine used to compress highly disconnected genomic regions produced when discarding the I_{TSS}
generation of randomly distributed IS in the I_{TSS}

expected numbers and p-values for CIS and IS involved in CIS (expected numbers based on uniform IS distribution, p-values based on given total numbers)
ditto, using given IS locations; conditional analysis
ditto, using given IS locations; unconditional analysis
ditto, with expected numbers based on a γ-retroviral IS distribution; conditional analysis
ditto, with expected numbers based on a γ-retroviral IS distribution; unconditional analysis
ditto, with expected numbers based on a lentiviral IS distribution; conditional analysis
ditto, with expected numbers based on a lentiviral IS distribution; unconditional analysis
comparison of the numbers of CIS from two experiments with different numbers of IS (assuming uniform IS distribution), according to methods 1 and 2, resp. (see text)
For each IS distribution modeled in the simulations, two different methods of analysis were implemented: a
In the
While the unconditional analysis is useful for trying and assessing hypothetical models, conditioning, at least on the model parameters, is preferable in the analysis of real data, where estimates of these parameters are available. As for chromosomes, the considerations are different, because (in contrast to the parameters p_{TSS} or q_{G}) the proportions of IS on each chromosome are not among the parameters of the mathematical models. An analysis without conditioning on the chromosomes essentially treats the chromosomes as indistinguishable, except for characteristics specified in the IS distribution under H_{0}, e.g., the locations of gene coding regions or TSS. By contrast, conditioning on the chromosomes is appropriate if there is evidence (either biological or statistical) that further factors exist which are of little or no interest but differ across the chromosomes, affecting the number of IS and thus (indirectly) the expected number of CIS. In the conditional analysis, these chromosome-specific influences on the number of CIS are corrected for by taking them into account under H_{0}, i.e., in the simulated distribution of the IS.
Summarizing, the unconditional and conditional approaches differ in their assumptions, methods of analysis, and results (see
The comparison of the integration patterns, and in particular of CIS, in different clinical gene therapy studies necessitates an adjustment for different numbers of IS. We present two different methods of adjustment: a formula-based approach, which has a theoretical foundation but is sensitive to the assumed values of the input parameters, and a simulation-based approach, which is less limited in scope and does not rely on explicit distributional assumptions, but is somewhat heuristic.
Another challenge closely related to CIS analysis is the occurrence of coincidences of IS in different cell types. In many gene therapy studies such coincidences may help to understand which cell type was initially transduced and how the differentiation occurs. We have developed methods and computer programs comparing the observed number of coincidences with the number to be expected by chance alone, accommodating a certain level of contamination.
In our lab, the presented programs have been applied to various experimental samples and have proven helpful in assessing potential vector-induced side effects.