The authors have declared that no competing interests exist.

Current Address: Center for Soft Matter Research, Department of Physics, New York University, New York, United States of America

Chromosome conformation capture (3C) techniques have revealed many fascinating insights into the spatial organization of genomes. 3C methods typically provide information about chromosomal contacts in a large population of cells, which makes it difficult to draw conclusions about the three-dimensional organization of genomes in individual cells. Recently it became possible to study single cells with Hi-C, a genome-wide 3C variant, demonstrating a high cell-to-cell variability of genome organization. In principle, restraint-based modeling should allow us to infer the 3D structure of chromosomes from single-cell contact data, but suffers from the sparsity and low resolution of chromosomal contacts. To address these challenges, we adapt the Bayesian Inferential Structure Determination (ISD) framework, originally developed for NMR structure determination of proteins, to infer statistical ensembles of chromosome structures from single-cell data. Using ISD, we are able to compute structural error bars and estimate model parameters, thereby eliminating potential bias imposed by

Spatial interactions between distant genomic regions are of fundamental importance in gene regulation and other nuclear processes. Recent chromatin crosslinking (“Hi-C”) experiments probe the spatial organization of chromosomes on a genome-wide scale to an extent that was previously unattainable. These experiments report on contacting loci and thus provide information about the three-dimensional structure of the genome. Unfortunately, the data are noisy and do not determine the structure uniquely. There is also little quantitative prior knowledge about the large-scale organization of chromosomes. Here, we address these challenges by developing a Bayesian statistical approach that combines a minimalist polymer model with chromosome size measurements and conformation capture data. Our method generates statistical ensembles of chromosome structures from extremely sparse single-cell Hi-C data. We remove potential bias by learning modeling parameters from the experimental data and apply model comparison techniques to investigate which among a set of alternative models is most supported by the Hi-C data. Our method also allows for modeling with ambiguous contact data obtained on polyploid chromosomes, which is an important step towards three-dimensional modeling of whole genomes.

The rapid development of chromosome conformation capture techniques such as 3C [

Chromosome conformation capture experiments typically analyze populations of millions of cells, thereby only providing a population-averaged view. Recently, however, Nagano

Many structural insights such as the existence of TADs or the scaling behavior of contact probabilities with genomic distance can be found by analyzing genome-wide contact matrices. Nevertheless, it seems attractive to obtain a more direct view of the 3D architecture of genomes by structural modeling based on the experimental contact information. To compute representative 3D structures of genomes, various approaches have been explored.

There is a growing array of computational methods for calculating consensus structures from population Hi-C data. Typically, these methods first derive distances from the experimental contact frequencies by using different heuristics. In the early work by Duan

Two major challenges complicate the adaptation of methods for chromosome structure inference from population Hi-C to single-cell data. First, single-cell Hi-C measures only the formation of a contact rather than contact frequencies. Second, only a small subset of all chromosomal contacts is measured such that the contact information is very sparse. Therefore, specialized methods for the analysis of single-cell Hi-C contacts need to be developed. Multidimensional scaling (MDS) is a popular method to obtain three-dimensional structures from incomplete and noisy distance information and was already used in the first publication on chromosome conformation capture [

However, the application of optimization approaches such as SA or MDS to chromosome structure determination suffers from the same conceptual problems described by Rieping

To address these issues, Rieping

Here we report on the application of ISD to infer statistically well-defined ensembles of chromosome structures from single-cell Hi-C data. We show that Markov chain Monte Carlo (MCMC) sampling allows us to compute diverse ensembles of coarse-grained chromosome conformations that reflect the sparsity of single-cell Hi-C contacts. MCMC techniques and the flexibility of our Bayesian approach also allow us to compare different models of the chromatin fiber as well as alternative models for Hi-C contacts. We use the conformational ensembles to map epigenetic marks into three-dimensional space. Furthermore, we demonstrate that ISD outperforms alternative methods on simulated data. Finally, we show how to extend the approach to diploid chromosomes and infer the structures of two chromosome copies simultaneously.

We model the chromatin fiber with a beads-on-a-string representation. Owing to the sparsity of single-cell Hi-C contacts, we use a highly coarse-grained model in which every bead represents 500 kb of chromatin and has a radius of approximately 215 nm. Beads are connected such that they form a linear chain. The connectivity is enforced by a harmonic backbone potential, which penalizes distances between consecutive beads as soon as they exceed the bead diameter

We first studied the properties of our model for the single-copy X chromosome of male mouse, which measures 166 Mb in length, and which we represent using 333 beads. We generated structures from the prior distribution and reconstructed the distribution of the radius of gyration R_{g} as a measure for the compactness of the fiber. Because there are no or only weak attractive interactions between the beads, the vast majority of structures generated from the prior showed an extended conformation. R_{g} value. There is a strong entropic force that pushes the fiber into an extended state characterized by a large radius of gyration. For the repulsive excluded volume term we found R_{g} = 11.2 ± 1.1 μm; for the Lennard-Jones potential we have R_{g} = 8.6 ± 1.3 μm. Due to the attractive contribution in the excluded volume term, the Lennard-Jones potential shows a higher preference for compact structures. Yet for both potentials, the fraction of compact structures with R_{g} smaller than 2 μm, say, is vanishingly small: 1.8 × 10^{−24} for the quartic repulsion potential and 7.8 × 10^{−21} for the Lennard-Jones term.
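As a minimal sketch of the compactness measure used above, the radius of gyration of a bead chain can be computed as the root-mean-square distance of the beads from their centroid. The function name and the N × 3 coordinate layout are illustrative assumptions rather than the original implementation, and all beads are assumed to carry equal weight.

```python
import numpy as np

def radius_of_gyration(coords):
    """Root-mean-square distance of equally weighted beads from their centroid."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())

# Toy example (hypothetical): a fully stretched chain of 333 beads,
# spaced one bead diameter (2 x 215 nm = 0.43 micrometres) apart.
diameter = 0.43
chain = np.zeros((333, 3))
chain[:, 0] = diameter * np.arange(333)
print(radius_of_gyration(chain))
```

For such a straight rod the result follows the closed form a·sqrt((N² − 1)/12) with spacing a and N beads, which gives a quick sanity check of the routine.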

The preference of the prior probability for extended structures with large average radii of gyration is incompatible with the fact that chromosomes localize in chromosome territories, i.e. relatively compact subcompartments of the nucleus [R_{g}. Using this approximation, the experimental chromosome size measurement of 3.7 ± 0.3 μm obtained with X-chromosome paint FISH [R_{g} ≈ 1.43 ± 0.12 μm. Therefore, the average radii of gyration reported above are an order of magnitude too large compared to the experimental finding.

To incorporate the information from FISH into our probabilistic chromosome model and thereby inform the prior probability about the expected chromosome size, we assume a Gaussian error model for the chromosome size measurements. Based on our approximate relation between chromosome size and R_{g}, this term corresponds to a harmonic radius of gyration restraint with an experimental R_{g} value of 1.43 ± 0.12 μm.
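The following sketch illustrates how a Gaussian error model on the chromosome size translates into a harmonic restraint on the radius of gyration. The conversion factor is an assumption: the quoted numbers (3.7 ± 0.3 μm versus 1.43 ± 0.12 μm) are consistent with treating the territory as a uniformly filled sphere, for which the diameter equals 2·sqrt(5/3) times the radius of gyration. The function names are hypothetical.

```python
import numpy as np

# Assumed relation (uniformly filled sphere): diameter = 2 * sqrt(5/3) * R_g
SIZE_PER_RG = 2.0 * np.sqrt(5.0 / 3.0)

def size_to_rg(size, error):
    """Convert a chromosome size measurement and its error into R_g units."""
    return size / SIZE_PER_RG, error / SIZE_PER_RG

def fish_energy(rg, rg_exp, sigma):
    """Negative log of a Gaussian error model, up to an additive constant."""
    return 0.5 * ((rg - rg_exp) / sigma) ** 2

rg_exp, sigma = size_to_rg(3.7, 0.3)      # FISH measurement in micrometres
print(round(rg_exp, 2), round(sigma, 2))  # 1.43 0.12
print(fish_energy(11.2, rg_exp, sigma))   # extended prior structures are penalized heavily
```

Under this assumption the restraint is quadratic in the deviation of the radius of gyration from its experimental value, with a width set by the FISH measurement error.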

By using probability calculus, it is possible to combine single-cell Hi-C contacts with the FISH data and our chromosome models (see

We used a logistic model to quantify the probability that a _{c} = 1.5 × _{c} has a probability smaller than 10^{−6}.

We generated ensembles of X-chromosome structures for both excluded volume potentials, both without and with the additional model for FISH data introduced in the previous section. To sample chromosome conformations we used Hamiltonian Monte Carlo (HMC) [

As a first validation of our inference approach we studied whether all experimental contacts can be satisfied in the 3D model of the X chromosome.

_{c}.

As a further validation we analyzed the average pairwise distance matrix computed from the sampled X-chromosome structures.

Nevertheless, we can use Bayesian model comparison to study whether the data show any preference for one of the prior probabilities. To do so, we estimated the model evidence (also known as marginal likelihood) from the MCMC simulations. The model evidence quantifies how likely a probabilistic model is in the light of the data.

R_{g} of 1.32 μm. Due to the strong forces exerted by the logistic contact restraints, the ensemble is slightly more compact than suggested by the FISH data. When incorporating the FISH term, the average R_{g} shifts towards larger values with an average of 1.38 μm, corresponding to a chromosome size of ∼3.6 μm. The FISH data do not compromise the fit with the contact restraints: the number of violations does not change upon incorporation of the R_{g} model (see

By looking at the variance of the pairwise distances in the structure ensembles (

Many approaches to infer chromosome conformation from Hi-C data resort to modeling based on distance restraints. To this end, pairwise distances need to be derived from the experimental contact information. For example, the contact frequencies measured in ensemble Hi-C experiments were converted to distances by assuming a power law that relates the contact probability to the inter-bead distance, which is motivated by results from polymer physics (see e.g. [

To interpret the observation of a single-cell Hi-C contact as a distance measurement, we introduce an unknown distance _{ij}(_{ij}(_{ij} counts how often beads _{ij} ranges from 1 to 3. In our approach, the repeated occurrence of a contact (_{ij} > 1) does not lead to a shortening of the contact distance, but rather to an enforcement of the distance restraint, which is duplicated _{ij} times.

Due to experimental errors and shortcomings of our model, we have to account for discrepancies between the unknown experimental distance and the model distances. This is achieved by introducing a probabilistic model for the distribution of the discrepancy between _{ij}(^{2} can be interpreted as the weight of the distance restraint potential [

To estimate the experimental distance using ISD, we rewrite _{ij}(_{c} assumed in the logistic contact model. The distances involved in an experimental contact (shown in

^{−4} to 1 in a replica exchange simulation to facilitate sampling of the posterior distribution).

Using Bayesian model comparison, we can also answer which of the two error models, Gaussian with a flat plateau or lognormal model, is preferred by the experimental data.

By running calculations in which we varied the error parameter

Our results show that it is possible, in principle, to model single-cell Hi-C contacts as distance measurements. However, we will use the logistic contact model in the remainder of this article, because single-cell Hi-C observes binary contacts rather than continuous distances.

We now take a closer look at the ensembles generated with the ISD approach and compare them to the published ensemble by Nagano

Cluster analysis reveals that the ISD ensemble comprises multiple principal conformations about which the structures fluctuate. Closer inspection shows that the cluster centers are partial mirror images of each other. None of the likelihood and prior factors contributing to the posterior distribution distinguishes between a particular bead configuration and its mirror image, because all factors depend on distances only. Moreover, since there are only few contacts between the super-domains, each super-domain can show two conformations which are mirror images of each other. This results in at least four possible chromosome conformations which all achieve a similar goodness of fit of the Hi-C contacts.

Our cluster analysis finds that the eight most dominating structural clusters produced by ISD cover ∼90% of all states sampled from the posterior (see

Visual inspection of the structural clusters suggests that the overall variability in the ISD ensemble is quite high and comparable to the fluctuations in the ensemble by Nagano

For each cluster, we studied the local variability of the beads by using standard techniques for the analysis of NMR structure ensembles. We estimated the local precision of the bead positions by the root mean square fluctuation (RMSF) after superposition of the cluster members onto the cluster center.
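A minimal sketch of this analysis is given below, assuming that each cluster member is an N × 3 array of bead positions. The superposition uses the standard Kabsch algorithm, and the fluctuation is measured about the cluster center, as described above. The function names are illustrative and not taken from the original software.

```python
import numpy as np

def superimpose(mobile, target):
    """Optimally rotate and translate `mobile` onto `target` (Kabsch algorithm)."""
    mobile_c = mobile - mobile.mean(axis=0)
    target_c = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(mobile_c.T @ target_c)
    # Flip one axis if necessary so that the transform is a proper rotation
    # (mirror images are deliberately not superimposed).
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    return mobile_c @ rotation + target.mean(axis=0)

def rmsf(members, center):
    """Per-bead root mean square fluctuation about `center` after superposition."""
    fitted = np.array([superimpose(x, center) for x in members])
    sq_dev = ((fitted - center) ** 2).sum(axis=2)   # shape (n_members, n_beads)
    return np.sqrt(sq_dev.mean(axis=0))
```

Beads with a small RMSF are well defined within the cluster, whereas large values flag regions that the contacts constrain only weakly.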

We also ran ISD simulations on contact data from 5 additional Th1 cells. The average distance matrices show that the ensembles differ significantly, reflecting the cell-to-cell variability of chromosome conformations found by Nagano

By comparing the average distance matrices (

A problem with current Hi-C based chromosome modeling is that it is difficult to validate the calculated structures. However, there are some independent sources of information, not used during modeling, that should be consistent with a meaningful structure ensemble. One is the information provided by population Hi-C. Although population Hi-C looks at a large pool of cells, the information about the

Based on the inferred structure ensemble it becomes possible to generate three-dimensional maps of genomic and epigenetic features and to correlate the features spatially.

Due to the sparsity of the chromosomal contacts, we have used a very coarse-grained representation of the chromatin fiber. At this resolution, we can only study large-scale chromosomal organization. Higher resolution representations are typically needed to gain biologically relevant insights into 3D chromosome organization. We therefore also applied the ISD approach using a ten-fold higher resolved chromatin fiber. Each bead now represents 50 kb of chromatin, thereby matching the finest resolution used by Nagano

At 50 kb resolution, we represent the X chromosome with 3330 spherical beads. We generated structure ensembles based on the Lennard-Jones volume exclusion term and the additional FISH restraint. We modeled the intra-chromosomal contacts with the logistic model; in contrast to Nagano

To compare the overall properties of the structure ensembles generated at 500 kb and 50 kb resolution, we downsampled the distance matrices from the 50 kb models to match the resolution of the coarse-grained models. Downsampling was achieved by averaging 10 × 10 patches of the average distance matrix at 50 kb resolution. The downsampled average distance matrix and the average distance matrix of the low-resolution model show a correlation of 86%, indicating that the overall shape of the X chromosome at both levels of resolution is similar (see also
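The downsampling step can be sketched as follows. Averaging non-overlapping 10 × 10 patches reduces a 3330 × 3330 matrix to 333 × 333. Restricting the correlation to the off-diagonal upper triangle is an assumption made here for illustration, as are the function names.

```python
import numpy as np

def downsample(dist, factor=10):
    """Average non-overlapping factor x factor patches of a square matrix."""
    n = dist.shape[0] // factor
    trimmed = dist[:n * factor, :n * factor]
    return trimmed.reshape(n, factor, n, factor).mean(axis=(1, 3))

def matrix_correlation(a, b):
    """Pearson correlation over the upper triangles (diagonal excluded)."""
    iu = np.triu_indices_from(a, k=1)
    return np.corrcoef(a[iu], b[iu])[0, 1]

# Tiny example: a 4 x 4 matrix reduced with 2 x 2 patches
small = np.arange(16, dtype=float).reshape(4, 4)
print(downsample(small, factor=2))   # [[ 2.5  4.5] [10.5 12.5]]
```

With the two average distance matrices at hand, `matrix_correlation(downsample(high_res), low_res)` would yield the kind of similarity score quoted above; `high_res` and `low_res` are placeholder names.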

^{−3/2} ideal chain/equilibrium globule, ^{−1} fractal globule.

^{−1} dependence of the contact frequency, in contrast to the equilibrium globule which is expected to show a scaling behavior of ^{−3/2} until it reaches a constant value. Our 50 kb ensemble shows a mixed packing supporting a more complicated packing than the fractal globule also in single cells.

Because Th1 cells of male mice are diploid, most

A reason for the failure of the naive approach is that it tries to assign contacts to either of the two copies similar to an ^{−6} averaging). In case of NMR structure calculation, this type of averaging can be motivated by the physical nature of the NMR signal. But we can also use it in different applications to implement some kind of

To model two-copy chromosomes with ISD, we use two independent chromatin fibers and describe the observation of Hi-C contacts using a logistic model. However, in contrast to the single-copy approach, the logistic function is evaluated for the ADR computed by ^{−6} averaging of the corresponding distance in each copy. We only introduce intra-fiber distance restraints, although contacts could in principle arise also from inter-fiber contacts between the two copies. We exclude this possibility and assume that homologous chromosomes segregate into distinct territories that have no physical contact [_{c}, all contacts can be satisfied in only one of the two copies or in both. There is a cluster of contacts that are found in both structures (

Our model of the two-copy chromosome 1 suggests that homologous chromosomes exhibit a degree of structural variability within the same cell that is similar to the cell-to-cell variability across different cells. With a correlation coefficient of 51%, the difference between the distance matrices of both chromosomal copies is similar to the difference between structures of the X chromosome from different cells (see

This article introduces a Bayesian probabilistic approach to infer the three-dimensional structure of chromosomes from sparse 3C contacts measured in single cells. Our approach builds on the ISD framework, which was originally developed for NMR structure determination of proteins. It is based on a posterior probability distribution over the space of chromosome conformations and model parameters. The posterior probability distribution integrates the information from single-cell Hi-C contacts and FISH measurements with a coarse-grained model of the chromatin fiber. We demonstrate the strengths of the ISD approach for single-cell Hi-C measurements on Th1 cells of male mouse [

Using Markov chain Monte Carlo algorithms, we can generate statistically valid structure ensembles that represent the posterior probability. The ensembles can be used to compute a local error bar for the bead positions or the distribution of other structural parameters such as the radius of gyration, and they indicate which chromosomal regions are well-defined by the input data. Along with the chromosome structure, ISD also estimates model parameters such as the distance between two contacting loci.

The Bayesian framework allows us to not only estimate bead positions and model parameters but also to compare alternative descriptions of the chromosome fiber. Here we compared two different volume exclusion potentials, a purely repulsive potential and a potential with both attractive and repulsive contributions, and showed that the latter is preferred by the experimental data. Our approach can also quantify the information content of individual data sets. We showed that the incorporation of FISH data helps to define the chromosome conformation. Bayesian model comparison can also be used to select the best among alternative models of the experimental contacts in a data-driven fashion. This is demonstrated here for two distance-based restraint potentials.

All of these findings are shown in detail for the single-copy X chromosome at 500 kb and 50 kb resolution as well as demonstrated on synthetic data. It is also possible to analyze contact data for diploid chromosomes. By using ambiguous distance restraints, again a concept from protein NMR, we can also infer the structures of both chromosome copies simultaneously and thereby disentangle intra-chromosomal contacts that stem from either of the two copies. These features are beyond the scope of previous methods such as restraint-based modeling [

There are many possibilities for future applications and extensions of ISD to chromosome conformation capture data. An important aspect will concern the implementation and testing of more elaborate models of the chromatin fiber. Here we worked with a homopolymer model according to which the fiber is composed of equally sized beads that all interact by means of a single volume exclusion potential. This model ignores many known properties of chromosomes such as the existence of hetero- and euchromatin. Nucleosome positioning and other epigenetic data could be used to define a more realistic model of the fiber. Moreover, it might be required to introduce additional prior terms that control the persistence length and local flexibility. More realistic chromosome models have been proposed such as the Strings and Binders Switch (SBS) model [

Here we focused on the use of single-cell Hi-C data. It is straightforward to also incorporate contacts from ensemble Hi-C experiments into our Bayesian modeling framework. One possibility would be to model scarcely populated contacts in the ensemble Hi-C matrix as “anti-contacts”, although the absence of contacts has to be interpreted with caution [

In conclusion, our work shows that ISD provides a statistically sound and viable alternative to restraint minimization or embedding approaches for single-cell Hi-C data and produces less biased and statistically valid ensembles of chromosome conformations.

The Inferential Structure Determination (ISD) approach [

The likelihood Pr(

We represent a chromosome by ^{3} [^{3}. Therefore the radius of a bead is approximately 215 nm. At a resolution of 50 kb, the bead radius is approximately 100 nm.

The distance _{i, i+1} between two consecutive beads _{bb} = 250/^{2}. Overlaps between two beads are penalized using an excluded volume term _{nb} with force constant _{nb}. The conformational prior distribution defined over bead positions _{1}, …, _{N}) is given by the canonical ensemble:
_{bb}, etc. Here and in

Here we study two nonbonded force fields to account for volume exclusion effects. The first force field consists only of a repulsive term that is activated as soon as two beads come closer than their diameter _{nb} = 5/^{4}. The second excluded volume term is a Lennard-Jones potential with linear asymptotes:
_{cut} = 1.375 ×
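The backbone term and the purely repulsive excluded volume term can be sketched as below. This is an illustration under stated assumptions, not the original force field: lengths are expressed in units of the bead radius (so the diameter is 2), the force constants 250 and 5 are used in these reduced units, the factor 1/2 in the harmonic term is a convention chosen here, and the repulsion is applied to all bead pairs. The Lennard-Jones variant with linear asymptotes is not reproduced.

```python
import numpy as np

def backbone_energy(coords, diameter=2.0, k_bb=250.0):
    """Harmonic penalty once consecutive beads are further apart than `diameter`."""
    d = np.linalg.norm(np.diff(coords, axis=0), axis=1)
    excess = np.clip(d - diameter, 0.0, None)
    return 0.5 * k_bb * (excess ** 2).sum()

def repulsion_energy(coords, diameter=2.0, k_nb=5.0):
    """Quartic penalty for every pair of beads closer than `diameter`."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.linalg.norm(diff, axis=-1)
    i, j = np.triu_indices(len(coords), k=1)
    overlap = np.clip(diameter - d[i, j], 0.0, None)
    return k_nb * (overlap ** 4).sum()

def log_prior(coords, beta=1.0):
    """Unnormalized log-density of the canonical ensemble exp(-beta * E)."""
    return -beta * (backbone_energy(coords) + repulsion_energy(coords))
```

Both terms vanish for a chain whose consecutive beads just touch and whose non-neighbouring beads do not overlap, so the prior is flat over a large set of extended conformations.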

FISH experiments measure the size of chromosome territories and show that the chromatin fiber adopts a compact conformation during interphase. To incorporate FISH data into our model, we first need to measure the size of a chromosome given the bead positions

Let us now assume that FISH experiments result in a list of chromosome size measurements _{1}, …, _{M}. We model each observation using a Gaussian error model such that the probability of observing the _{FISH} is the error of the FISH measurements. The complete probability of all FISH measurements is:

We studied three probabilistic models to measure the compatibility of a chromosome structure _{ij} denotes the corresponding number of counts in the binned contact matrix. Distance / contact restraints between beads _{ij} > 1 are duplicated _{ij} times.

The distance-based likelihoods assume that the observation of a single-cell Hi-C contact between loci _{ij}(_{ij}(_{i} − _{j}‖ is the Euclidean distance between the corresponding beads. Since

To do this, we take the unknown distance _{ij}(

An error model accounts for deviations from the ideal relationship _{ij}. These deviations are captured with a probability distribution _{ij}(

The second error model is a lognormal distribution, which was introduced by Rieping

As an alternative to distance-based modeling, approaches directly based on the contact information have been proposed [_{c}, the corresponding loci are in contact; we set _{c} = 1.5 ×

Our model is reminiscent of logistic regression, a technique to fit linear models to binary data arising in classification problems. Here, the two classes correspond to contacts and non-contacts between two loci. However, in contrast to standard applications of logistic regression, only the first of the two classes is observed, because single-cell Hi-C gives us only information about contacting loci; the absence of a contact cannot be interpreted as the observation of a non-contact. With this degenerate data set, the regression parameters themselves remain undefined. Therefore, we have to fix the regression parameters _{c}. One possibility to estimate these parameters would be to combine single-cell with ensemble Hi-C data: the absence of ensemble Hi-C counts for pairs of mappable loci could, with some caution [
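A logistic contact likelihood of this kind can be sketched as follows. The steepness `alpha` and the contact distance `d_c` are passed in as fixed parameters, in line with the observation that they cannot be estimated from contacts alone; their values in the example are arbitrary placeholders, and the function name is hypothetical.

```python
import numpy as np

def log_contact_likelihood(coords, contacts, d_c, alpha):
    """Sum of log-probabilities of the observed contacts under a logistic model.

    contacts: iterable of (i, j, n_ij); a contact seen n_ij times counts n_ij times.
    """
    total = 0.0
    for i, j, n_ij in contacts:
        d = np.linalg.norm(coords[i] - coords[j])
        # log of 1 / (1 + exp(alpha * (d - d_c))), evaluated in a stable way
        total += -n_ij * np.logaddexp(0.0, alpha * (d - d_c))
    return total

coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
print(log_contact_likelihood(coords, [(0, 1, 1)], d_c=1.5, alpha=10.0))  # equals -log(2)
print(log_contact_likelihood(coords, [(0, 2, 1)], d_c=1.5, alpha=10.0))  # strongly negative
```

Only observed contacts enter the sum; pairs without a contact contribute nothing, which mirrors the one-class nature of the data discussed above.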

When modeling the structure of diploid chromosomes, an observed single-cell Hi-C contact is ambiguous in the sense that it could be formed in either of the two copies. To deal with this ambiguity, we adopt an idea from NMR which is commonly used to work with distances that are ambiguous due to chemical shift degeneracy. Ambiguous distance restraints (ADR) compute an effective distance resulting from all alternative possibilities [_{1} and _{2}. The effective distance between beads ^{−6}-average:
_{ij}(_{k}) is the distance between loci _{ij}(_{1}) and _{ij}(_{2}) are smaller than the cutoff _{c}, the effective distance _{c} and the contact will be violated.
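The r^{−6} averaging over the two copies can be written compactly. The sketch below assumes the standard ADR form, (d_1^{−6} + d_2^{−6})^{−1/6}; the function and argument names are illustrative.

```python
import numpy as np

def effective_distance(d_copy1, d_copy2):
    """r^-6 average of the distances in the two copies: (d1^-6 + d2^-6)^(-1/6)."""
    d = np.asarray([d_copy1, d_copy2], dtype=float)
    return (d ** -6).sum(axis=0) ** (-1.0 / 6.0)

print(effective_distance(1.0, 50.0))   # dominated by the shorter distance
print(effective_distance(2.0, 2.0))    # 2 * 2**(-1/6), slightly below 2
```

Because the shorter of the two distances dominates the average, a contact is satisfied as soon as it is formed in one copy, without having to decide in advance which copy that is.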

For single-cell Hi-C data of diploid chromosomes it makes sense to use an _{c}. This is demonstrated by structures of the X chromosome calculated with contacts from six different cells.

As a first test, we ran calculations with distance data. For both protein structures, we used all distances smaller than 10 Å as input, resulting in 547 distances for ubiquitin and 504 distances for the HRDC domain. ISD calculations for the individual proteins showed that this incomplete distance information is sufficient to obtain a highly accurate structure model. For ubiquitin, the mean structure achieved an RMSD of 0.6 Å to the correct structure. For the HRDC domain, we obtained a model of even higher accuracy with an RMSD of 0.4 Å.

Next we combined both sets of distances and tried to reconstruct the structure of ubiquitin and the HRDC domain simultaneously by the use of ambiguous distance restraints. Indeed, ISD was able to disentangle the distances and to obtain fairly accurate models of ubiquitin and the HRDC domain.

First, we ran ISD on the individual contact data. In both cases, we obtained ensembles that were close to the correct structure, but which also contained mirror images. The RMSD of the average structures is 2.2 Å (ubiquitin) and 1.9 Å (HRDC domain). To test if the use of ADRs is capable of disentangling mixed contact data, we combined the contacts from ubiquitin and the HRDC domain into a single set of 636 contacts, which we modeled in the same way as the diploid chromosome data.

Because the full posterior probability Pr(_{1}, _{2},

An essential step in simulating Pr(_{n}, which influence each other in a non-trivial way.

If there are multiple structures that are consistent with the distance restraints and the prior information, the conditional posterior distribution Pr(

A large class of methods for sampling from general probability distributions are Markov Chain Monte Carlo (MCMC) algorithms, which construct a Markov chain whose unique limiting distribution is the distribution one aims to simulate. Simple MCMC methods such as the Metropolis-Hastings algorithm [

Three parameters have to be set when using HMC for sampling. First, since HMC introduces fictitious momenta for each conformational degree of freedom, a mass matrix has to be specified. Second, the equations of motion solved during the short MD simulation are usually not analytically integrable and have to be solved by numerical integration schemes that require a discretization timestep. Third, the length of the MD trajectory needs to be chosen.
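The three choices can be made concrete in a generic HMC transition. The sketch below is a textbook version, not the ISD implementation: it assumes a unit mass matrix, uses leapfrog integration, and exposes the timestep and trajectory length as `step_size` and `n_steps`; all names are illustrative.

```python
import numpy as np

def hmc_step(x, log_prob, grad_log_prob, step_size, n_steps, rng):
    """One HMC transition with unit mass matrix and leapfrog integration."""
    p = rng.standard_normal(x.shape)              # fictitious momenta
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_prob(x_new)   # initial half step
    for step in range(n_steps):
        x_new += step_size * p_new
        if step < n_steps - 1:
            p_new += step_size * grad_log_prob(x_new)
    p_new += 0.5 * step_size * grad_log_prob(x_new)   # final half step
    h_old = -log_prob(x) + 0.5 * (p ** 2).sum()
    h_new = -log_prob(x_new) + 0.5 * (p_new ** 2).sum()
    if np.log(rng.uniform()) < h_old - h_new:     # Metropolis acceptance test
        return x_new, True
    return x, False

# Usage on a toy target (standard normal in three dimensions)
rng = np.random.default_rng(0)
x = np.zeros(3)
for _ in range(100):
    x, accepted = hmc_step(x, lambda q: -0.5 * (q ** 2).sum(), lambda q: -q,
                           step_size=0.2, n_steps=10, rng=rng)
```

A smaller timestep raises the acceptance rate at the cost of more gradient evaluations per proposal, which is the trade-off that tuning has to balance.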

Rules-of-thumb to set these parameters exist [

^{−2} is a Gamma distribution for which sampling routines are readily available in many programming languages. In case of the Gaussian error model with a flat plateau, we use the Metropolis-Hastings algorithm to generate samples of
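A Gibbs update of this type can be sketched with the textbook conjugate result for a Gaussian error model: given n residuals and a Jeffreys prior on the error, the precision follows a Gamma distribution with shape n/2 and rate χ²/2. The choice of prior and the function name are assumptions made for illustration; for a lognormal model the residuals would be differences of log-distances.

```python
import numpy as np

def sample_sigma(residuals, rng):
    """Draw the error sigma from its conditional posterior via the precision."""
    residuals = np.asarray(residuals, dtype=float)
    n = residuals.size
    chi2 = (residuals ** 2).sum()
    # NumPy parameterizes the Gamma distribution by shape and scale = 1 / rate
    precision = rng.gamma(shape=0.5 * n, scale=2.0 / chi2)
    return 1.0 / np.sqrt(precision)

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 0.5, size=5000)   # synthetic discrepancies
print(sample_sigma(residuals, rng))            # close to 0.5
```

Alternating such a draw with HMC updates of the bead positions gives a Gibbs sampling scheme over structures and error parameter.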

Because the Gibbs sampler (^{−1} that facilitates posterior sampling. A family of tempered posteriors is introduced:

The normalizing constant Pr(

We used similar techniques to study the properties of the chromosome model (see Chromosome models). Because the prior alone prefers extended chromosome conformations, we imposed an additional radius of gyration term whose strength is varied smoothly via the inverse temperature λ of the replicas:
_{g}. The entropy measures the number of conformations with a certain radius of gyration and is shown in

Structure ensembles reconstructed from single-cell Hi-C contacts are typically less well defined than NMR structure ensembles. Therefore traditional measures to characterize and compare structure ensembles such as RMSDs are of limited use.

The structural properties of an ensemble of chromosome conformations can be assessed by analyzing the distance matrices computed for all members of the ensemble. This does not require a structural superposition of the ensemble members and also works in the presence of mirror images. We mainly use the sample mean and standard deviation over all distance matrices from one ensemble for comparison with another ensemble. More formally, let a posterior ensemble contain _{m} for every ensemble member. The elements of _{m} are:

For a pair of beads

The agreement of two structure ensembles can be assessed by comparing the corresponding average distance matrices. The cross-correlation coefficient of two distance matrices provides a measure of the similarity of the structure ensembles, but since the entries in the distance matrices are highly dependent, it is unclear if the correlation is statistically significant. The Mantel test [^{4} random shufflings of one of the distance matrices).
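A permutation test of this kind can be sketched as follows. Rows and columns of one matrix are shuffled together, which corresponds to relabeling the beads, and the correlation is recomputed for each shuffle. The one-sided p-value, the default number of permutations and the helper names are choices made here for illustration.

```python
import numpy as np

def pairwise_distances(coords):
    """Euclidean distance matrix of an N x 3 coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def mantel_test(a, b, n_perm=999, rng=None):
    """Correlation of two distance matrices and a one-sided permutation p-value."""
    rng = np.random.default_rng() if rng is None else rng
    iu = np.triu_indices_from(a, k=1)
    observed = np.corrcoef(a[iu], b[iu])[0, 1]
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(a.shape[0])
        shuffled = b[np.ix_(perm, perm)]   # permute rows and columns jointly
        if np.corrcoef(a[iu], shuffled[iu])[0, 1] >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)
```

Permuting bead labels rather than individual matrix entries preserves the dependence structure within each distance matrix, which is what makes the resulting p-value meaningful.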

Structural ensembles were clustered using spectral clustering [_{ij} is the RMSD between conformations

To validate our structure calculation method, we ran several test calculations for a three-dimensional Hilbert curve comprising 512 beads. We assumed a cutoff distance of 1.5 × bead radius, which resulted in a maximum of 3696 contacts (∼2.8% of the full distance matrix, which has 512 × 511/2 = 130816 elements in total).

With the complete set of observable contacts, we obtained an ensemble that comprised two principal conformers which are mirror images of each other. The first cluster shows an average RMSD of 0.09 bead radii from the correct model and is populated by 48% of all sampled structures. The second cluster achieves the same RMSD to the mirror image of the ground truth and is populated by the remaining 52% of all structures. This shows that ISD is capable of drawing correct samples from the posterior distribution: Distance data alone cannot distinguish between a 3D structure and its mirror image. Therefore, we expect that the original structure of the Hilbert curve and its mirror image should be present in the ISD ensemble with identical probability.

We also computed structures with sparsified versions of the full set of contacts by randomly selecting a smaller number of contacts. In the sparsified data sets, the number of contacts ranged from 3326 to 578 contacts corresponding to 2.8% to 0.44% of all distances (the sparsity of the contact data in case of the ISD bead model of the X chromosome at 500 kb resolution is 0.68%). For all data sets, we obtained an ensemble of structures that comprised two major conformers: an approximate version of the input structure and its mirror image. The average structures of both clusters are shown in

These tests show that (1) ISD is capable of generating statistically valid ensembles, which is indicated by the fact that we obtained a bimodal posterior ensemble, in which the peaks corresponding to the original structure and its mirror image have identical spread and population (see

To better judge the performance of ISD, we also computed structural models with other methods for chromosome structure inference for the Hilbert curve data.

We implemented the ShRec3D [

ShRec3D, MBO and the simulated annealing procedure (SA) by Nagano

An advantage of embedding algorithms such as ShRec3D is that the structural model is obtained by computing an eigendecomposition, which is very fast. However, the speed of the method comes at the price of being quite restrictive in the prior assumptions that can be imposed on the chromosome fiber. Another serious limitation is that only a single structure is obtained (modulo reflections and rigid transformations). Therefore, it is not possible to make statements about the precision of the reconstructed chromosome structure.

Manifold-based optimization (MBO) [

In contrast to classical MDS, the optimization problem that MBO solves is no longer convex due to a rank constraint, which guarantees that the solutions live in 3D space. As a consequence, there are multiple minima which the optimizer locates. By running the optimization procedure multiple times from randomized initial structures, MBO produces a set of possible chromosome structures. However, the MBO ensemble does not have the same statistical foundation as the ISD ensemble. Because ISD uses a sampling algorithm to explore the posterior distribution, it produces a faithful representation of the information content of the data. This is reflected by a high correlation between the

^{−4} bead radii): MBO is overconfident about the reliability of the chromosome structure. The reason is that at this moderate level of data sparsity the optimization problem is still well-defined, such that the optimizer always finds the same minimum structure (and its mirror image). As the completeness of the data dwindles, MBO starts to produce slightly more heterogeneous ensembles, but the ensemble spread is still more than an order of magnitude smaller than the accuracy. For the sparsest data set (0.44% of all distances), the spread of the MBO ensemble is still only 0.14 bead radii. Therefore, the spread of the MBO ensemble is not a meaningful indicator of the reliability of the reconstructed chromosome structure and seems to mostly reflect the difficulty of the optimization problem. The spread of the ISD ensemble, on the other hand, gives a reasonable estimate of the true accuracy of the ensemble.

We also applied the simulated annealing approach by [

One great advantage of embedding methods is that they are many orders of magnitude faster than a full ISD simulation. To give a rough estimate: A single HMC proposal generated by a trajectory of 250 MD steps takes ∼70% of the computation time of running ShRec3D. A typical ISD simulation consists of 50 replicas, each of which executes many HMC sampling steps (10 × 250 MD steps between two replica transitions). We typically sample 1000 replica transitions, which amounts to a total number of 50 × 10 × 250 × 1000 = 1.25 × 10^{8} gradient evaluations / MD steps. On a single CPU this would take approximately 3.5 × 10^{5} times the running time of ShRec3D. However, one has to keep in mind that running ShRec3D takes less than half a second at 500 kb resolution. Moreover, convergence of the replica simulation is usually very fast (on the order of 20 to 50 replica transitions), such that on a computer cluster a full ISD calculation can be run in one or two hours. If one is only interested in a single structure and does not aim to sample the posterior distribution exhaustively, structural models can be obtained by running only the HMC algorithm (without the overhead introduced by parallel tempering / replica exchange simulation), which takes only a few minutes and is similar in speed to the SA implementation by Nagano
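The cost estimate for a full replica simulation can be retraced with a few lines of arithmetic. The variable names are illustrative; all numbers are the ones quoted in the paragraph above.

```python
# Simulation settings quoted in the text.
n_replicas = 50
hmc_steps_per_transition = 10
md_steps_per_hmc = 250
n_transitions = 1000

# One HMC proposal (250 MD steps) costs about 70% of a single ShRec3D run.
hmc_cost_in_shrec3d_runs = 0.7

# Total number of HMC proposals summed over all replicas and transitions.
n_hmc = n_replicas * hmc_steps_per_transition * n_transitions

# Total number of MD steps, i.e. gradient evaluations: 125000000 = 1.25e8.
n_md = n_hmc * md_steps_per_hmc

# Single-CPU cost in units of one ShRec3D run: about 3.5e5.
relative_cost = n_hmc * hmc_cost_in_shrec3d_runs
```

The factor of 3.5 × 10^{5} follows from the 5 × 10^{5} HMC proposals, each costing roughly 0.7 ShRec3D runs.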

Simulations were performed using an extended version of the ISD library [

Pooled histogram of all distances involved in an experimentally observed contact

(PDF)

(PDF)

(PDF)

(PDF)

(PDF)

For details see caption of

(PDF)

(PDF)

Shown are the correlation coefficients between the average distance matrices (in %) for structure ensembles obtained from cell 1 to cell 6. Lower diagonal: ISD ensemble vs ISD ensemble. Upper diagonal: ISD ensemble vs ensemble by Nagano

(PDF)

Percentage of restraints from different cells (columns) that are violated in structure ensembles (rows) with a tolerance of

(PDF)

(PDF)

(PNG)

Left: population Hi-C map from [

(PDF)

The cumulative number of

(PDF)

(PDF)

Top two rows show the original structure

(PNG)

Tests were run on the 512-Hilbert curve with dwindling number of input data. Each ISD ensemble comprised two major clusters corresponding to the input curve and its mirror image. For each cluster, the accuracy was assessed by computing the RMSD to the input structure

(PDF)

Ubiquitin is shown in the left panels, the HRDC domain is shown in the right panels. The top row shows the average bead models obtained from the mixed set of distances. The bottom row shows the average bead models obtained from mixed sets of sparse contacts. Ubiquitin models obtained with distance / sparse contact information are shown in panels

(PDF)

Structure ensembles were calculated with a reduced set of contacts where the

(PDF)

Tests for the calculation of a single structure were done on a standard laptop with a 2.90 GHz Core i7 8-core processor and 16 GB memory. At a resolution of 500 kb, the test consisted of 1000 steps of HMC starting from an extended structure. At a resolution of 50 kb, 10000 steps of HMC were run starting from an extended structure. To compute the posterior ensemble of the X chromosome, we ran the replica-exchange algorithm on a computer cluster using 50 CPUs. At a resolution of 500 kb, we simulated 200 replica transitions where each transition consisted of 10 steps of HMC, each using 250 leapfrog integration steps. At a resolution of 50 kb, we ran 1000 replica transitions on 50 CPUs. It is possible to shortcut the computation of the high-resolution structure by starting from a low-resolution model. The initial structure of the high-resolution model is obtained by using a 3D spline interpolation with 10-fold higher sampling. For the single structure calculation, it is sufficient to reduce the number of HMC steps to 1000. In case of the replica simulation, the correct ensemble is obtained after 200 replica transitions. The corresponding computation times are indicated by “500 kb + 50 kb” in the S4 Table.
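The initialization of the high-resolution model from a low-resolution structure can be sketched as follows. The text does not specify the type of spline, so this example assumes an interpolating Catmull-Rom spline; the function name `upsample_chain`, the placement of the new beads along the chain and the helical test structure are likewise assumptions made for illustration.

```python
import numpy as np

def upsample_chain(coords, factor=10):
    """Interpolate a bead chain with a Catmull-Rom spline at `factor`-fold sampling."""
    coords = np.asarray(coords, dtype=float)
    n = len(coords)
    # Pad by linear extrapolation so the end segments also have four control points.
    padded = np.vstack([2 * coords[0] - coords[1], coords, 2 * coords[-1] - coords[-2]])
    # Chain parameter of the new beads: `factor` new beads per original bead.
    t = np.linspace(0.0, n - 1, factor * n)
    seg = np.minimum(t.astype(int), n - 2)  # segment index of each new bead
    u = (t - seg)[:, None]                  # local coordinate in [0, 1]
    p0, p1, p2, p3 = (padded[seg + k] for k in range(4))
    # Standard uniform Catmull-Rom basis; passes through p1 at u=0 and p2 at u=1.
    return 0.5 * (2 * p1 + (p2 - p0) * u
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * u**2
                  + (3 * p1 - p0 - 3 * p2 + p3) * u**3)

# Hypothetical low-resolution model: 8 beads on a helix (illustration only).
s = np.arange(8)
low_res = np.column_stack([np.cos(s), np.sin(s), 0.3 * s])
high_res = upsample_chain(low_res, factor=10)
```

The interpolated chain passes through the first and last low-resolution beads and contains ten times as many beads, which matches the change in resolution from 500 kb to 50 kb.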

(PDF)

We thank Ernest Laue and Tim Stevens (University of Cambridge) for useful discussions and sharing data. We also thank Jonas Paulsen (University of Oslo) for help with the MBO code.