The authors have declared that no competing interests exist.

Conceived and designed the experiments: JP OG PC. Performed the experiments: OG. Analyzed the data: JP. Contributed reagents/materials/analysis tools: JP OG. Wrote the paper: JP OG PC.

The three-dimensional (3D) structure of the genome is important for orchestration of gene expression and cell differentiation. While mapping genomes in 3D has for a long time been elusive, recent adaptations of high-throughput sequencing to chromosome conformation capture (3C) techniques allow for genome-wide structural characterization for the first time. However, reconstruction of "consensus" 3D genomes from 3C-based data is a challenging problem, since the data are aggregated over millions of cells. Recent single-cell adaptations of the 3C technique allow for non-aggregated assessment of genome structure, but the data suffer from sparse and noisy interaction sampling. We present a manifold based optimization (MBO) approach for the reconstruction of 3D genome structure from chromosomal contact data. We show that MBO is able to reconstruct 3D structures based on the chromosomal contacts, imposing fewer structural violations than comparable methods. Additionally, MBO is suitable for efficient high-throughput reconstruction of large systems, such as entire genomes, allowing for comparative studies of genomic structure across cell-lines and different species.

Understanding how the genome is folded in three-dimensional (3D) space is crucial for unravelling the complex regulatory mechanisms underlying the differentiation and proliferation of cells. With recent high-throughput adaptations of chromosome conformation capture in techniques such as single-cell Hi-C, it is now possible to probe 3D information of chromosomes genome-wide. Such experiments, however, only provide sparse information about contacts between regions in the genome. We have developed a tool, based on manifold based optimization (MBO), that reconstructs 3D structures from such contact information. We show that MBO allows for reconstruction of 3D genomes more consistent with the original contact map, and with fewer structural violations compared to other, related methods. Since MBO is also computationally fast, it can be used for high-throughput and large-scale 3D reconstruction of entire genomes.

Understanding genomes in three dimensions (3D) is a fundamental problem in biology. Recently, the combination of chromosome conformation capture (3C) methods with next-generation sequencing, such as 5C [

Several approaches have been proposed to take into account the dynamic nature of chromatin and the aggregated nature of the data. Baù et al. [

Another class of methods for identifying 3D chromatin structure from chromosomal contact data relies on reconstructing a “consensus” 3D structure from a (possibly incomplete and noisy) Euclidean distance matrix (EDM) consisting of pairwise distances (in 3D) between different regions in the genome. In general, this EDM is not known, but is typically estimated from the interaction frequency matrix. Given an EDM, various optimization approaches that fall under the general topic of multidimensional scaling (MDS) (see e.g. [

The most basic form of MDS is the so-called classical (or metric) MDS, where the optimal coordinate reconstruction from a given EDM is found directly by eigen decomposition of the so-called Gram matrix (see

Other optimization methods applied on MDS problems to find coordinates from incomplete distances exploit the rank constraints on the EDM (or corresponding Gram matrix) to find an optimal EDM for the relevant spatial dimension. One successful method in this respect is based on convex semidefinite programming [

Technological improvements have also facilitated the reconstruction of 3D genome structures. In particular, adjustments to the Hi-C protocol have been introduced to enable identification of interactions between chromosome regions in single cells [

One way to handle these limitations is to replace missing distances with their ‘shortest-path’ equivalence; that is, considering the existing (observed) entries in the EDM as weighted edges in a graph, and replacing each missing edge weight with the smallest possible sum of weights traversing the graph along the observed edges [

An application of single-cell like contact maps coupled with missing-value imputation using the shortest-path method and classical MDS to find 3D coordinates, was recently proposed [

Another approach proven to be effective on many optimization problems relies on optimization on manifolds. The problem of finding optimal coordinates from an EDM can be formulated as an optimization problem on the manifold of the set of positive semidefinite matrices of fixed rank [

In this paper, we show that the manifold based optimization (MBO) approach can be successfully applied to 3D genome reconstruction. MBO significantly outperforms the simpler methods based on classical MDS in terms of consistency with the original contact map and structural violations, while remaining sufficiently efficient to handle large-scale problems.

Using both simulated and real single-cell Hi-C data, we show that, by combining the shortest-path derived distances with appropriate weights to reduce the influence of noise, MBO can efficiently reconstruct 3D structures consistent with the chromosome contact maps, despite the noisy and sparse nature of the data. Our implementation of the manifold optimization method is based on the Manopt software [

In the following sections, we apply MBO to reconstruct the 3D structure of genomes in two types of settings, and compare to two other popular approaches. First, to evaluate the method’s ability to reconstruct a known 3D structure, we consider a given

Given a matrix of interaction frequencies, typically from a Hi-C or single-cell Hi-C data set, we seek to reconstruct the corresponding 3D coordinates of the genome structure. In classical MDS (CMDS), this reconstruction is performed by converting the contact frequencies into an EDM (

A: Original chromosomal contact map (_{ij}) based on chromosome X from cell 1. A blue dot indicates the presence of an observed interaction in the single-cell Hi-C data set. B: Distance matrix (_{ij}) consisting of Euclidean distances. C: Weight matrix (_{ij}), containing numbers between 0 and 1 giving the weight for each of the distances in the Euclidean distance matrix to the left. See the

In our method, which relies on manifold based optimization (MBO) [

As a first validation of the method, we have considered an

The structure considered in this validation is a 3D model of mouse haploid chromosome X generated from single-cell Hi-C data by Nagano et al. [

(1–Spearman rank correlation) between the original and reconstructed distances in a single structure of chromosome X from [

For the weakly noisy case (

In the noiseless case (^{6/5}

We inspected the ability of MBO to reconstruct the considered original structure when the missing distances approach this limit. The original structure can be exactly reconstructed with up to ∼ 90% missing data (

A: Original chromosome X structure from [

To inspect this dependency further, we calculated the minimum ratio of observed distance values needed for complete reconstruction ([1-

Typical computation times for the methods considered in the validation performed above are shown in

Computational time (in seconds) for reconstructing a single chromosome structure using three different algorithms: CMDS (dark blue), ChromSDE (green), and the MBO algorithm (red) presented here. For comparison, the shortest path algorithm (light blue) is also shown. The computational time is shown as a function of structure size

Next, we examined the ability of MBO and CMDS to reconstruct contact maps for the full set of chromosomes, based on single-cell Hi-C data [

We started by considering chromosome X, where only one copy is present in the data. We found that MBO was able to reconstruct the original contact map of the haploid X nearly completely (> 95% reconstructed in both cases). CMDS, on the other hand, reconstructed no more than ∼50–60% of the contacts in the contact matrix of chromosome X (Figs

A: Original contact map. Blue dot indicates the presence of a contact in the single-cell Hi-C data set for chromosome X (cell 1). B: Contact map obtained after 3D reconstruction using MBO, based on the contact map (in A) and then re-calculating the contacts. C: Reconstructed contact map, as in B, but using CMDS. D: Reconstructed 3D structure using MBO, corresponding to the contact map in B. E: Reconstructed 3D structure using CMDS, corresponding to the contact map in C. Each bead in D and E has a diameter of 150 nm. Lines represent connected beads with missing bead position information.

Consistency of the structures obtained from reconstructing all chromosomes for cell 1 (left) and 2 (right) using MBO (blue) and CMDS (red). A: Reconstruction accuracy, given as the percent correct contacts when comparing original and reconstructed contact maps for different chromosomes. B: Distance violation, given as the occurrence (in percent) of regions in the structures that are below the minimum distance (at 30 nm). C: Connectivity violation, given as the occurrence (in percent) of consecutive regions in the structures that are further away than the maximum distance (200 nm). Blue bars indicate the performance of MBO, while red bars indicate the performance of CMDS.

Interestingly, for homologous chromosome pairs, where two chromosome copies are present, reconstruction was not as consistent with the input contact maps as for chromosome X, as only ∼20% of the contacts in the original maps could be reconstructed (

We were interested in investigating the effect of having possibly mutually exclusive contact information from two separate chromosome X structures from cell 1 and cell 2. We therefore randomly sampled 50 new datasets consisting of an equal number of contacts from the matrices from these two cells and inspected the ability of MBO to reconstruct structures corresponding to the resulting contact maps. As

For homologous chromosome pairs, MBO and CMDS performed similarly in terms of percentage of successfully established interactions (

Since MBO, like most optimization strategies for structural reconstruction, solves a non-convex problem, the optimized structures might depend on the random starting configuration of the optimization. To study this effect, we ran 100 independent optimizations of chromosome X using different random initializations of the starting configuration. We then calculated the root-mean-square deviation (RMSD) between the resulting superimposed structures, and found a high degree of similarity between all the 100 chromosome X structures, with an average RMSD of ∼ 322 nm, similar to what was reported in [

The heatmap shows clustered RMSD values between 100 independent optimizations with random initial configurations prior to using MBO on chromosome X. The dendrogram above shows the result of the hierarchical clustering based on the RMSD values. At the bottom, 5 example structures from each cluster are shown.

In

For comparison, we applied MBO using a weighting scheme where the shortest-path completed matrix was used directly without accompanying weights. In

All in all, we have shown that MBO reconstructs 3D structures consistent with the input chromosomal contact data, at the same computational speed as the popular CMDS approach. Additionally, MBO imposes fewer violations relating to the connectivity of the chain, as well as fewer violations from placing regions too close to each other. We have shown that MBO can be used for routine reconstruction of 3D structures from sparsely sampled data, such as single-cell Hi-C.

In contrast to methods such as MCMC and molecular dynamics, methods aiming at reconstructing a single consensus 3D structure can be utilized quickly and in a high-throughput fashion. One challenge with such approaches, however, has been the lack of a flexible and robust way to handle sparse and noisy interaction frequency matrices. In this paper, we have shown that combining weights with manifold based optimization (MBO) allows for reconstructing 3D structures of genomes, even when data are sparse and noisy, such as for single-cell Hi-C. We have shown that the weights allow for prioritization of interactions where information about spatial positioning is found, while allowing the remaining regions to be positioned in a consistent fashion. Specifically, by comparing the reconstructed and original contact maps, we have shown that the single copy of chromosome X in male mouse cells can be reconstructed in a fashion consistent with the input data. For homologous chromosome pairs, however, reconstruction was not complete, most likely due to considerable structural differences between the two chromosome copies.

We note that it is also possible to run MBO on ensemble Hi-C datasets, since the weighting scheme is equally applicable in this case. However, the assumption of a consensus structure would in this case probably be less justifiable, due to the known inherent variability in chromosome interactions across cells in a large population.

As chromosome conformation capture data are becoming increasingly available [

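A fundamental problem relevant for many applications in various disciplines is to find some coordinates $x_i \in \mathbb{R}^{r}$, $i = 1, \dots, n$, given the Euclidean distance matrix (EDM) $D \in \mathbb{R}^{n \times n}$, whose entries $D_{ij} = \lVert x_i - x_j \rVert^2$ are the squared distances between the points $x_i$.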

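If the EDM is known exactly (without noise or missing entries), the coordinates $x_i$ can be uniquely determined up to arbitrary rotations and translations by introducing the matrix $B \in \mathbb{R}^{n \times n}$, $B = -\tfrac{1}{2} J D J$ with $J = I - \tfrac{1}{n} \mathbf{1} \mathbf{1}^{T}$, where $I \in \mathbb{R}^{n \times n}$ is the identity matrix and $\mathbf{1} \in \mathbb{R}^{n}$ is a vector of all ones. If $B = Q \Lambda Q^{T}$, where $\Lambda \in \mathbb{R}^{r \times r}$ is the diagonal matrix with the $r$ nonzero eigenvalues of $B$, $Q$ holds the corresponding eigenvectors, and $X \in \mathbb{R}^{n \times r}$ is the matrix with the (centered) $x_i$ as its rows, it is easy to see that $B = X X^{T}$, thus $X = Q \Lambda^{1/2}$ up to a rotation.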
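As a small numerical illustration of the double-centering step, the following Python sketch builds the matrix of squared distances for a few toy points, applies the centering, and compares the result with the Gram matrix of the centered coordinates. The function names and the toy points are assumptions made for this example; the sketch stops short of the eigendecomposition, which would normally be done with a linear algebra library.

```python
def squared_distance_matrix(points):
    """EDM holding the squared Euclidean distance between every pair of points."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def double_center(edm):
    """Return B = -1/2 * J * D * J with J = I - (1/n) * ones * ones^T."""
    n = len(edm)
    row_mean = [sum(row) / n for row in edm]
    col_mean = [sum(edm[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (edm[i][j] - row_mean[i] - col_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]


def centered_gram(points):
    """Gram matrix X * X^T of the coordinates after subtracting the centroid."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - centroid[k] for k in range(dim)] for p in points]
    return [[sum(a * b for a, b in zip(u, v)) for v in centered]
            for u in centered]


# Toy coordinates (hypothetical, for illustration only).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
gram_from_distances = double_center(squared_distance_matrix(points))
gram_from_coords = centered_gram(points)

# Largest entry-wise difference between the two Gram matrices.
max_diff = max(abs(gram_from_distances[i][j] - gram_from_coords[i][j])
               for i in range(len(points)) for j in range(len(points)))
```

The two matrices agree to rounding error, which is why an exact, complete EDM determines the coordinates up to rotation and translation.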

In many practical applications, however, the EDM may contain noisy and missing values. In this case, finding optimal coordinates _{i} must be treated as an optimization problem of finding coordinates that minimize some cost function based on the known distances. If all pair-wise distances between points are known, but not necessarily accurately, one solution to the optimization problem is given in terms of classical multidimensional scaling (CMDS). CMDS basically solves the optimization problem of finding a matrix

A manifold based optimization approach for the completion of Euclidean distance matrices was recently presented in Mishra et al. [^{n}(

For the application of this approach to the case of the 3D genome reconstruction we have applied a slightly more general framework where the weights are allowed to take any non-negative values (not restricted to 0 and 1). In addition, we choose to minimize the differences between the ordinary Euclidean distances rather than the squared distances used in ^{n}(^{T}, where ^{n×r} and

To minimize

Manopt includes a number of different solvers for the optimization problem. Here we will employ a trust-region solver which, unlike steepest descent, utilizes information about both the gradient and the Hessian of the cost function, and has been shown to have good convergence rates. The gradient of ^{(2)} = ^{T}) ⊙
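As a rough illustration of the kind of weighted cost described above, the following Python sketch evaluates a weighted sum of squared differences between model distances and target distances, together with its gradient with respect to the coordinates. It is not the Manopt-based Matlab implementation: the exact normalization of the cost, the function names (`weighted_cost`, `weighted_gradient`) and the toy inputs are assumptions made for illustration, and no trust-region solver or Hessian is included.

```python
import math


def weighted_cost(coords, target, weights):
    """Sum over pairs i < j of w_ij * (||y_i - y_j|| - d_ij)^2."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(coords[i], coords[j])
            total += weights[i][j] * (r - target[i][j]) ** 2
    return total


def weighted_gradient(coords, target, weights):
    """Gradient of weighted_cost with respect to every coordinate."""
    n, dim = len(coords), len(coords[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r = math.dist(coords[i], coords[j])
            if r == 0.0:
                continue  # direction undefined for coincident points
            factor = 2.0 * weights[i][j] * (r - target[i][j]) / r
            for k in range(dim):
                grad[i][k] += factor * (coords[i][k] - coords[j][k])
    return grad


# Toy problem (hypothetical values): three points, all target distances 2.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
target = [[0.0, 2.0, 2.0], [2.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
weights = [[0.0, 1.0, 1.0], [1.0, 0.0, 0.5], [1.0, 0.5, 0.0]]

cost_value = weighted_cost(coords, target, weights)
grad_value = weighted_gradient(coords, target, weights)
```

Comparing such a hand-written gradient against finite differences of the cost is a simple sanity check before passing it to a solver.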

In addition to the gradient the trust-region algorithm also requires the Hessian in a given direction

From the known 3D structure, a true EDM was constructed containing the pair-wise squared distances between all the 600 kbp sized bins. To model the uncertainty and possible sparsity of distance information inferred from chromosomal contact data such as Hi-C, the original distance matrix was contaminated by adding random noise as well as randomly removing a given percentage of the distances. That is, from the original Euclidean distance matrix _{ij} is generated as
_{ij} are sampled randomly from a standard normal distribution and where 𝓝 is the set of entries (

Tests were run for different values for the noise level

The raw results from a single-cell Hi-C experiment typically list a number of observed contacts between specific genome positions. From the raw results, the contacts were aggregated into equally spaced bins along the chromosomes. For the results presented here a bin size of 50 kbp was used. Then all observed contacts were assigned to their corresponding bins. In the case that multiple contacts fell into the same bin, the duplicate entries were ignored so that a binary contact matrix _{ij} was obtained for each chromosome. Hence, _{ij} = 1 represents a Hi-C contact between bins i and j, while _{ij} = 0 represents the absence of a contact.
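The binning step can be sketched in a few lines of Python. This is a minimal illustration, assuming contacts are given as pairs of 0-based base-pair positions on one chromosome; the function name `bin_contacts`, the input format, and the choice to skip contacts whose two ends fall in the same bin are assumptions made here, not details taken from the original pipeline.

```python
def bin_contacts(contacts, chrom_length, bin_size=50_000):
    """Aggregate (pos_a, pos_b) contacts into a symmetric binary bin matrix."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in contacts:
        i, j = pos_a // bin_size, pos_b // bin_size
        if i == j:
            continue  # assumption: both ends in one bin give no pairwise information
        # Duplicates simply overwrite the same entry, so the matrix stays binary.
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


# Hypothetical contacts on a 200 kbp toy chromosome (four 50 kbp bins).
contacts = [(10_000, 120_000), (20_000, 130_000), (60_000, 170_000), (5_000, 30_000)]
contact_matrix = bin_contacts(contacts, chrom_length=200_000)
```

In this toy input the first two contacts fall into the same pair of bins and therefore produce a single entry.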

To use the MBO approach, the binary contact map must be converted into a distance matrix _{ij}. First a target distance _{c} is assigned to all bins with an observed Hi-C contact. Secondly, the connectivity along the chromosome is taken into account by assigning a distance _{n} to neighboring bins. Hence, as a first step the following matrix is constructed

Since the MBO method works also for incomplete distance matrices, the optimization could in principle be run directly on _{ij} be nonzero only for the nonzero entries of _{ij}. However, since only the local distances (contacts and neighboring bins) are known, a direct optimization of

Thus, we first replace the zero entries in _{ij} with the shortest-path derived distances. We then introduce the weight matrix _{ij} whose elements are chosen to be inversely proportional to the number of edges traversed in the shortest path, i.e. _{ij} is the number of edges that are needed to connect node
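The shortest-path completion and the hop-based weights can be illustrated with a small Python sketch. It assumes a Floyd-Warshall pass that also tracks the number of edges on each shortest path, a weight of exactly 1 divided by that number, and the contact distance taking precedence when two neighboring bins are also in contact; these choices, and all function names, are assumptions for illustration rather than the published implementation. The distances of 60 and 120 (nm) follow the values quoted in the text.

```python
import math


def local_distance_matrix(contact, contact_dist=60.0, neighbor_dist=120.0):
    """Known local distances only: contacts and chain neighbors; 0 means unknown."""
    n = len(contact)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and contact[i][j]:
                dist[i][j] = contact_dist
            elif abs(i - j) == 1:
                dist[i][j] = neighbor_dist
    return dist


def shortest_path_completion(dist):
    """Fill unknown entries with shortest-path distances and count edges used."""
    n = len(dist)
    full = [[0.0 if i == j else (dist[i][j] or math.inf) for j in range(n)]
            for i in range(n)]
    hops = [[1 if (i != j and dist[i][j]) else 0 for j in range(n)]
            for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = full[i][k] + full[k][j]
                if via < full[i][j]:
                    full[i][j] = via
                    hops[i][j] = hops[i][k] + hops[k][j]
    return full, hops


def hop_weights(hops):
    """Weight 1/hops for each pair; 0 on the diagonal."""
    return [[1.0 / h if h else 0.0 for h in row] for row in hops]


# Toy chain of five bins with one long-range contact between bins 0 and 4.
contact = [[0] * 5 for _ in range(5)]
contact[0][4] = contact[4][0] = 1

completed, hops = shortest_path_completion(local_distance_matrix(contact))
weights = hop_weights(hops)
```

Directly observed distances get weight 1, while entries inferred through longer paths contribute less to the cost.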

MBO is initialized with a random initial configuration (a random point on the manifold), and convergence is considered reached if the cost function or the norm of the gradient drops below a small value (1e-20 and 1e-08, respectively). After successful convergence of the optimization algorithm, the resulting coordinates _{i} are scaled to best agree with the original contact map. That is, we search for a scaling constant _{l} so that the scaled coordinates have _{c} pair-wise distances smaller than the contact distance _{c}, where _{c} is the number of contacts in the original contact matrix. Note that in the case of perfect agreement, the contact matrix derived from the coordinates _{l} _{i} will be identical to the original contact matrix, since the number of entries is the same. The optimal value for _{l} is found by a simple binary search method.
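The scaling step lends itself to a short sketch. The Python code below bisects on the scale factor, using the fact that the number of pairs closer than the contact distance can only decrease as the structure is scaled up. The function name `find_scale`, the fixed iteration count, the initial bracket, and the toy coordinates are assumptions made for illustration.

```python
import math
from itertools import combinations


def count_close_pairs(coords, scale, contact_dist):
    """Number of point pairs whose scaled distance is below contact_dist."""
    return sum(1 for a, b in combinations(coords, 2)
               if scale * math.dist(a, b) < contact_dist)


def find_scale(coords, n_contacts, contact_dist=60.0, iterations=60):
    """Bisect for the largest scale that keeps at least n_contacts close pairs."""
    smallest = min(math.dist(a, b) for a, b in combinations(coords, 2))
    low = 0.0                               # every pair is "close" at scale 0
    high = 2.0 * contact_dist / smallest    # no pair is close at this scale
    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if count_close_pairs(coords, mid, contact_dist) >= n_contacts:
            low = mid
        else:
            high = mid
    return low


# Hypothetical unscaled coordinates; pairwise distances are 1, 2, 3, 4, 6, 7.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
scale = find_scale(coords, n_contacts=2)
n_close = count_close_pairs(coords, scale, 60.0)
```

When several pairs have exactly the same distance, no scale may give exactly the requested count; the sketch then returns a scale with at least that many close pairs.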

The percent correct contacts was calculated by direct comparison of the original and reconstructed contact matrices. Minimum distance violations were defined as the percentage of pairwise distances below 30 nanometers. Connectivity violations were defined as the percentage of neighboring (connected) bins with a distance above 200 nanometers. In _{c} = 60 nm, _{n} = 120 nm.
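These three measures can be sketched in Python as follows. This is an illustrative reading of the definitions above: in particular, counting a contact as correct when an original contact is also present in the reconstructed matrix is an assumption, as are the function names and the toy coordinates.

```python
import math
from itertools import combinations


def percent_correct_contacts(original, reconstructed):
    """Percentage of original contacts that reappear in the reconstructed map."""
    n = len(original)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if original[i][j]]
    hits = sum(1 for i, j in pairs if reconstructed[i][j])
    return 100.0 * hits / len(pairs)


def min_distance_violations(coords, min_dist=30.0):
    """Percentage of all pairwise distances below min_dist (nm)."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return 100.0 * sum(d < min_dist for d in dists) / len(dists)


def connectivity_violations(coords, max_dist=200.0):
    """Percentage of consecutive bins further apart than max_dist (nm)."""
    steps = [math.dist(a, b) for a, b in zip(coords, coords[1:])]
    return 100.0 * sum(d > max_dist for d in steps) / len(steps)


def contacts_from_coords(coords, contact_dist=60.0):
    """Binary contact matrix: 1 where two different bins are closer than contact_dist."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < contact_dist else 0
             for j in range(n)] for i in range(n)]


# Hypothetical reconstructed coordinates (nm) and original contact map.
coords = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (110.0, 0.0, 0.0), (400.0, 0.0, 0.0)]
original = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]

accuracy = percent_correct_contacts(original, contacts_from_coords(coords))
too_close = min_distance_violations(coords)
broken_links = connectivity_violations(coords)
```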

MBO is implemented in Matlab, and is based on the Manopt software [

Consistency of the structures obtained from reconstructing all chromosomes for cells 1–10 using MBO (blue) and CMDS (red). Top panel: Reconstruction accuracy, given as the percent correct contacts when comparing original and reconstructed contact maps for different chromosomes. Middle panel: Distance violation, given as the occurrence (in percent) of regions in the structures that are below the minimum distance (at 30 nm). Bottom panel: Connectivity violation, given as the occurrence (in percent) of consecutive regions in the structures that are further away than the maximum distance (200 nm). Blue bars indicate the performance of MBO, while red bars indicate the performance of CMDS.

(PDF)

Left panel: Reconstruction accuracy, given as the percent correct contacts when comparing original and reconstructed contact maps. Right panel: Connectivity violation, given as the occurrence (in percent) of consecutive regions in the structures that are further away than the maximum distance (200 nm). Red dots correspond to a 3D reconstruction of chromosome X from cell 1, and blue dots correspond to a 3D reconstruction of chromosome X from cell 2. The purple circles correspond to optimizations from 50 independent randomly sampled data sets with equal amounts of contacts from cell 1 and cell 2. The thick purple line indicates the median, while the thin purple lines indicate the 25th and 75th percentiles.

(PDF)

A: Reconstructed 3D structure using MBO, where each bin is represented as a bead with a diameter of 150 nm. B: Same reconstructed 3D structure as in A, but where each bin is connected by a line to show the trace of the chromosomal structure. C: Original contact map. Blue dot indicates the presence of a contact in the single-cell Hi-C data set for chromosome 1 (cell 1). D: Contact map obtained after 3D reconstruction using MBO and then re-calculating the contacts.

(PDF)

The heatmap shows clustered RMSD values between 100 independent optimizations with random initial configurations prior to using MBO on chromosome 1. The dendrogram above shows the result of the hierarchical clustering based on the RMSD values. At the bottom, 5 example structures are shown.

(PDF)

Consistency of the structures obtained from reconstructing all chromosomes for cell 1 (left) and 2 (right) using MBO without weights (blue) and CMDS (red). Top panels: Reconstruction accuracy, given as the percent correct contacts when comparing original and reconstructed contact maps for different chromosomes. Middle panels: Distance violation, given as the occurrence (in percent) of regions in the structures that are below the minimum distance (at 30 nm). Bottom panels: Connectivity violation, given as the occurrence (in percent) of consecutive regions in the structures that are further away than the maximum distance (200 nm).

(PDF)

Consistency of the structures obtained from reconstructing all chromosomes for cell 1 using MBO (blue) and MBO with squared distances in

(PDF)

Same as

(PDF)

A: To find the optimal

(PDF)

Column 1: chromosome, column 2: optimized

(TXT)