The authors have declared that no competing interests exist.

Many students are taught about genome assembly using the dichotomy between the complexity of finding Eulerian and Hamiltonian cycles (easy versus hard, respectively). This dichotomy is sometimes used to motivate the use of de Bruijn graphs in practice. In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems. We give 2 arguments. The first is that a genome reconstruction is never unique and hence an algorithm for finding Eulerian or Hamiltonian cycles is not part of any assembly algorithm used in practice. The second is that even if an arbitrary genome reconstruction were desired, one could find it in linear time in both the Eulerian and Hamiltonian paradigms.

When you learned about genome assembly algorithms, you might have heard a story that goes something like this:

In this paper, we explain that while de Bruijn graphs have indeed been very useful, the reason has nothing to do with the complexity of the Hamiltonian and Eulerian cycle problems.

We will first define the terms necessary to understand the above story. A Hamiltonian cycle in a graph is a cycle that visits every vertex exactly once, and an Eulerian cycle is a cycle that visits every edge exactly once. In general graphs, the problem of finding a Hamiltonian cycle is NP-hard, while finding an Eulerian cycle is solvable in polynomial time. Consider a set of reads

Every Eulerian cycle in a de Bruijn graph or a Hamiltonian cycle in an overlap graph corresponds to a single genome reconstruction where all the repeats (long sequences that appear more than once) are completely resolved (i.e., their place in the genome determined). For example,

Figure: a de Bruijn graph built from a set of k-mers. Panel A shows one Eulerian cycle in the graph (in blue). Panel B shows the only other Eulerian cycle in the same graph (in orange). Each cycle corresponds to a different genome reconstruction.

Instead, assemblers output contigs—long, contiguous segments which can unambiguously be inferred to be part of the genome. Finding such segments is a very different computational problem than finding a single Eulerian or Hamiltonian cycle (see Endnote 1). In fact, it was shown that finding all possible contigs can be done in polynomial time, regardless of whether the genome reconstruction is modeled as a Hamiltonian or Eulerian cycle [

Perhaps you are not convinced by the above reasoning? Fine. For the sake of argument, let us imagine that we really are interested in finding a single, arbitrary, genome reconstruction. But even in this case, the distinction between Eulerian and Hamiltonian cycles is misleading. We make our point with this theorem, which we first state informally (a formal statement and proof will come later):

Find an Eulerian cycle in the de Bruijn graph where the edges correspond to k-mers in the reads.

Find a Hamiltonian cycle in the de Bruijn graph where the edges correspond to all the possible (k+1)-mers that can be obtained from the reads’ k-mers.

The first part of the theorem should not be surprising. It states one half of the story we started with, namely that we can solve the assembly problem in linear time by finding an Eulerian cycle in a de Bruijn graph. The second part of the theorem, though, adds a twist. It is about finding a Hamiltonian cycle, but it differs from the initial story in 2 ways. First, it is a Hamiltonian cycle in a de Bruijn graph, not in an overlap graph. This might seem strange, but there is no special connection between overlap graphs and the Hamiltonian cycle problem; one is free to look for a Hamiltonian cycle in any graph one wishes. Second, the problem is solvable in linear time in this case, even though it is NP-hard in general. This might also seem strange, but in fact it is common for NP-hard problems to have polynomial-time solutions for a restricted class of inputs. For example, the satisfiability problem is NP-hard in general but is solvable in polynomial time when every clause is restricted to at most 2 literals (2-SAT).

What the theorem states, then, is that one can solve the assembly problem in linear time by finding a Hamiltonian cycle within an appropriately defined de Bruijn graph. The fact that the Hamiltonian cycle problem is NP-hard in general graphs is not directly relevant. What is important is the underlying structure of the de Bruijn graph, which makes the Hamiltonian cycle problem easy to solve (for more details, see Endnote 2). Hence, the initial story was right in the sense that using de Bruijn graphs is a good idea but wrong to imply that the complexity of the Hamiltonian cycle problem is a reason. All of this is of course assuming we are, for some reason, interested in an arbitrary genome reconstruction, which, as we argued earlier, we typically are not.
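To make the two graphs of the informal theorem concrete, here is a minimal Python sketch. It rests on assumptions that are ours, not the paper's: the function names `circular_kmers`, `build_g1`, and `build_g2` are hypothetical, the genome is treated as a circular string, and "all the possible (k+1)-mers that can be obtained from the reads' k-mers" is read as merging every pair of k-mers that overlap by k−1 characters.

```python
from collections import defaultdict


def circular_kmers(genome, k):
    """Set of k-mers of a circular string (illustrative helper)."""
    doubled = genome + genome[:k - 1]
    return {doubled[i:i + k] for i in range(len(genome))}


def build_g1(kmers):
    """Graph for the Eulerian formulation.

    Vertices are (k-1)-mers; every k-mer x contributes one edge
    from its (k-1)-prefix to its (k-1)-suffix.
    """
    g1 = defaultdict(list)
    for x in sorted(kmers):
        g1[x[:-1]].append(x[1:])
    return dict(g1)


def build_g2(kmers):
    """Graph for the Hamiltonian formulation.

    Vertices are the k-mers themselves; there is an edge x -> y whenever
    the (k-1)-suffix of x equals the (k-1)-prefix of y. Each such edge
    spells the (k+1)-mer x + y[-1].
    """
    by_prefix = defaultdict(list)
    for y in sorted(kmers):
        by_prefix[y[:-1]].append(y)
    return {x: list(by_prefix.get(x[1:], [])) for x in sorted(kmers)}


kmers = circular_kmers("ACGACTACC", 3)
g1 = build_g1(kmers)
g2 = build_g2(kmers)
print(g1["AC"])   # successors of the branching (k-1)-mer in the first graph
print(g2["GAC"])  # k-mers that can follow GAC in the second graph
```

In this toy input the repeated 2-mer `AC` is the only branching vertex of the first graph, and the same ambiguity shows up in the second graph as the three possible successors of every k-mer ending in `AC`.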

We will now give some definitions to prove the main theorem. Let _{i}(_{i}(^{k}(^{k}(^{k−1}(_{k−1}(_{k−1}(_{k}(_{k}(

The main theorem follows almost directly from definitions. The proof here is based on first principles, for expository purposes, but it is actually a corollary of deeper results (see Endnote 4).

^{k}(^{k}(^{k+1}(^{k}(^{k}(

^{k}(_{1} = ^{k}(_{2} = ^{k+1}(_{2} is ^{k}(^{k}(^{k}(^{k}(_{2} is

Observe that a sequence of k-mers C_{1} = x_{0},…,x_{n−1} is a sequence of edges defining an Eulerian cycle in G_{1} if and only if the set of k-mers of C_{1} is exactly the set of k-mers labeling the edges of G_{1} and, for every i, suf_{k−1}(x_{i}) = pre_{k−1}(x_{i+1 mod n}), where suf_{k−1} and pre_{k−1} denote the suffix and prefix of length k−1. Also, observe that a sequence of k-mers C_{2} = x_{0},…,x_{n−1} is a sequence of vertices defining a Hamiltonian cycle in G_{2} if and only if the exact same criterion holds, i.e., the set of k-mers of C_{2} is exactly the same set and suf_{k−1}(x_{i}) = pre_{k−1}(x_{i+1 mod n}) for every i. Thus, there is a one-to-one correspondence between Eulerian cycles in G_{1} and Hamiltonian cycles in G_{2}.
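The criterion in the observation above can be checked mechanically. The following Python sketch is only an illustration: the function name `is_reconstruction` and the toy k-mer set are our assumptions, and the brute-force enumeration is there purely to make the correspondence visible on a small input, not as a practical algorithm.

```python
from itertools import permutations


def is_reconstruction(order, kmers):
    """Check the shared criterion for a cyclic order of k-mers.

    The same test certifies an Eulerian cycle (k-mers read as edges)
    and a Hamiltonian cycle (k-mers read as vertices): every k-mer is
    used exactly once, and consecutive k-mers overlap by k-1 characters,
    including the wrap-around from the last k-mer to the first.
    """
    n = len(order)
    overlaps = all(order[i][1:] == order[(i + 1) % n][:-1] for i in range(n))
    return overlaps and n == len(kmers) and set(order) == set(kmers)


# Toy input (an assumption for illustration): the 3-mers of the
# circular string ACGACTACC.
kmers = {"ACG", "CGA", "GAC", "ACT", "CTA", "TAC", "ACC", "CCA", "CAC"}

# Fix the first k-mer so that rotations of a cycle are counted once.
first, *rest = sorted(kmers)
cycles = [
    (first, *p) for p in permutations(rest)
    if is_reconstruction((first, *p), kmers)
]

# Each valid order spells one circular genome reconstruction.
genomes = sorted("".join(x[0] for x in c) for c in cycles)
print(len(cycles), genomes)
```

Because a single predicate decides both questions, every order accepted here is at once an Eulerian cycle in the first graph and a Hamiltonian cycle in the second; on this input there are exactly two such orders.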

For the running time, an Eulerian cycle can be found in time linear in the number of edges using a classical algorithm, e.g., Hierholzer’s Algorithm. Given the one-to-one correspondence above, the vertex labels of a Hamiltonian cycle in G_{2} can be found by outputting the edge labels of an Eulerian cycle in G_{1}. Hence, the running times for the 2 problems are equivalent.
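As an illustration of the running-time argument, here is a minimal sketch of an iterative Hierholzer-style traversal in Python. The function name `eulerian_kmer_order` and the toy input are hypothetical, and the sketch assumes that an Eulerian cycle exists (it does not verify degree balance or connectivity).

```python
from collections import defaultdict


def eulerian_kmer_order(kmers):
    """Return the k-mers in the order an Eulerian cycle traverses them.

    Vertices are (k-1)-mers and every k-mer is one edge from its prefix
    to its suffix. Assumes the graph actually has an Eulerian cycle.
    """
    out_edges = defaultdict(list)
    for x in sorted(kmers, reverse=True):
        out_edges[x[:-1]].append(x)  # pop() then yields the smallest k-mer first

    # Each stack entry is (current vertex, k-mer used to arrive there).
    stack = [(min(kmers)[:-1], None)]
    order = []
    while stack:
        vertex, _ = stack[-1]
        if out_edges[vertex]:
            x = out_edges[vertex].pop()      # follow an unused edge
            stack.append((x[1:], x))
        else:
            _, arrived_by = stack.pop()      # dead end: emit the arriving edge
            if arrived_by is not None:
                order.append(arrived_by)
    order.reverse()
    return order


# Toy input (an assumption for illustration): 3-mers of circular ACGACTACC.
kmers = {"ACG", "CGA", "GAC", "ACT", "CTA", "TAC", "ACC", "CCA", "CAC"}
order = eulerian_kmer_order(kmers)
genome = "".join(x[0] for x in order)  # one arbitrary circular reconstruction
print(order)
print(genome)
```

Each k-mer is pushed and popped once, so the traversal itself is linear in the number of k-mers (the initial sort is only there to make the output deterministic). Read as a list of edge labels, `order` is an Eulerian cycle; read as a list of vertex labels in the graph whose vertices are k-mers, the very same list is a Hamiltonian cycle, so no separate algorithm is needed.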

So why are de Bruijn graphs so popular for short read assembly, if not for the difference in the complexity of finding Eulerian and Hamiltonian cycles? The answer is complex, which might explain why the initial simple story was appealing. It may have to do with the simplicity of their implementation, the appeal of the k-mer abstraction, the ease of error correction, or with something else. In fact, the difference between using de Bruijn graphs and overlap graphs is poorly understood and is a fascinating open research problem. But the Eulerian and Hamiltonian cycle dichotomy is not really relevant to assembly or to the popularity of de Bruijn graphs.

For those who are curious, the problem of finding all possible contigs was first mentioned in [

There is much more that has been said on the complexity of finding a single genome reconstruction. For a starting point, see the papers [

The definition of de Bruijn graph that we give here is sometimes referred to as the edge-centric dBG, not to be confused with a node-centric one (see

The main theorem can be viewed as a corollary of a similar result for full de Bruijn graphs (see Endnote 3) and of the relationship between Eulerian cycles in a digraph and Hamiltonian cycles in its line digraph. This can be seen by observing that G_{2} is the line digraph of G_{1} and then using the same type of argument as de Bruijn [

PM would like to thank Rayan Chikhi for feedback on the text and Alexandru Tomescu and Michael Brudno for many helpful discussions on this topic (and Michael specifically for introducing him to the problem a long time ago).