
The authors have declared that no competing interests exist.

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say a set of k-mers is a universal hitting set if every possible sequence of length L contains at least one k-mer from the set. We present DOCKS, a heuristic for finding compact universal hitting sets, and show that they can be used wherever minimizers are used today.

High-throughput sequencing data has been accumulating at an extreme pace. The need to efficiently analyze and process it has become a critical challenge of the field. Many of the data structures and algorithms for this task rely on selecting a representative set of k-mers, the length-k substrings of the sequences.

The pace of high-throughput sequencing keeps accelerating as it becomes cheaper and faster, and with it grows the need for faster and more memory-efficient genomic analysis methods. The NIH Sequence Read Archive, for example, currently contains over 12 petabases of sequence data and is growing at a fast pace. Increased use of sequence-based assays (DNA sequencing, RNA-seq, numerous other “*-seq”s) in research and in clinical settings creates a high computational processing burden. Metagenomic studies generate even larger sequencing datasets. New fundamental computational ideas are essential to manage and analyze these data.

The minimizer approach has been extremely successful in increasing the efficiency of several sequence analysis tasks. Given a sequence of length L and an order on k-mers (e.g., lexicographic), its minimizer is the smallest of its L − k + 1 constituent k-mers. Two sequences that share a sufficiently long substring are guaranteed to select the same minimizer inside that substring, which makes minimizers useful hooks for overlap detection, sampling, and indexing.
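As a concrete illustration (a minimal sketch, not the authors' implementation), the minimizer of a window can be computed as follows; the lexicographic order used here is just one possible k-mer ordering, and the function name is ours:

```python
# Illustrative sketch: select the minimizer of a window, i.e. the smallest
# of its constituent k-mers under a given k-mer order.

def window_minimizer(window, k, order=None):
    """Return the smallest k-mer in `window` under `order` (default: lexicographic)."""
    key = order if order is not None else (lambda kmer: kmer)
    kmers = [window[i:i + k] for i in range(len(window) - k + 1)]
    return min(kmers, key=key)

# Two sequences that both contain the substring "ACGTACGTG" will both
# select its minimizer, here the lexicographically smallest 3-mer:
print(window_minimizer("ACGTACGTG", 3))  # "ACG"
```

Passing a randomized hash as `order` yields the randomized-order minimizers compared later in the paper.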

Here, we generalize and improve on the minimizer idea. To avoid dependence on a particular sequence dataset, we seek a set of k-mers that hits every possible sequence of length L. Such a set U_{k,L} is called a universal (k, L) hitting set (UHS). The set of all k-mers is trivially universal, so the challenge is to find a small one.

A small UHS has a variety of applications in speeding up genomic analyses since it can be used where minimizers have been used in the past. For example:

Speeding up read overlapping: naively, one tests O(n^2) pairs of reads to see whether they overlap (where n is the number of reads). Any two reads that overlap in at least L characters must share a k-mer from U_{k,L} in this overlapped region. By bucketing reads into bins according to the universal k-mers they contain, only reads in the same bin need to be compared, and the number of bins is at most |U_{k,L}|.

Sparse suffix arrays: rather than indexing every position of the text, we can instead store only positions where k-mers of U_{k,L} occur. Any query with |q| ≥ L is guaranteed to contain a universal k-mer and can still be located, while the index is substantially smaller.

Bloom filters for sequence search: if only the set of k-mers from U_{k,L} is stored, any window of length ≥ L of a query is guaranteed to contain a stored k-mer, enabling sequence search with a much smaller filter.
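The bucketing idea in the first application can be sketched as follows. This is an illustrative toy (the two-element "universal" set and the reads are our assumptions, not a real U_{k,L}):

```python
from collections import defaultdict

# Illustrative sketch: bucket reads by the universal k-mers they contain.
# Only reads sharing a bin need to be tested for overlap.

def bucket_reads(reads, uhs, k):
    """Map each universal k-mer to the set of read ids containing it."""
    bins = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in uhs:
                bins[kmer].add(rid)
    return bins

uhs = {"ACG", "TTT"}                      # toy stand-in for U_{k,L}
reads = ["AACGT", "TACGG", "GTTTA"]
bins = bucket_reads(reads, uhs, 3)
# reads 0 and 1 share the bin "ACG" and form the only candidate overlap pair
```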

Minimizers have been used for some of these and similar applications.

A small UHS, if it can be found, has a number of advantages over minimizers for these applications:

The set of minimizers for a given collection of reads may be as dense as the complete set of k-mers (of size |Σ|^k for an alphabet Σ), whereas we show that we can often generate much smaller UHSs.

For any sequence of length ≥ L, a hit by the UHS is guaranteed, giving a worst-case guarantee that minimizer schemes do not provide.

The hash buckets, sparse suffix arrays, and Bloom filters created for different datasets will contain a comparable set of k-mers, since the same universal set is used for all of them, making data structures built on different datasets compatible.

One does not need to look at the reads or to build a dataset-specific de Bruijn graph in order to decide which k-mers to use: the universal set depends only on k and L, and can be computed once and reused.

Problem 1 can be rephrased as a problem on the complete de Bruijn graph of order k:

Given the complete de Bruijn graph D_k of order k and an integer L, find a minimum-size set of vertices U_{k,L} such that any path in D_k of length ℓ = L − k passes through at least one vertex of U_{k,L}.

Here and throughout, the length of a path is the number of edges in it; an ℓ-long path thus corresponds to a sequence of length k + ℓ = L.

The software to compute small UHSs is freely available at

Throughout this paper, Σ denotes the alphabet and σ = |Σ| its size; for genomic sequences, Σ = {A, C, G, T} and σ = 4.

There are σ^k vertices in a de Bruijn graph of order k, each representing a unique k-mer. A directed edge connects vertex u to vertex v if the (k − 1)-long suffix of u equals the (k − 1)-long prefix of v; each edge thus represents a unique (k + 1)-mer.

Every path in a de Bruijn graph represents a sequence. A path v_0, e_0, v_1, e_1, v_2, …, v_n of length n represents a sequence s of length k + n such that each k-mer v_i occurs in s and each (k + 1)-mer e_i occurs in s.

We define terminology for k-mers hitting sequences: a k-mer hits a sequence if it occurs in it as a contiguous substring, and a set of k-mers hits a sequence if at least one of its members does.

Σ^L = {s ∣ |s| = L}, i.e., Σ^L is the set of all L-long sequences over the alphabet Σ.

The universal hitting set problem is thus: given k and L, find a minimum-size set of k-mers U_{k,L} which satisfies the condition that U_{k,L} hits every sequence in Σ^L.

It is not known how to efficiently find a minimum universal (k, L) hitting set.

The problem of finding a minimum-size set of k-mers that hits a given set of input sequences is NP-hard (see Appendix), which motivates the heuristics we develop below.

An unavoidable set of constant length k is a set of k-mers such that every infinite sequence contains a word from the set. Equivalently, removing such a set from the complete de Bruijn graph of order k leaves an acyclic graph; such a set is called a decycling set, and Mykkeltveit's algorithm finds a minimum one efficiently.

Unfortunately, finding an unavoidable set is not enough, as there may be long paths remaining in the graph, i.e., sequences of length ≥ L that avoid the decycling set; additional k-mers must be removed to hit them as well.

Our initial algorithm is based on the greedy algorithm for the minimum hitting set problem: repeatedly remove the vertex that hits the largest number of remaining ℓ-long paths.

Specifically, let T(v), the hitting number of vertex v, be the number of ℓ-long paths in the current graph that pass through v.

The calculation of the hitting numbers is done by dynamic programming. Let D(v, i) and F(v, i) be the number of i-long paths starting and ending at v, respectively; then T(v) = Σ_{i=0}^{ℓ} F(v, i) ⋅ D(v, ℓ − i). All values can be computed in O(σ^{k+1} ⋅ ℓ) time, since each of the ℓ levels of the recurrence sums over all σ^{k+1} edges.
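The dynamic program can be sketched as follows. This is a minimal illustration on adjacency lists, not the authors' implementation; the graph representation and function name are our assumptions:

```python
# Illustrative sketch of the hitting-number dynamic program.
# D[i][v] counts i-edge paths starting at v; F[i][v] counts i-edge paths
# ending at v; the hitting number T(v) sums over the ways to split an
# l-edge path at v.

def hitting_numbers(succ, pred, vertices, l):
    D = [{v: 1 for v in vertices}]   # D[0][v] = 1: the empty path at v
    F = [{v: 1 for v in vertices}]
    for i in range(1, l + 1):
        D.append({v: sum(D[i - 1][u] for u in succ[v]) for v in vertices})
        F.append({v: sum(F[i - 1][u] for u in pred[v]) for v in vertices})
    return {v: sum(F[i][v] * D[l - i][v] for i in range(l + 1))
            for v in vertices}

# Toy graph a -> b -> c: the only 2-edge path is a->b->c, and every
# vertex lies on it exactly once.
succ = {"a": ["b"], "b": ["c"], "c": []}
pred = {"a": [], "b": ["a"], "c": ["b"]}
T = hitting_numbers(succ, pred, ["a", "b", "c"], 2)
# T == {"a": 1, "b": 1, "c": 1}
```

Each of the l levels touches every edge once for D and once for F, matching the stated O(σ^{k+1} ⋅ ℓ) bound on the complete de Bruijn graph.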

The full algorithm combines the two steps. First, we find a decycling set in a complete de Bruijn graph of order k using Mykkeltveit's algorithm and remove it from the graph. Then, vertices with maximum hitting number are removed one by one until no ℓ-long path remains.

1: Generate a complete de Bruijn graph G of order k
2: Find a decycling vertex set X using Mykkeltveit's algorithm
3: Remove all vertices in X from G
4: while there is an ℓ-long path in G do
5: Calculate D(v, i) and F(v, i) for all vertices v and 0 ≤ i ≤ ℓ
6: Calculate the hitting number T(v) for every vertex v
7: Remove a vertex with maximum hitting number from G and add it to X
8: end while
9: Output set X

Finding the decycling set takes O(σ^k) time. In the second phase, each iteration calculates the hitting numbers of all vertices in time O(ℓ ⋅ σ^{k+1}), so the total running time is O(ℓ ⋅ σ^{k+1}) per removed vertex.

The exponential dependence of DOCKS on k, together with the recomputation of all hitting numbers after every removal, limits the range of k and L for which it is practical.

In order to extend the range of k and L to which the method applies, we developed a faster heuristic variant, DOCKSany, and its generalization DOCKSanyX.

DOCKSany has the same structure as DOCKS, but with one difference: it removes a node that participates in the maximum number of paths of any length, rather than of ℓ-long paths only. Once the decycling set is removed, the remaining graph is acyclic, and its vertices can be topologically sorted v_1 ≤ … ≤ v_n. Define D'(v) and F'(v) as the number of paths of any length starting and ending at v, respectively.

A vertex v then participates in T'(v) = D'(v) ⋅ F'(v) paths, and D' and F' can each be computed in a single pass over the topological order.
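The two passes can be sketched as follows (an illustrative toy under our own naming, counting single-vertex paths as the "empty" path at each endpoint):

```python
# Illustrative sketch: per-vertex counts of paths of any length in a DAG.
# D2[v] = number of paths starting at v (including the path of v alone);
# F2[v] = number of paths ending at v; their product counts paths through v.

def any_path_counts(succ, pred, topo):
    D2, F2 = {}, {}
    for v in reversed(topo):                     # sinks first
        D2[v] = 1 + sum(D2[u] for u in succ[v])
    for v in topo:                               # sources first
        F2[v] = 1 + sum(F2[u] for u in pred[v])
    return {v: D2[v] * F2[v] for v in topo}

# Toy DAG a -> b -> c: the middle vertex b lies on the most paths
# ([b], [a,b], [b,c], [a,b,c]), so T'(b) = 4.
succ = {"a": ["b"], "b": ["c"], "c": []}
pred = {"a": [], "b": ["a"], "c": ["b"]}
Tprime = any_path_counts(succ, pred, ["a", "b", "c"])
# Tprime == {"a": 3, "b": 4, "c": 3}
```

Both passes touch each edge once, so one iteration costs time linear in the number of edges, O(σ^{k+1}) on the complete de Bruijn graph.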

1: Generate a complete de Bruijn graph G of order k
2: Find a decycling vertex set X using Mykkeltveit's algorithm
3: Remove all vertices in X from G
4: while there is an ℓ-long path in G do
5: Calculate D'(v) and F'(v) for all vertices v
6: Calculate the number of paths T'(v) = D'(v) ⋅ F'(v) through each vertex v
7: Remove a vertex v with maximum T'(v) from G and add it to X
8: end while
9: Output set X

Computing D' and F' takes O(σ^{k+1}) time, as does computing T' from them. Computing the longest path in a DAG (step 4) also requires O(σ^{k+1}) time. Thus, each iteration takes O(σ^{k+1}) time, a factor of Θ(ℓ) less than the O(ℓ ⋅ σ^{k+1}) per iteration for DOCKS.

In addition to shorter runtimes and decreased memory usage, this heuristic offers one more advantage over the original DOCKS algorithm. The vertex removal choice is independent of ℓ, so a single run can output hitting sets for a whole range of L values: the removal order is identical for all of them, and one only records the point at which the longest remaining path drops below each ℓ.

Finally, in order to calculate the hitting set for even larger k and L, we developed DOCKSanyX, which removes the X vertices with the largest values of T'(v) in each iteration, trading a modest increase in set size for fewer iterations.

To investigate whether optimal solutions can be found practically, we formulated the problem of finding a minimal universal hitting set as an integer linear program (ILP). There are σ^k binary variables x_i representing whether vertex v_i is removed, and σ^k variables z_i representing an upper bound on the number of edges in the longest remaining path ending at vertex v_i. The constraints on the z_i guarantee that the vertices chosen remove all ℓ-long paths.

Here E is the set of σ^{k+1} possible edges. The constraint on edge (u, v) requires that if neither endpoint is removed, then z_v ≥ 1 + z_u; combined with the bound z_v ≤ ℓ − 1, this forbids any surviving path with ℓ edges. The validity of this formulation is proven in the Appendix (see Appendix, Subsection Validity of the ILP formulation).
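The formulation described above can be written out as follows; this is a plausible reconstruction consistent with the description, and the exact constants in the paper may differ:

```latex
\begin{aligned}
\min \quad & \textstyle\sum_{v} x_v \\
\text{s.t.} \quad & z_v \;\ge\; 1 + z_u - \ell\,(x_u + x_v) && \forall (u, v) \in E,\\
& 0 \;\le\; z_v \;\le\; \ell - 1, \qquad x_v \in \{0, 1\} && \forall v.
\end{aligned}
```

When x_u = x_v = 0, the edge constraint forces z to increase by at least one along every surviving edge, so an ℓ-edge path of unremoved vertices would force some z_v ≥ ℓ, contradicting z_v ≤ ℓ − 1. When either endpoint is removed, the right-hand side is at most 0, and the constraint is deactivated.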

The number of variables and constraints grows exponentially in k, so the ILP can be solved to optimality only for small problem instances.

The DOCKS variants described above have exponential dependence on k, limiting the values of k for which they can be run. To reach larger k, we use the following extension: given a universal set U_{k,L} and an integer j ≥ 1, we can construct a universal set U_{k+j,L+j} by concatenating all possible j-mers to the k-mers of U_{k,L}. Formally, U_{k+j,L+j} = {wu : w ∈ U_{k,L}, u ∈ Σ^j}.
U_{k+j,L+j} is a universal (k + j, L + j) hitting set: any (L + j)-long sequence contains an occurrence of some w ∈ U_{k,L} starting within its first L − k + 1 positions, and U_{k+j,L+j} contains all (k + j)-mers that start with w.

For example, by appending all possible 10-mers to each 10-mer in U_{10,20} we obtain U_{20,30}. If the size of the set U_{10,20} is |U_{10,20}| = α ⋅ 4^{10}/10, where α is its approximation factor relative to the decycling-set lower bound, then the size of U_{20,30} is |U_{20,30}| = α ⋅ 4^{20}/10 ≈ 2α ⋅ 4^{20}/20, i.e., the approximation factor doubled.
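The extension step itself is simple set arithmetic, sketched below. The toy input set is ours and is not a real universal set; the example only illustrates the construction and the resulting size |U_{k,L}| ⋅ σ^j:

```python
from itertools import product

# Illustrative sketch of the extension step: appending every j-mer to every
# k-mer of a universal (k, L) set yields a universal (k+j, L+j) set.

def extend_uhs(uhs, j, alphabet="ACGT"):
    return {w + "".join(u) for w in uhs for u in product(alphabet, repeat=j)}

u = {"ACG", "TTT"}        # toy k-mer set (not actually universal)
u2 = extend_uhs(u, 2)     # 2 * 4^2 = 32 distinct 5-mers
```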

For a given k and L, how small can a universal hitting set be? We derive a lower bound on its size.

A universal hitting set necessarily contains a decycling set, since an infinite sequence avoiding it would in particular contain unhit L-long subsequences; hence any U_{k,L} has a size ≥ the minimum decycling set size, which is approximately σ^k/k.

ℓ_max, the length of the longest sequence remaining in a complete de Bruijn graph after a minimum decycling set computed using Mykkeltveit's algorithm is removed, for 2 ≤ k ≤ 14.

For each value of k, the length ℓ_max of the longest remaining sequence, represented as a longest path, was calculated.

| k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ℓ_max | 5 | 11 | 20 | 45 | 70 | 117 | 148 | 239 | 311 | 413 | 570 | 697 | 931 |

We implemented and ran DOCKS over a range of values of k and L.

The results are summarized in

For different combinations of k and L, the plot shows the size of the generated universal hitting set as a fraction of the total number of k-mers, σ^k. The broken lines show the decycling set size for each k.

We ran DOCKSany for 5 ≤ k and values of L beyond the practical reach of DOCKS.

The histogram shows the size of the universal sets generated by DOCKS, DOCKSany, and DOCKSanyX with several values of X, for a fixed combination of k and L.

We tested the performance of DOCKSanyX for X = 2^i for a range of exponents i ≥ 0.

We solved the ILP using Gurobi 6.5.2.

For each combination of 5 ≤ k and L small enough for the ILP to be tractable, we compared the optimal solution size to the sets produced by the heuristics.

The minimizer algorithm selects, in each window of L − k + 1 consecutive k-mers, the smallest k-mer under a given order. We compared three orders: lexicographic, randomized, and one preferring DOCKS k-mers, measuring the number of selected k-mers and the average distance between adjacent selected positions on several genomes.

The genome sizes are quoted after removing all ambiguous (N) characters.

| Species | Genome size (Mbp) | Method | # mers (thousands) | avg. dist. |
|---|---|---|---|---|
| | 0.393 | lexicographic | 32.9 | 9.48 |
| | | randomized | 28.0 | 11.0 |
| | | DOCKS | 23.7 | 12.4 |
| | 4 | lexicographic | 114.0 | 10.2 |
| | | randomized | 89.6 | 11.0 |
| | | DOCKS | 66.0 | 12.4 |
| | 100 | lexicographic | 286.0 | 8.83 |
| | | randomized | 277.0 | 11.0 |
| | | DOCKS | 145.0 | 12.4 |
| | 2900 | lexicographic | 543.0 | 9.13 |
| | | randomized | 389.0 | 10.9 |
| | | DOCKS | 154.0 | 12.1 |

We presented the DOCKS algorithm, which generates a compact set of k-mers that hits every L-long sequence, i.e., a universal hitting set.

We see the benefit of our compact UHSs in many data structures and algorithms that analyze high-throughput sequencing data. For example, we expect that binning-based overlap detection will produce fewer and more balanced bins when the bins are keyed by universal k-mers instead of minimizers.

The good performance of the algorithms can be attributed to their two-phase approach. In the first phase, we optimally and rapidly remove a minimum-size set that hits all infinite sequences, which also takes care of many L-long sequences; in the second phase, we greedily remove the few additional k-mers needed to hit the remaining L-long paths.

We developed two additional variants of DOCKS that reduce the runtime and memory usage at the price of increasing the size of the set created. DOCKS can provide a solution for small k; DOCKSany and DOCKSanyX extend the feasible range of k and L, with DOCKSanyX trading additional set size for further speed.

Our approaches are heuristic in nature. This is not surprising since, as we show, the problem of finding a minimum k-mer set hitting a given set of sequences is NP-hard.

Our study raises several open problems. First, is there an efficient algorithm, or a structural characterization, for a minimum universal (k, L) hitting set? Second, how close to the σ^k/k decycling lower bound can the size of U_{k,L} be?

We demonstrated the ability of DOCKS to generate compact sets of universal k-mers, which can be used in place of minimizers in a broad range of sequence analysis tasks.

The table contains solution set size, time in seconds, and memory in KB for DOCKS, DOCKSany, DOCKSanyX, and the greedy approach. Note that the reported times are for individual runs of each (k, L) combination.

(XLSX)

(PDF)

Part of this work was done while Y.O., R.S. and C.K. were visiting the Simons Institute for the Theory of Computing.