
Conceived and designed the experiments: BB. Performed the experiments: BB. Analyzed the data: BB. Contributed reagents/materials/analysis tools: BB. Wrote the paper: BB.

BB is an employee of Constellation Pharmaceuticals. This does not alter the author's adherence to all the PLoS ONE policies on sharing data and materials.

In living cells, DNA is packaged along with protein and RNA into chromatin. Chemical modifications to nucleotides and histone proteins are added, removed and recognized by multi-functional molecular complexes. Here I define a new computational model, in which chromatin modifications are information units that can be written onto a one-dimensional string of nucleosomes, analogous to the symbols written onto cells of a Turing machine tape, and chromatin-modifying complexes are modeled as read-write rules that operate on a finite set of adjacent nucleosomes. I illustrate the use of this “chromatin computer” to solve an instance of the Hamiltonian path problem. I prove that chromatin computers are computationally universal – and therefore more powerful than the logic circuits often used to model transcription factor control of gene expression. Features of biological chromatin provide a rich instruction set for efficient computation of nontrivial algorithms in biological time scales. Modeling chromatin as a computer shifts how we think about chromatin function, suggests new approaches to medical intervention, and lays the groundwork for the engineering of a new class of biological computing machines.

Computer programs and logic circuits have often been used as metaphors for the function of cells

A computer implements a set of rules that operate on memory. A formal definition of computation was invented by Turing, whose theoretical machine could read and write symbols on an infinitely long tape according to a finite set of rules

Several authors have shown that DNA can be used to simulate a Turing machine

In 1994, Adleman made headlines with a DNA computer that solved an instance of the NP-complete Hamiltonian path problem

Other forms of biomolecular computation include chemical kinetics, membrane computing, pi-calculus and the blob model

Transcription factor control over gene expression is often expressed as a logic circuit: a combination of AND, OR and NOT operations on Boolean values – consider, for example, the lac repressor

Thus, existing models of biological computation are either powerful computationally but impractical, or not universal – and in neither case are they easy to program.

In living cells, DNA is packaged along with protein and RNA into chromatin. DNA methylation has long been associated with control, and particularly repression, of gene transcription

Chromatin-reading and -writing proteins operate as components of molecular complexes that read and write multiple marks in a combinatorial fashion. These complexes often include transcription factors that recognize specific DNA sequences, as well as effector units that carry out gene transcription or other functions, and scaffolding proteins or RNA to bring the right components together into the complex. The phenomenon of engaging multiple marks at once is often referred to as “multivalency” of chromatin modifiers, or “cross-talk” between combinatorial marks

These rules may operate sequentially on chromatin at a particular location. For example, in animal development, the DNA methylation pattern is erased in the early embryo

Here I present a new computational system, in which chromatin is the writable memory and chemical modifications are the written symbols. Read-write rules model the molecular complexes that recognize and place specific combinations of DNA and histone modifications. The formalism can be easily “programmed” to solve problems such as the NP-complete Hamiltonian path problem, either by the same massively parallel guess-and-check approach of Adleman, or by a more deterministic algorithm that traverses the search tree, with backtracking.

I prove that chromatin computers are Turing-complete by using one to simulate a Turing machine. The mapping to a Turing machine is not forced, but uses components whose complexity is no greater than that of biological chromatin. I implement a script to simulate execution of chromatin computer programs. I show that biological chromatin has many features that provide computational efficiency, such as parallelism, nondeterminism, addressable memory, modification of the program during computation, and topological shortcuts. The chromatin computer formalism is thus both a natural model of biological chromatin, and a powerful language in which to write computer programs.

A chromatin computer (CC) has a set of read-write rules that operate non-deterministically on

A CC is defined by the tuple

The CC operates non-deterministically on an input chromatin configuration, which is marked everywhere by B, except for a finite number of nucleosomes which may have other marks. At each step, the read portion of zero or more rules will match at various locations along the chromatin tape. One matching rule is selected at random and applied to update the modifications at that location. If no rule matches at any location on the chromatin, then the CC halts.
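The execution loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's simulator: the tape is a finite list of nucleosomes (tuples of marks) rather than an unbounded blank-filled tape, and the wildcard `*` (read any mark) and `-` (leave the mark unchanged) are assumed conventions for writing rules.

```python
import random

def matches(pattern, window):
    """True if every mark in `window` satisfies `pattern`; '*' reads any mark."""
    return all(p == "*" or p == m
               for pat_nuc, nuc in zip(pattern, window)
               for p, m in zip(pat_nuc, nuc))

def rewrite(template, window):
    """Apply the write side of a rule; '-' leaves the existing mark unchanged."""
    return [tuple(m if t == "-" else t for t, m in zip(tmpl_nuc, nuc))
            for tmpl_nuc, nuc in zip(template, window)]

def step(tape, rules, rng):
    """Apply one randomly chosen matching rule in place.

    Returns False when no rule matches anywhere, which means the CC halts.
    """
    candidates = []
    for read, write in rules:
        k = len(read)  # number of adjacent nucleosomes the rule spans
        for i in range(len(tape) - k + 1):
            if matches(read, tape[i:i + k]):
                candidates.append((i, write))
    if not candidates:
        return False
    i, write = rng.choice(candidates)
    tape[i:i + len(write)] = rewrite(write, tape[i:i + len(write)])
    return True

def run(tape, rules, seed=0, max_steps=10_000):
    """Run until the CC halts (or max_steps); return the final tape and step count."""
    rng = random.Random(seed)
    tape = list(tape)
    steps = 0
    while steps < max_steps and step(tape, rules, rng):
        steps += 1
    return tape, steps
```

As a usage example, a single hypothetical rule that spreads a mark `A` onto a blank right-hand neighbor, `((("A",), ("B",)), (("-",), ("A",)))`, turns the tape `[("A",), ("B",), ("B",), ("B",)]` into four `A`-marked nucleosomes in three steps and then halts because no rule matches.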

The left hand side of each rule is a read specification for all or some of the marks at

Chromatin consisting of nucleosomes that each have

This diagram illustrates the operation of the rule

In 1994 Adleman created a DNA-based solution to an instance of the Hamiltonian path problem. Let us tackle the same problem, shown in

Figure from Adleman 1994 (5). In the pictured directed graph, there is a unique Hamiltonian path from vertex 0 to vertex 6: 0

The Hamiltonian path problem asks whether there exists a path in a directed graph from the input vertex to the output vertex, visiting each of the other vertices exactly once. Adleman synthesized 20-mer oligonucleotides representing the vertices and edges in the graph. The sequence of an edge's 20-mer was complementary to the appropriate halves of its two vertices' 20-mers. These 20-mers were mixed together and ligated, resulting in double-stranded DNA representing valid paths through the graph. Further sizing and affinity purification steps ensured that each node was represented once and only once in the soup of path-representing oligonucleotides. The sequence of nodes in the correct path was determined using PCR and running the products on a gel. The number of starting 20-mers was large enough that production of the correct path was highly likely.

Our first implementation of a solution to this problem using a chromatin computer will employ a similar guess-and-check approach, by randomly constructing many paths of up to 7 nodes, and signaling success only for a path meeting the requirements. The solution uses 6-chromatin: each nucleosome has six read/write positions. Each rule looks at two adjacent nucleosomes, and there are 10 possible marks, so the CC is a (10,6,2)-CC. One position in each nucleosome represents the vertex number, and the remaining five are used to check that the path contains one and only one visit to each vertex.
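The guess-and-check strategy itself, independent of the chromatin encoding, can be sketched as follows. The edge set below is a hypothetical 7-vertex directed graph chosen for illustration; it is not the graph from Adleman's figure, and the function names are assumptions.

```python
import random

# Hypothetical directed graph on vertices 0..6 (NOT Adleman's edge set).
EDGES = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
         (0, 3), (1, 4), (2, 5), (3, 1), (4, 2), (6, 0)}
N = 7
START, END = 0, 6

def random_path(rng):
    """Guess: extend a path along randomly chosen edges, up to N vertices."""
    path = [START]
    while len(path) < N:
        successors = sorted(v for (u, v) in EDGES if u == path[-1])
        if not successors:
            break
        path.append(rng.choice(successors))
    return path

def is_hamiltonian(path):
    """Check: correct endpoints and every vertex visited exactly once."""
    return (len(path) == N and path[0] == START and path[-1] == END
            and sorted(path) == list(range(N)))

def guess_and_check(trials=100_000, seed=0):
    """Return the first randomly built path that passes the check, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        path = random_path(rng)
        if is_hamiltonian(path):
            return path
    return None
```

In the chromatin version, the guessing corresponds to rules that write vertex numbers onto successive nucleosomes, and the check corresponds to the five bookkeeping positions. This sketch collapses the check into a single function.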

(A) Application of a rule to the starting configuration. The chromatin tape is shown as a set of 7 nucleosomes, each with 6 writable positions. The top row shows the initial tape configuration; the bottom row shows the configuration after the application of the rule

Additional explanation, the full rule set and a Perl script to simulate the chromatin computer are provided in

A Turing machine is defined by its finite set of rules; each rule specifies a mapping from a symbol and state to a new symbol, a new state, and a movement left or right along the memory tape. A configuration of a Turing machine comprises a machine state, a location of a read/write head on the infinite memory tape, and the contents of the tape. Initially, the tape is blank except for symbols written at a finite number of cells. At each step in the computation, the rule corresponding to the symbol at the current tape cell and the current machine state is applied, and specifies the writing of a new symbol at the current tape cell, a new machine state, and a movement left or right along the tape. If no rule applies, the machine halts.
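A minimal sketch of this definition in Python, assuming a deterministic machine, a sparse dictionary for the tape, and `B` as the blank symbol. The function name and rule format are illustrative choices, not taken from the paper.

```python
def run_turing_machine(rules, tape, state, head=0, blank="B", max_steps=10_000):
    """Run a deterministic Turing machine on a non-empty input tape.

    rules maps (state, symbol) -> (new_symbol, new_state, move),
    where move is 'L' or 'R'. The machine halts when no rule applies.
    """
    cells = dict(enumerate(tape))  # sparse tape: unwritten cells read as blank
    for _ in range(max_steps):
        key = (state, cells.get(head, blank))
        if key not in rules:
            break  # no rule applies: halt
        symbol, state, move = rules[key]
        cells[head] = symbol
        head += 1 if move == "R" else -1
    lo, hi = min(cells), max(cells)
    written = "".join(cells.get(i, blank) for i in range(lo, hi + 1))
    return written, state, head
```

For example, with the two hypothetical rules `{("q0", "x"): ("z", "q1", "R"), ("q1", "y"): ("z", "q2", "L")}`, the input `"xy"` in state `q0` is rewritten to `"zz"`, and the machine halts in state `q2` with the head back on cell 0.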

To prove that a chromatin computer can simulate a Turing machine, I define a reversible mapping from any Turing machine to a chromatin computer, and from each Turing configuration to a chromatin configuration. I then show by induction that running the chromatin computer results in a chromatin configuration that maps back to the Turing configuration that would have been achieved by running the Turing machine, and that the chromatin computer halts whenever the Turing machine halts. The trick to the mapping is to transform each Turing tape cell to a nucleosome, with extra nucleosome positions to store the current state and the current location of the read/write head. Moving left or right along the Turing tape is accomplished on the chromatin by moving these state and head-location marks to adjacent nucleosomes.
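The rule translation can be sketched as follows, as one reading of the description above rather than the paper's exact rule set. Each nucleosome is a tuple (head mark, state mark, tape symbol), with `H` marking the head and `B` a blank. The wildcard `*` and the keep-mark `-` are assumed conventions, and the tape is finite, so a move past either end simply finds no matching rule.

```python
def tm_to_cc(tm_rules):
    """Translate {(state, symbol): (new_symbol, new_state, move)}
    into rules that each read and write two adjacent nucleosomes."""
    cc_rules = []
    for (q1, x1), (x2, q2, move) in tm_rules.items():
        at_head = ("H", q1, x1)     # nucleosome currently under the head
        neighbor = ("B", "B", "*")  # adjacent nucleosome, any tape symbol
        old_cell = ("B", "B", x2)   # head and state marks cleared, x2 written
        new_head = ("H", q2, "-")   # head and state marks placed, symbol kept
        if move == "R":
            cc_rules.append(((at_head, neighbor), (old_cell, new_head)))
        else:
            cc_rules.append(((neighbor, at_head), (new_head, old_cell)))
    return cc_rules

def find_match(chromatin, read):
    """Index of the first two-nucleosome window matching `read`, or None."""
    for i in range(len(chromatin) - 1):
        window = chromatin[i:i + 2]
        if all(p == "*" or p == m
               for pat, nuc in zip(read, window) for p, m in zip(pat, nuc)):
            return i
    return None

def run_cc(chromatin, cc_rules, max_steps=10_000):
    """Apply matching rules until none matches anywhere (the CC halts)."""
    chromatin = list(chromatin)
    for _ in range(max_steps):
        for read, write in cc_rules:
            i = find_match(chromatin, read)
            if i is not None:
                chromatin[i:i + 2] = [
                    tuple(m if t == "-" else t for t, m in zip(tmpl, nuc))
                    for tmpl, nuc in zip(write, chromatin[i:i + 2])]
                break
        else:
            return chromatin  # no rule matched: halt
    return chromatin
```

With the hypothetical Turing rules `{("q0", "x"): ("z", "q1", "R"), ("q1", "y"): ("z", "q2", "L")}` and the starting chromatin `[("H", "q0", "x"), ("B", "B", "y")]`, `run_cc` halts at `[("H", "q2", "z"), ("B", "B", "z")]`: the tape reads "zz", the state mark is `q2`, and the head mark sits on the first nucleosome.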

(A) Turing machine finite state machine with three rules that rewrite the string “xy” to “zz”. (B) The corresponding chromatin computer. The first position in each 3-position nucleosome corresponds to the location of the Turing read/write head. The second position corresponds to the state of the Turing machine. The third position corresponds to a cell on the Turing tape.

Computer scientists have developed many models of computation that are far more efficient and easier to program than the basic single-head deterministic Turing machine. These variations are no more powerful than a Turing machine from a computability standpoint. Non-deterministic Turing machines allow more than one rule with the same left-hand (read) side, and therefore many possible computational paths. Parallel Turing machines have multiple read-write heads, all operating simultaneously. Multi-tape Turing machines have several tapes and corresponding read/write heads. Random access machines allow incrementing and decrementing values in addressable registers. Indirect addressing allows a memory address to be operated on as data. Stored procedure models allow the program itself to be specified as input. Modern computer programming languages are no more capable than a Turing machine of solving a problem, but they can be programmed far more easily, and use fewer computational steps.

Just as real computer languages are more practical than Turing machines, biological chromatin implements many efficiencies either available in our initial CC model, or easily added to it. These efficiencies are powerful; they are exploited by living cells and make programming a simulated CC much easier. Some are familiar concepts from computer science; others are less familiar and quite interesting as computational tricks.

The CC model is nondeterministic, although any particular CC may, by virtue of its rule set, be deterministic. The CC formulation prompts the question of whether, in a cell, more than one expressed chromatin-modifying complex could match and operate at a given location in a particular configuration of biological chromatin, or whether the design is deterministic. To behave in a consistently repeatable way, biological computation has probably constrained non-determinism: a given starting chromatin configuration with a given rule set is likely to evolve in a fairly consistent manner across repeated runs, even if the precise order of rule application at different locations changes from one run to another. This will be an interesting area for future work in modeling biological chromatin-modifying complexes.

There are many copies of chromatin-modifying complexes present in the cell, and they operate in parallel throughout the genome. Parallel rule application is readily handled by modifying the definition of a CC to allow not just one, but any number of non-overlapping, matching rules to be applied at each step. To capture the number of physical copies of a complex, the definition of

Some chromatin-modifying complexes, such as those containing RNA polymerase, are known to operate sequentially along the genome. While this can be programmed in our current CC model by having a special mark representing the current location of a rule, it can also be efficiently handled by augmenting the model to allow the right-hand side of a rule to have an additional field for movement: one of {left, right, disengage}. “Disengage” indicates that the rule would not subsequently be applied to the adjacent chromatin position; “left” and “right” indicate an immediate application to the neighboring position. With this notion of walking along the chromatin tape, we have resurrected the left and right movement of the Turing head in the Turing machine system.
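A simplified sketch of the movement field, assuming one mark per nucleosome and rules that read a single nucleosome. The marks `M` and `S` and the function name are hypothetical.

```python
def walk(tape, rules, start, max_steps=10_000):
    """Apply a walking rule set starting at index `start`.

    rules maps a mark -> (new_mark, move), where move is one of
    'left', 'right' or 'disengage'.
    """
    tape, pos = list(tape), start
    for _ in range(max_steps):
        if not 0 <= pos < len(tape) or tape[pos] not in rules:
            break  # walked off the modeled region, or nothing matches here
        tape[pos], move = rules[tape[pos]]
        if move == "disengage":
            break  # the complex leaves the chromatin
        pos += 1 if move == "right" else -1
    return tape
```

For instance, a polymerase-like rule set `{"B": ("M", "right"), "S": ("S", "disengage")}` started at the left end of `"BBBSB"` marks the first three nucleosomes with `M`, disengages at the stop mark `S`, and leaves the final nucleosome untouched.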

Chromatin is known to form loops, allowing fairly distant regions along a chromosome to come into physical contact

Transcription factors are proteins that bind specific DNA sequences and carry out actions including chromatin modification, recruitment of additional proteins, and activation or suppression of gene expression. Transcription factors are easily modeled in our existing CC formalism as rule components that read marks corresponding to the DNA sequence co-localized with a nucleosome. The chromatin tape is initialized at each nucleosome with read-only marks representing the DNA sequence. Transcription factor binding site recognition is analogous to a “GOTO” instruction referencing an addressable memory cell in a random-access computer, which is a huge efficiency in programming. (One difference is that a transcription factor binding sequence, while rare, usually occurs multiple times in a genome, which retains an element of parallelism.)

Importantly, transcription factors alone would be insufficient to implement our mapping to a Turing machine, because they lack the ability to write to the chromatin tape.

A transcription factor of particular interest is CTCF. CTCF is known to have a role both in looping and as an insulator stopping the spread of marks along chromatin

Nucleosome remodelers remove, replace and shift histone octamers along the genome. Removal and replacement can easily be modeled with our existing CC as rules that simply change the marks on a CC nucleosome. If we are modeling DNA sequence for transcription factor binding, then nucleosome shifting relative to that sequence can be modeled in a number of ways. For example, a straightforward modification of the CC model would accommodate a second, read-only, tape for the DNA sequence with an alignment to the nucleosome tape. Rule functionality can be expanded to allow local changes in the alignment.

The Turing machine does not formally produce output beyond halting in a final state. In practical applications of Turing machines or their variants, the symbols on the tape are usually read after the computation halts, providing useful output from the computation. In a synthetic implementation of a CC, it might be useful to read the chromatin marks after a computation has been carried out. Another possible readout is gene “expression”, implemented by a rule that reports the occurrence of an expression event from a particular chromatin tape location. The CC formalism is easily augmented to accommodate gene expression: the right-hand side of each rule includes an output symbol corresponding to the genomic location of the chromatin.

Cell signaling changes chromatin state. A typical signaling cascade starts with binding of an extracellular ligand to a surface receptor, then transfers information via phosphorylation of a cascade of kinases; ultimately a transcription factor binds DNA and recruits additional complex proteins to effect a change in chromatin state and gene expression. In the CC model, this corresponds to a change of the program (or rule set), adding a rule involving the transcription factor complex.

Stored procedure computers, or universal Turing machines, store the programming instructions on the input tape instead of hard-coding them into the rules. The hard-coded rules interpret and execute the instructions (the “software”) on the tape. Since chromatin computers can simulate any Turing machine, they can simulate universal Turing machines. But the gene expression augmentation to the CC formalism provides a natural biological model for stored programs. The CC rules represent chromatin modifying complexes, which are collections of expressed gene products. Then biological CC rules are, indeed, written in the input chromatin: the gene products self-organize into new rules – and these rules in turn change chromatin state and gene expression. Thus biological chromatin is not only a stored procedure computer, but a self-modifying stored procedure computer.

DNA, along with some of the associated chromatin marks, is replicated when cells divide. Copying of chromatin state is readily modeled in the CC formalism; however, a more convenient addition to the formalism is creation of a copy of the current tape, analogous to a multi-tape Turing machine.

Biological chromatin plays a complex role in cell biology. Many of the features of chromatin-interacting factors can be modeled as efficiency-gaining instructions in the CC programming toolbox. I have mentioned some of them above; there are more. None of these features invalidate the powerful Turing completeness result that rests on the model of a linear array of writable nucleosome positions operated on by a finite rule set.

| Computational concept | Biological equivalent |
|---|---|
| Writeable memory | Chromatin with chemical modifications |
| Read-write rules | Chromatin-modifying complex (CMC) |
| Parallel computer | Multiple copies of CMCs |
| Non-determinism | Different CMCs that read the same chromatin configuration |
| Addressable memory | Transcription factors binding specific DNA sequences |
| Output | Gene expression or chromatin configuration |
| Stored procedures | Genes coding for CMC components |
| Self-modifying code | Changing expression of genes coding for CMC components |

To ask whether biological chromatin has the memory, rule set and speed capacity to carry out interesting computations, we can start from known biology. In

How rich are the programs that operate on that memory?

RNA polymerase II is a protein complex that transcribes DNA. Associated with polymerase function are factors that mark histones – for example, methylation of H3K4 and H3K36. Let us therefore take RNA polymerase II as one example of a chromatin-modifying complex and consider the rate at which it operates in the cell. One complex transcribes up to 90 nucleotides per second

An alternate calculation starts from the assumption that an average read/write operation might take 1 second, and that at any point in time 1% of the cell's 10,000,000 nucleosomes might be engaged at position 1 of a read/write complex. This gives an estimate of 0.01 × 10,000,000 = 100,000 operations per second, or 0.1 MHz.

Our lower estimate of the compute power of biological chromatin, then, gives us hundreds of different rules operating on at least 80 megabytes of memory at a minimum of 10,000 operations per second. While a living cell may not use this capability to its fullest, it represents enormous capacity for information processing on biological time scales.

Each human cell contains at least 80 megabytes of writeable chromatin (see

Modeling chromatin as a computer suggests a number of lines of inquiry. DNA methylation is erased in the early embryo; does this serve a similar function to rebooting a computer – resetting memory to enable restarting of programs? Genes with variable expression tend to have nucleosome-free regions (NFRs) further upstream of their transcription start site than constitutively expressed genes

With the intensive level of research in chromatin biology along with genome-wide tools to elucidate complexes, enzymatic function and chromatin occupancy, we may soon have enough information about real complexes and the behavior of their component readers and writers to simulate the chromatin computation that occurs in cells, and to learn some of the pieces of data that we are still missing.

The idealized chromatin model may serve as a starting point for a new way of building a DNA-based computer, using chromatin modifications for a read/write machine. A chromatin computer would operate on a fixed DNA sequence, and use histone and nucleotide modifications as the writable symbols. To engineer a chromatin computer based on this insight, the rules would be implemented in designer chromatin modification complexes built from naturally occurring parts (protein domains). In an early proof of concept, researchers used human polycomb chromatin protein and homologs from other species to construct modular synthetic transcription factors that recognize H3K27me3 and switch silenced genes on

The semantics and the value of the histone code concept have been the subject of debate, in particular the question of whether histone modifications carry much useful information beyond what can be inferred from transcription factor logic

Here I prove that a chromatin computer can compute any computable function. I do this by defining a reversible mapping from any Turing machine to a chromatin computer, and from each Turing configuration (called an Instantaneous Description by Hopcroft and Ullman

A Turing machine is defined by the 7-tuple

A Turing machine operates on an infinitely long tape. Each location (cell) on the tape contains a symbol. A configuration of the Turing machine captures all the information that changes as the Turing machine executes its program: the configuration includes symbols written in each cell on the tape, the position of the read/write head, and the current state. We will write a Turing machine configuration as

In the initial configuration, a finite number of cells on the tape can be written with symbols from

At each step in the computation, the matching rule from

To prove that a chromatin computer can compute anything computable by a deterministic Turing machine, we will show that any Turing machine can be mapped to a chromatin computer, and that the computation performed by the chromatin computer results in a final configuration that can be uniquely mapped back to the final configuration that would be achieved by the Turing machine.

First we map a Turing configuration (tape symbols, head location and state) to 3-chromatin. Each cell of the Turing tape is mapped to one nucleosome. The three positions of each nucleosome will be used as follows:

Position 1 indicates the position of the read/write head on the Turing tape and may take one of two values:

Position 2 is blank except when the Turing tape head is at the corresponding position on the Turing tape. In that case, Position 2 contains a mark representing the Turing machine state.

Position 3 contains a mark representing the Turing tape symbol written at that location.
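The three-position encoding above, and its inverse, can be sketched as follows. The mark names `H` (head present) and `B` (blank) are assumed here from the rule notation used elsewhere in this section, tape symbols are assumed to be single characters, and the tape is a finite string.

```python
HEAD, BLANK = "H", "B"

def encode(tape, state, head):
    """Map a Turing configuration to a list of 3-position nucleosomes:
    (head mark, state mark, tape symbol)."""
    return [(HEAD, state, sym) if i == head else (BLANK, BLANK, sym)
            for i, sym in enumerate(tape)]

def decode(chromatin):
    """Recover (tape, state, head) from a Turing-mappable configuration."""
    heads = [i for i, nuc in enumerate(chromatin) if nuc[0] == HEAD]
    if len(heads) != 1:
        raise ValueError("expected exactly one head-marked nucleosome")
    head = heads[0]
    tape = "".join(nuc[2] for nuc in chromatin)
    return tape, chromatin[head][1], head
```

Because exactly one nucleosome carries the head mark, `decode(encode(tape, state, head))` returns the original configuration. That round trip is the sense in which the mapping is reversible.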

This mapping is reversible: the single nucleosome marked with

A “

The first position of every nucleosome is marked with

The second position of every nucleosome is marked with _{CC}

The third position of every nucleosome is marked with an element of

We write a Turing-mappable CC configuration as

Note that while CCs in general are non-deterministic, a deterministic Turing machine maps to a deterministic CC: if there is only one applicable Turing rule for a given configuration, then that translates to exactly one applicable rule in the CC.

The chromatin computer implementing the Turing machine is specified as follows:

The transition function

Each “move left” Turing machine rule

_{1}x_{1} BB*_{2}- BBx_{2} ---

Each “move right” Turing machine rule

_{1}x_{1} BB*_{2} Hq_{2}-

We show by induction that at each step of the CC computation, the configuration of the chromatin tape is Turing-mappable and is isomorphic to the state of the Turing tape after the same number of Turing machine steps, and that the CC will halt if and when the Turing machine halts.

The base case is the isomorphism between the initial configurations of the machines.

For the induction, we assume that after ^{th} step, the configurations remain isomorphic. Assume that the Turing rule that applies at this step is _{1}x_{1} BB*_{2}- BBx_{2} ---

The only symbol to change on the Turing tape is the symbol in the cell at the read/write head (position ^{rd}-position mark to change is the one having the ^{th} nucleosome marked with

The read/write head position moves from

The Turing machine state changes from

Finally, we show that the CC halts when the TM halts. The TM halts when a TM rule is applicable that moves the TM to a state in


Thanks are due to David Yee and Rich Ferrante for multiple rounds of feedback and discussion. I am grateful to my Constellation colleagues who taught me about chromatin, and to David Allis, Keith Robison, and Sebastian Hoersch, for discussions, reading the manuscript and giving useful feedback. Thanks also to Phil Bourne, Larry Hunter, Adam Rudner, Yang Shi, Greg Tucker-Kellogg, Lisa Tucker-Kellogg, Jim Audia and Mark Goldsmith for advice and encouragement.