Conceived and designed the experiments: JS RW. Performed the experiments: JS RW. Analyzed the data: JS RW. Contributed reagents/materials/analysis tools: JS MS JM RW. Wrote the paper: JS RW.

The authors have declared that no competing interests exist.

As biologists increasingly rely upon computational tools, it is imperative that they be able to appropriately apply these tools and clearly understand the methods the tools employ. Such tools must have access to all the relevant data and knowledge and, in some sense, “understand” biology so that they can serve biologists' goals appropriately and “explain” in biological terms how results are computed.

We describe a deduction-based approach to biocomputation that semiautomatically combines knowledge, software, and data to satisfy goals expressed in a high-level biological language. The approach is implemented in an open source web-based biocomputing platform called BioDeducta, which combines SRI's SNARK theorem prover with the BioBike interactive integrated knowledge base. The biologist/user expresses a high-level conjecture, representing a biocomputational goal query, without indicating how this goal is to be achieved. A subject domain theory, represented in SNARK's logical language, transforms the terms in the conjecture into capabilities of the available resources and the background knowledge necessary to link them together. If the subject domain theory enables SNARK to prove the conjecture—that is, to find paths between the goal and BioBike resources—then the resulting proofs represent solutions to the conjecture/query. Such proofs provide provenance for each result, indicating in detail how they were computed. We demonstrate BioDeducta by showing how it can approximately replicate a previously published analysis of genes involved in the adaptation of cyanobacteria to different light niches.

Through the use of automated deduction guided by a biological subject domain theory, this work is a step towards enabling biologists to conveniently and efficiently marshal integrated knowledge, data, and computational tools toward resolving complex biological queries.

Biologists must increasingly conduct computational analyses across integrated biological knowledge and data

The present paper explores just such an “intelligent” paradigm that we call

Engineers approach complex knowledge-based goals in three general ways: (a) the “procedural” approach, (b) the “relational” approach, and (c) the “deductive” approach. In the procedural approach (a) one specifies each step required to reach the goal, down to the level at which the steps are primitives in the domain language. Such procedures often involve search in knowledge bases, generally traversing the structures that represent complex knowledge, transforming data, and conducting specific calculations along the way. Usually writing programs of this sort is beyond the skill of biologists.

In the relational approach (b) queries are written such that the complex inner loops (searching over records) are implicit; for example:

In the deductive approach (c), the biologist specifies a high-level goal, usually without knowing how it will be satisfied, and a runtime executor (called a

Leaving method determination to the computer is a significant though difficult advance; whereas the implicit search carried out by the relational database engine requires knowledge of how to execute and optimize a certain class of algorithms (complex loops over relational databases), the deductive approach requires that the computer have additional understanding of how aspects of the knowledge base are connected to one another, and how to use these connections in service of high-level goals. The computer must, in a sense, “understand” how the knowledge is organized, and how it relates to high-level goals. Generally the knowledge-base designer provides such information to the theorem prover in what we shall call a

The core of a deductive approach is the subject domain theory, an ontology comprising definitions of domain concepts, descriptions of the capabilities of available resources including data, knowledge, and tools, and the background knowledge necessary to relate these to high-level queries. While the word

While the applicability of the deductive approach is independent of any particular implementation, our prototype implementation, called BioDeducta, is built largely from existing components. The theorem prover is SRI's reasoner SNARK

Biology-specific data, knowledge, and software resources are drawn from the BioBike environment

We here illustrate the BioDeducta approach with an extended, realistic example. The cyanobacterial genus Prochlorococcus is widely distributed in the world's oceans, and plays a critical role both in the marine ecosystem and in the global carbon cycle

In sum, we can expand this question as follows:

Which photosynthesis-related proteins in ProMed4 have no ortholog in Pro9313 but do have an ortholog in s6803, such that the genes producing those proteins exhibit a light stress response (greater than 2× ratio in microarray data) and possibly are annotated as light-related genes?

One algorithm to find such genes might be:

Expressed in the native BioBike language (BioLisp), this might be written as follows:

Although this solution is concise and relatively efficient, programming it requires detailed knowledge of both the BioLisp programming language and how to call upon BioBike's knowledge resources. Moreover, this is a one-shot solution specific to this particular problem that affords little reuse (i.e., it lacks methodological modularity). Nor does this approach yield anything more than an opaque answer that cannot be unpacked into the method that produced it (i.e., it lacks provenance).
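As a rough illustration of the procedural style discussed above, the following Python sketch walks through the question step by step over a toy data set. It is not the authors' BioLisp program: the gene names, ortholog table, annotation set, and ratios are all invented for illustration, and only the organism names and the 2× threshold come from the text.

```python
# Toy, invented data: gene names, orthologs, and ratios are hypothetical.
GENES = {"promed4": ["gA", "gB", "gC"]}

# (gene, organism) -> ortholog gene in that organism, or None if absent.
ORTHOLOG = {
    ("gA", "pro9313"): None, ("gA", "s6803"): "sX",
    ("gB", "pro9313"): "pB", ("gB", "s6803"): "sY",
    ("gC", "pro9313"): None, ("gC", "s6803"): "sZ",
}

PHOTOSYNTHESIS_RELATED = {"gA", "gB", "gC"}

# Hypothetical light-stress expression ratios for the s6803 orthologs.
LIGHT_STRESS_RATIO = {"sX": 3.1, "sY": 4.0, "sZ": 1.2}


def candidate_genes(source="promed4", absent_in="pro9313",
                    present_in="s6803", threshold=2.0):
    """Spell out every step of the search explicitly (procedural style)."""
    hits = []
    for gene in GENES[source]:
        if gene not in PHOTOSYNTHESIS_RELATED:
            continue
        # The gene must have no ortholog in the first comparison organism.
        if ORTHOLOG.get((gene, absent_in)) is not None:
            continue
        # It must have an ortholog in the second comparison organism.
        ortholog = ORTHOLOG.get((gene, present_in))
        if ortholog is None:
            continue
        # That ortholog must respond to light stress above the threshold.
        if LIGHT_STRESS_RATIO.get(ortholog, 0.0) > threshold:
            hits.append((gene, ortholog))
    return hits


print(candidate_genes())
```

Every navigation step, data source, and threshold is hard-wired into the loop, which is the property that makes such programs hard to reuse and hard to explain after the fact.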

In contrast to this approach, the BioDeducta methodology allows us to more conveniently express and solve this problem while simultaneously offering methodological modularity and solution provenance. We begin by expressing our query in high-level terms familiar to the biologist, and then unpack each concept into modular conceptual constituents in the subject domain theory. It is the job of the theorem prover to figure out how to use the guidance offered by the subject domain theory to find a solution to the top query.

The query might be expressed as follows: What gene enables med4 to adapt to its light environment? Or, expressed in terms of a SNARK formal conjecture, does there exist a gene

The satisfaction of this conjecture in the subject domain theory is to be found by the theorem prover. Terms in the query preceded by question marks (

The meanings of the symbols of a query, such as

The subject domain theory has three parts: (1) modular definitions that enable SNARK to translate the high-level query into a search procedure through subgoals that finally ground out in terms of ground knowledge and BioBike procedures, (2) simple ground knowledge that requires no internal (subgoal) computations, and (3) procedural attachments into the BioBike knowledge base that access computed knowledge and which may perform internal “hidden” computations. As we proceed with this exposition, it is important to keep in mind that these various resources are independent of the particular

In this presentation we begin from the goal, and work our way conceptually downward. Recall that we seek a proof of the theorem that establishes the existence of a gene (

(More precisely we seek a refutation of the negation of this assertion. This process, called “proof by refutation”, is explained in detail in Russell and Norvig

Here is the definition of what it means for a gene to be adaptive—that is, for it to be related to the way in which an organism adapts to its environment:

That asserts that a gene is related to the way in which an organism adapts to its environment along a particular environmental dimension (i.e., light, in the present example) if and only if four conditions are met: First, the gene must be a gene of the given organism. Second, the gene's putative function (per its explicit annotation, provided by the database) must be conceptually related to the relevant dimension. Third, the gene must (genomically) differentiate between organisms that live in environments that differ along the relevant dimension (here: light). And fourth, the gene must be differentially regulated in an experiment that explores the relevant dimension (again: light).

Importantly, none of the specifics of the query are explicit in this formulation of what it means to be an adaptive gene; this axiom is general to this sort of problem, and uses internal formulae, which are likewise general (and which we explain below). One could argue about whether this is precisely what one intends by the biological concept of “adaptive gene”, but it is easy to adjust this definition if one desires.
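The four-part definition described above can be sketched in Python as a conjunction of four independent predicates. This is only an illustrative analogue, not SNARK's logical notation: the function names and the toy facts are hypothetical, and each predicate is passed in as a parameter to mirror the point that the definition itself mentions no particular organism, gene, or dimension.

```python
def make_adaptive(gene_in_organism, annotated_for,
                  differentiating, differentially_regulated):
    """Build an 'adaptive gene' test from four pluggable predicates."""
    def adaptive(gene, organism, dimension):
        # True exactly when all four conditions hold (if and only if).
        return (gene_in_organism(gene, organism)
                and annotated_for(gene, dimension)
                and differentiating(gene, organism, dimension)
                and differentially_regulated(gene, dimension))
    return adaptive


# Invented toy facts, for illustration only.
GENE_OF = {("gA", "promed4"), ("gB", "promed4")}
ANNOTATED = {("gA", "light"), ("gB", "light")}
DIFFERENTIATING = {("gA", "promed4", "light")}
REGULATED = {("gA", "light"), ("gB", "light")}

adaptive = make_adaptive(
    lambda g, o: (g, o) in GENE_OF,
    lambda g, d: (g, d) in ANNOTATED,
    lambda g, o, d: (g, o, d) in DIFFERENTIATING,
    lambda g, d: (g, d) in REGULATED,
)

print(adaptive("gA", "promed4", "light"))
print(adaptive("gB", "promed4", "light"))
```

Adjusting the definition, as suggested above, amounts to swapping one predicate or changing the conjunction, without touching the others.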

Continuing to unpack the meaning of terms: A differentiating gene (more precisely: a genomically differentiating gene) is one that exists in one organism and not in another that lives in a different niche, and where the niches differ along the relevant dimension:

Note that this axiom asserts the existence of a second organism, unspecified in the formula that calls upon this axiom. When SNARK encounters the

We define the concept of being differentially ecotyped as follows: There is another organism (presumably in the same group as our target organism) that lives in a different environment regarding the specified dimension (light, in the present case). Here we very roughly approximate the qualities of light as low vs. high.
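The two concepts just described can be sketched in Python as follows. This is a hedged illustration rather than the authors' axioms: the light qualities for the three organisms follow the assignments given later in the text, whereas the ortholog table and gene name are invented, and the helper names are hypothetical.

```python
# Light qualities as stated in the text; keys are (organism, dimension).
ENVIRONMENT = {
    ("mit9313", "light"): "low",
    ("s6803", "light"): "medium",
    ("promed4", "light"): "high",
}

# Invented: (gene, organism) pairs for which an ortholog exists.
ORTHOLOG_IN = {("gA", "s6803")}


def differentially_ecotyped(org1, org2, dimension):
    """True when both organisms have a known, different quality."""
    q1 = ENVIRONMENT.get((org1, dimension))
    q2 = ENVIRONMENT.get((org2, dimension))
    # The inequality is symmetric, so argument order does not matter.
    return q1 is not None and q2 is not None and q1 != q2


def differentiating_witnesses(gene, organism, dimension):
    """Organisms in a different niche that lack an ortholog of the gene."""
    return [other for (other, dim) in ENVIRONMENT
            if dim == dimension and other != organism
            and differentially_ecotyped(organism, other, dimension)
            and (gene, other) not in ORTHOLOG_IN]


def differentiating(gene, organism, dimension):
    """Existential form: at least one witness organism exists."""
    return bool(differentiating_witnesses(gene, organism, dimension))


print(differentiating_witnesses("gA", "promed4", "light"))
```

Returning the witness organisms, rather than only a truth value, loosely mirrors the way an existentially quantified variable is bound to a concrete organism during a proof. The authors' own axioms, written in SNARK's notation, are not reproduced in this sketch.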

Note that this must be expressed in both directions (across the or) to handle the case in which

Finally, to determine differential regulation we use BioBike's knowledge of microarray experimental results on the given (light) dimension (i.e., from the Hihara et al.

Here

In addition to the axiomatization above, and extensive knowledge of genes, organisms, ontologies, and microarray data built into BioBike and accessed by SNARK via procedural attachments (as described in the next section), we must provide knowledge that is specific to the present problem. Since the query is expressed in terms of the word “light”, most of this relates that term to various organisms and experiments.

These first three assertions tell SNARK that light takes three qualitative values, low, medium, and high, and that the light environments of the three organisms, mit9313, s6803, and promed4, are those niches respectively:

These assertions are to be read as: The organism “mit9313” on the environmental dimension “light” has the quality value “low”, and so on. We must also relate the Hihara et al.

If desired, an arbitrary amount of additional irrelevant knowledge could be added to make this example more realistic, but because theorem proving as used here is formally monotonic such additional knowledge would not change the results of the example, although it might slow down the proof process as false leads and dead ends are explored and rejected by the theorem prover.

Finally, we provide concepts that are implemented by procedural attachment in the BioBike system:

Given the concepts of gene-in-organism and ortholog, as above, it is easy to introduce an axiom that conceptually defines gene-has-ortholog-in-organism as follows:

That is, a given gene has an ortholog in a specific given organism if and only if there is a gene

When SNARK encounters expressions with procedural attachments, such as:

a data source is invoked that yields all the genes in the given organism (Prochlorococcus sp. strain Med4). In different branches of the search space, the variable

Two somewhat more problem-specific primitives are needed in order to work with microarray data, and to examine gene annotations:

We also add assertions that tell us that when we ask about the light semantics along the light dimension, SNARK should make use of this built-in photosynthesis-related predicate to determine whether or not this is the case for a given gene:

We will return in the discussion to consider other interesting ways in which this axiom could be expressed.

We provided BioDeducta all of the above and asked it to find a gene (and other related terms) that satisfy our query:

Once the proof is complete, the theorem prover extracts an answer to the query by examining what term replaces the variables

In other words, a low-light organism that has no ortholog to

We conducted a number of additional experiments, demonstrating that BioDeducta's results mirror the results computed by the equivalent BioLisp programs run in the same BioBike database. As mentioned above, one may quibble with the specifics of our choice of definitions, but it is easy to change the meanings of terms, or to add alternative formulations that modularly work together with existing axioms. For example, one might wish to change the definition of

Note that the

Having demonstrated how a combination of axiomatic reasoning, answer extraction, and procedural attachment may offer biologists access to powerful biocomputing analyses, we next turn to discussion of some closely related work, followed by discussion of some of the issues and opportunities raised by our work.

A bevy of activity in biocomputing is concerned with the formal representation and reasoning about biological pathways. An excellent example of this is the Pathway Logic work based on the Maude rewriting logic paradigm

The fact that Pathway Logic operates in the domain of biological signaling pathways, whereas the examples in this paper are in the domain of genomic conjectures, is an incidental difference resulting from our respective choices of problems. Because we both use explicit models, we can in principle do one another's problems by representing one another's subject domain theories. That is, given a subject domain theory containing axioms that define temporally related biological events, BioDeducta will do exactly the same work as Pathway Logic. Indeed, in the BioBike Live Tutorials that come with the BioBike system we develop several other examples, including one that analyzes protein regulation models (see the Software Availability section, below). The subject domain theory for that example includes axioms such as:

One can consider Maude as a sub-logic of SNARK, specialized to certain forms of rewriting-based reasoning. Although far less general than SNARK, Maude is very efficient for certain kinds of problems, such as reachability (e.g., can a certain molecule be generated from given precursors?). What is ultimately needed is an integration of SNARK with Maude (and other tools) so that queries that are solvable in principle in first-order logic can be solved efficiently with specialized logically sound algorithms such as the model-checking techniques available in Maude. (We thank Mark-Oliver Stehr for this insightful discussion.)

Some other systems, such as HyBrow

Furthermore, as with Pathway Logic, the apparent difference in domain between HyBrow and BioDeducta is merely a happenstance of the examples we have chosen. Axioms such as:

could serve the purpose in BioDeducta of computing the same sorts of analyses as HyBrow. Indeed, SNARK has a built-in version of the Allen temporal calculus

The central difference between BioDeducta (or Pathway Logic) and HyBrow is that BioDeducta is an inference engine armed with an explicit axiomatization of its subject domain theory, whereas HyBrow is an ad hoc program in which the equivalent of the subject domain theory is built into special-purpose code. Therefore, HyBrow cannot carry out inference and so HyBrow users cannot use high-level descriptions that are grounded by a subject domain theory in a principled way. Furthermore, HyBrow cannot give explanations that involve such transformations (even if they were carried out in the ad hoc HyBrow code). In these important ways, BioDeducta goes well beyond HyBrow and HyBrow-like systems.

Being very general, SNARK's proof process will usually be slower than specially written BioLisp programs because the latter may take advantage of specific properties of the problem to speed up search. Our solution to the light acclimation query is slower than a cleverly crafted program, but faster than a naive one. It does, however, require theory-specific domain engineering and strategic work to achieve good performance. This may be chalked up to the price one pays for the flexibility afforded by using a full first-order theorem prover with an explicit subject domain theory.

Full first-order logic is undecidable in general, meaning that there is no hope of developing a fast

Through the use of weights and clause ordering, the designer of the subject domain theory can force certain sub-formulae to be treated before others. In the present example, for instance, some symbols have procedural attachments while others do not. Some procedural attachments, such as

Useful properties may also be stated for relations, such as symmetry or reflexivity. For example,

Regardless of all these manipulations, given that s6803 has 3722 genes, promed4 has 1760, and pro9313 has 2328, even if it takes several minutes for SNARK to compute a solution, this is far less time than it would take a biologist to do the same work manually or using spreadsheets.
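To give a sense of scale for the gene counts just quoted, the short Python calculation below counts the gene pairs that would have to be examined if every promed4 gene were compared against every gene in each of the other two organisms. The exhaustive all-pairs strategy is an assumption made purely for illustration; it is not a description of how BioBike or SNARK actually look up orthologs.

```python
# Gene counts as quoted in the text.
GENE_COUNTS = {"s6803": 3722, "promed4": 1760, "pro9313": 2328}


def pairwise_checks(source, target):
    """Number of gene pairs in a naive all-against-all comparison."""
    return GENE_COUNTS[source] * GENE_COUNTS[target]


against_pro9313 = pairwise_checks("promed4", "pro9313")
against_s6803 = pairwise_checks("promed4", "s6803")

print(against_pro9313)                  # 4097280
print(against_s6803)                    # 6550720
print(against_pro9313 + against_s6803)  # 10648000
```

Under that assumption, more than ten million pairs are in play, which is well beyond what can reasonably be handled by hand or in a spreadsheet.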

Even aside from these tuning details, the mere construction of a subject domain theory is a complex and error-prone task. Fortunately, in theory it need be done only once for each subject domain. Furthermore, we do not need to begin from scratch but can import appropriate sections of subject domain theory from such standards as Cycorp's OpenCyc

One can also build a subject domain theory by composing simpler theories. One should think of the subject domain theory as a sort of dictionary of biological concepts; it modularly describes how concepts are cashed out in terms of other (simpler) concepts, eventually reaching ground facts or underlying computations. The development of any dictionary is not trivial, but because of the generality of the theorem prover, the in-principle modularity of the subject domain theory is essentially guaranteed (although not necessarily its efficiency, unless steps such as those described above are taken as well).

Of course, merging subject domain theory components may not be straightforward; different theories may use different symbols for describing the same concept, or may use the same symbol with different meanings. The notions of theory morphism and colimit, obtained from the mathematical theory of categories

Regardless of the specific approaches to mitigating issues of efficiency and of the completeness and correctness of the subject domain theory (not to mention ambiguity and arguments about definitions!), problems are bound to arise in any project of the sort we have described. As with any such project, only the long-term efforts of a dedicated community can work these out. We offer technology that is powerful enough to afford correct, complete, efficient, and explicit solutions—the biocomputation community will, over time, work out the details. BioBike is a collaborative platform within which such a community can engage in efforts of this sort.

We opened this paper with the biologist in increasing need of the ability to conduct novel computational analyses without programming, and offered BioDeducta as an approach to this. Whereas BioDeducta may provide significant opportunities in terms of methodological modularity and provenance, it may still be difficult to imagine biologists expressing queries in SNARK's logical notation (or, similarly, writing SNARK axioms). One approach to this problem is to provide graphical support for query formulation, such as was done in NASA's Amphion system

Another approach to simplifying the query (or theory) formulation task is to use quasi-natural-language. Whereas true natural language programming (or at least querying) has been a holy grail of AI since time immemorial, we do not imagine that this is possible in the near term, even in narrow domains such as biocomputing. However, the fact that the BioDeducta subject domain theory is explicitly formulated makes certain sorts of quasi-natural language a real possibility. SRI's GeoLogica and QUARK

We have conducted preliminary experiments with natural language in BioDeducta, using the same method as described for the GeoLogica system

For example, the query

“Find a gene that pertains to Promed4 and that does not have an ortholog in Pro9313.”

translates to a logical form that is thence translated by a language subject domain theory into a BioDeducta conjecture, which is thence proved, just as was done in the hli example, above. In the end, the proof produces the answer

Examining alternative proofs to the theorem produces multiple answers:

This example required 5 seconds to produce the first answer, and additional answers were almost instantaneous.

Many other sorts of queries can be addressed by BioDeducta using this approach. Examples include

“Does pmm0226 not have an ortholog in mit9313?”

“What is the Hihara mean regulation ratio of pmm0226?”

Moreover, various semantically equivalent forms are acceptable. For example, one can ask about the “hihara ratio” or the “hihara regulation ratio”, which are taken to be synonymous.

Much as this is encouraging, we do not pretend to have solved the natural language problem for biocomputing. To express such queries (in any language, natural or otherwise) users must know a great deal about the system's capabilities. Natural language provides the illusion that the system can understand everything whereas it is difficult for a system to engage the user in a natural language dialogue to indicate what it does and does not understand.

While naïve users may not be able to formulate the appropriate logical query, we may be able to guide such users to formulate logical queries even if they are ignorant of logical notation or the vocabulary of the subject domain theory. This approach depends upon the use of a sorted theory, one in which each constant is assigned a sort, that is, an indicator of the class to which it belongs; thus, promed4 may be declared to be of sort organism (or sort bacterium, a subsort of organism). Similarly, each function symbol is given a declaration of the sorts of arguments it requires and the sort of value it produces, and each relation has a declaration of the sorts of arguments it expects. Such declarations are valuable for a theorem prover in that they restrict search, admit shorter proofs, allow some error detection, and permit more concise axioms and queries. But a sorted theory is of special value in that it allows query elicitation. Let us imagine that a user is trying to formulate the complex query discussed above. The user might select the term promed4 from a menu of known organisms. Since promed4 is of sort organism, the system would offer a menu of relations and functions that accept terms of sort organism as arguments including, for example,

The goal of BioDeducta is to put biological computation directly into the hands of biologists themselves—to enable them to manipulate biological knowledge and data in an interactive computational environment, and to produce results that are backed up by explicit explanations (the proofs). But biocomputation is not a simple art, often requiring one to program complex navigations within and between complex knowledge bases. Although BioBike offers the full power of a mature programming language, it puts the burden of figuring out how to navigate the knowledge bases entirely upon the biologist/users themselves. BioDeducta can in principle assist the user by taking advantage of the guidance of an explicit subject domain theory to find its way through the knowledge base to answer complex queries. Of course, the more meta-knowledge is available to describe knowledge bases and their relationships, the more complete is BioDeducta's ability to offer assistance in this regard.

In concluding, we wish to emphasize an aspect of the present approach that helps address a critical problem in computational biology: the problem of provenance—that is, tracking how results are calculated, especially as it applies in the annotation of biological function.

The concept of gene function is highly problematic for a number of reasons, not the least of which is that the way in which function is determined by the annotator is not generally made explicit in the annotation. As more and more genomes come online faster and faster, functional annotation is increasingly done by computational methods rather than by experiment. Because the provenance (the data behind such annotations) is not stored, there is the potential (in fact, the near certainty!) of propagating errors and, complementarily, failing to propagate corrections

Recall that the explicit subject domain theory is not so much a theory of biology as a description of the way to expand high-level biological concepts in terms of more primitive concepts, finally reaching “ground” terms and functions in the knowledge base or BioBike functions. By virtue of this, the proof constructed by SNARK in the process of proving a given conjecture forms an explicit trace or “explanation”, including the specific axioms used, and the ways in which the variables were bound in the axioms that led to a given result; the more explicit the subject domain theory, the more detailed the explanation. By virtue of this fact, BioDeducta proofs represent precisely the sort of explicit provenance whose absence is endangering the very underpinnings of molecular biology, and by recording the proofs along with the results that underpinning could be, at least in part, restored!
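The idea that a proof doubles as a provenance record can be illustrated with a very small backward-chaining sketch in Python. This is a deliberately simplified stand-in, not SNARK (which is a full first-order prover working by refutation): the rule name, predicates, and facts are invented, terms are flat, and variables are written with a leading question mark.

```python
from itertools import count

# Invented toy knowledge: ground facts and one named rule (head, body).
FACTS = [
    ("gene-in-organism", "gA", "promed4"),
    ("gene-in-organism", "gB", "promed4"),
    ("light-regulated", "gA"),
]
RULES = [
    ("adaptive-def", ("adaptive", "?g", "?o"),
     [("gene-in-organism", "?g", "?o"), ("light-regulated", "?g")]),
]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def walk(term, env):
    """Follow variable bindings until reaching a constant or free variable."""
    while is_var(term) and term in env:
        term = env[term]
    return term


def unify(a, b, env):
    """Unify two flat atoms; return an extended binding dict or None."""
    if len(a) != len(b):
        return None
    env = dict(env)
    for x, y in zip(a, b):
        x, y = walk(x, env), walk(y, env)
        if x == y:
            continue
        if is_var(x):
            env[x] = y
        elif is_var(y):
            env[y] = x
        else:
            return None
    return env


_fresh = count()


def rename(rule):
    """Give a rule fresh variable names so separate uses do not clash."""
    name, head, body = rule
    n = next(_fresh)

    def fresh(atom):
        return tuple(f"{t}_{n}" if is_var(t) else t for t in atom)

    return name, fresh(head), [fresh(atom) for atom in body]


def prove(goals, env, trace):
    """Yield (bindings, trace) for each way of proving all goals."""
    if not goals:
        yield env, trace
        return
    goal, rest = goals[0], goals[1:]
    for fact in FACTS:
        extended = unify(goal, fact, env)
        if extended is not None:
            yield from prove(rest, extended, trace + [("fact", fact)])
    for rule in RULES:
        name, head, body = rename(rule)
        extended = unify(goal, head, env)
        if extended is not None:
            yield from prove(body + rest, extended, trace + [("rule", name)])


def answers(query):
    """Yield each instantiated query together with the steps that proved it."""
    for env, trace in prove([query], {}, []):
        yield tuple(walk(t, env) for t in query), trace


for answer, steps in answers(("adaptive", "?gene", "promed4")):
    print(answer, steps)
```

Each answer arrives with the list of rules applied and facts consulted, so storing that list next to the result preserves how the result was obtained.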

BioDeducta is SNARK+BioBike. SNARK is built into the BioBike demo server, accessible through

The examples in this paper are available as BioBike Live Tutorials, also at

Preparation of the Hihara et al. (2001) Data

(0.03 MB DOC)

Complete Refutation Proof for the adaptive gene conjecture resulting in PMM0817

(0.06 MB DOC)

Mike Travers designed and initially implemented the BioBike frame system and did much of the threading of the BioBike knowledge base. Yannick Pouliot offered us advice on problem domains and on the use of the BioWarehouse, and Carolyn Talcott and Merill Knapp helped us through discussion of data sources and formalization of the subject domain. The light acclimation example is based on a problem posed by Jeff Elhai of Virginia Commonwealth University. Carolyn Talcott and Mark-Oliver Stehr gave us many comments on this paper.