Empirical companion to *Primitive Relations, Computational Complexity, and a Conjecture on the Genomic Computational Class* (Paper I) and *The Genome as Computer* (Paper II)
The companion theoretical papers conjectured that gene regulatory circuits can be classified by computational complexity class (I–V) based on their topological structure, and that this classification has measurable consequences for the predictability of circuit behavior. This paper reports the first empirical test of that conjecture. We classified 780 genes from the genome-scale Replogle et al. (2022) K562 Perturb-seq dataset by GLMP complexity class using the TRRUST regulatory network and literature curation, then evaluated 16 grammar-blind perturbation prediction models — including 14 methods from the Nature Methods 27-method benchmark and Arc Institute’s STATE transformer (scored under two metrics) — on whether prediction accuracy varies by circuit class. We find that Class III (bistable/positive feedback) genes are systematically harder to predict than Class I (feed-forward) genes: 12 of 16 methods (75%) show lower accuracy for Class III, and the mean accuracy deficit is statistically significant across methods (one-sample t-test, t = −3.55, p = 0.0015). This constitutes the first empirical evidence that computational complexity class, as defined by regulatory circuit topology, correlates with the difficulty of predicting perturbation responses. The grammar-aware comparison (CellOracle vs. grammar-blind models) was limited by CellOracle’s restriction to transcription factor perturbations; a GRN topology analysis of the CellOracle-inferred network provided partial corroboration. We discuss the implications for virtual cell model benchmarking and the theoretical framework.
Virtual cell models — AI systems that predict how cells respond to genetic or chemical perturbations — are evaluated by aggregate accuracy across all genes. No existing benchmark stratifies predictions by the regulatory circuit topology of the perturbed gene. The theoretical framework of Papers I and II predicts that this topology matters: a feed-forward circuit (Class I) should be inherently more predictable than a bistable switch (Class III) or a self-modifying epigenetic circuit (Class V), regardless of the model used.
If correct, this prediction has immediate practical consequences. It means that AI models of biology face not merely technological limitations but mathematical ones: for circuits above a certain complexity class, perfect prediction is impossible in principle. It also means that current aggregate benchmarks are misleading — a model reporting 70% accuracy may be achieving 95% on Class I genes and 30% on Class V genes, and the aggregate hides this biologically fundamental distinction.
We propose to test this by re-stratifying existing virtual cell model benchmarks by circuit class. The experimental design requires no new biological data — only a new axis of analysis applied to existing public datasets.
H1 (Accuracy Gradient): Virtual cell model prediction accuracy decreases monotonically from Class I (feed-forward) through Class V (self-modifying epigenetic) when measured on genome-scale Perturb-seq benchmarks.
H2 (Grammar-Aware Advantage): Models that explicitly represent gene regulatory network topology (grammar-aware: CellOracle) outperform models that do not (grammar-blind: STATE) when evaluated per circuit class, and the performance gap is largest for Class III (bistable) circuits.
H3 (Class I Ceiling): For Class I (feed-forward) circuits, both grammar-blind and grammar-aware models approach high accuracy (>0.7 Pearson correlation), and the inter-model accuracy gap is small (<0.05). Feed-forward circuits are decidable; knowing the topology adds little because there is no hidden state.
H4 (Bistability Signature): Genes regulated by Class III (bistable) circuits exhibit bimodal expression distributions in single-cell data. Models that fail to predict the correct attractor state for individual cells will show a characteristic error pattern: high accuracy for the population mean but low accuracy for individual cell predictions.
H5 (Class V Hard Ceiling): For Class V (self-modifying epigenetic) circuits, both models converge to low accuracy, and the grammar-aware model gains little advantage. If Class V circuits approach Turing-completeness, Rice’s theorem implies that no algorithm — regardless of architecture — can perfectly predict their behavior. The accuracy ceiling for Class V should be strictly less than for Class I.
H6 (DEG Recovery Gradient): The fraction of true differentially expressed genes (DEGs) correctly recovered by each model decreases with circuit class. Class I perturbations have simple, localized effects (few DEGs, high recovery). Class IV–V perturbations have complex, distributed effects (many DEGs, low recovery).
H7 (Network Concordance): CellOracle’s inferred gene regulatory network, when classified by the same topology criteria as GLMP, assigns the majority of genes to the same complexity class as the GLMP classification. Where they disagree, the discordance is informative: genes misclassified by CellOracle (relative to GLMP) should show lower prediction accuracy than concordant genes.
The primary benchmark is the Replogle et al. (2022) K562 Perturb-seq dataset — genome-scale CRISPRi perturbations in K562 human chronic myeloid leukemia cells. This dataset perturbs approximately 5,000 essential genes with single-cell RNA-seq readout and is the standard benchmark used by STATE, CellOracle, and the Nature Methods 27-method comparison (2025). It is available in processed form on the CZI Virtual Cells Platform and in harmonized format via scPerturb.
For each perturbed gene that encodes a transcription factor, signaling protein, or chromatin modifier, we classify the gene’s regulatory circuit by GLMP complexity class using the following evidence hierarchy:
| Priority | Source | Method | Confidence |
|---|---|---|---|
| 1 | GLMP database | Direct classification from existing typed flowchart | High |
| 2 | Published literature | Well-characterized circuits (lac operon, p53-MDM2, Yamanaka factors, etc.) | High |
| 3 | TRRUST v2 / ENCODE | Known human TF-target interactions; classify by network motif topology | Medium |
| 4 | RegulonDB | E. coli ortholog regulatory interactions (for conserved circuits) | Medium |
| 5 | GLMP pipeline | LLM-generated typed flowchart from literature, classified by topology | Low–Medium |
Circuit class assignment follows the five-class ladder from Paper I:
| Class | Topology Criterion | Formal Test | Prototype Example |
|---|---|---|---|
| I | No feedback edges in regulatory subgraph | Directed acyclic graph (DAG) | Simple inducible promoters; JUN/AP-1 cascade |
| II | Negative autoregulation or negative feedback loop | Single negative cycle in subgraph | NF-κB–IκB loop |
| III | Positive feedback or mutual repression (bistable) | Positive cycle or double-negative cycle | p53-MDM2 toggle; GATA1-PU.1; Yamanaka circuit |
| IV | Mixed feedback with delay elements (oscillatory) | Negative cycle with ≥3 nodes (repressilator topology) | CLOCK-BMAL1 circadian loop |
| V | Self-modifying: circuit alters its own regulatory architecture | Chromatin modifier targets its own regulatory region | DNMT3A self-methylation; EZH2 autosilencing |
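The formal tests in the table reduce to a signed-cycle check on the regulatory subgraph. The following is a minimal sketch, assuming a networkx DiGraph whose edges carry a `sign` attribute (+1 activation, −1 repression); Class V is flagged externally, since self-modification is not visible in the sign graph, and graphs containing both positive and long negative cycles are resolved in favor of Class III here as a simplification:

```python
import networkx as nx

def circuit_class(g: nx.DiGraph, self_modifying: bool = False) -> str:
    """Assign a GLMP class from the signed topology of a regulatory subgraph."""
    if self_modifying:                       # Class V needs external annotation
        return "V"
    cycles = list(nx.simple_cycles(g))
    if not cycles:                           # DAG -> feed-forward
        return "I"
    signs = []                               # (cycle sign, cycle length)
    for cyc in cycles:
        sign = 1
        for u, v in zip(cyc, cyc[1:] + cyc[:1]):
            sign *= g[u][v]["sign"]
        signs.append((sign, len(cyc)))
    if any(s == 1 for s, _ in signs):        # positive / double-negative cycle
        return "III"
    if any(n >= 3 for s, n in signs):        # long negative loop (repressilator)
        return "IV"
    return "II"                              # short negative feedback

# Toy circuits mirroring the table's prototypes
ffl = nx.DiGraph([("A", "B"), ("B", "C")])
nx.set_edge_attributes(ffl, 1, "sign")
toggle = nx.DiGraph()
toggle.add_edge("GATA1", "PU1", sign=-1)
toggle.add_edge("PU1", "GATA1", sign=-1)
print(circuit_class(ffl), circuit_class(toggle))  # I III
```

The double-negative GATA1/PU.1 toggle comes out as Class III because the product of its two repressive edge signs is positive, matching the "positive cycle or double-negative cycle" criterion.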
The following GLMP-style flowcharts illustrate the topological distinction between classes. These are deliberately simplified to highlight the structural feature — the presence or absence of directed cycles — that determines the class. Full GLMP flowcharts with complete molecular detail for these and other regulatory circuits are available in the GLMP database. Colors follow the GLMP palette: green = gene/protein, red = repression, blue = activation/signal, orange = regulatory outcome.
Class I — Feed-forward cascade (JUN/AP-1). Signal flows in one direction; no cycle. The perturbation response is fully determined by the current input.
Class II — Negative feedback (ATF4 stress response). A single negative cycle produces self-correction: the output damps the signal that produced it. The circuit is homeostatic.
Class III — Bistable switch: GATA1 / PU.1 mutual repression. Each transcription factor represses the other, creating two stable attractor states: high-GATA1 (erythroid fate) or high-PU.1 (myeloid fate). Once a cell “nucleates” into one state, the mutual repression lock maintains it. A grammar-blind model observing a snapshot cannot determine which attractor the cell has committed to. This gene’s benchmark accuracy (mean r = 0.699) is below the Class I average (0.780).
Class III — Bistable switch: MYC autoactivation. MYC protein activates its own transcription through a positive feedback loop. Above a threshold, the circuit locks into a high-MYC proliferative state — the “siphon” is self-sustaining. The initiating signal is no longer necessary. Benchmark accuracy: mean r = 0.817, above Class I average — illustrating that not all Class III genes are hard to predict, but the class as a whole is.
Class III — Bistable switch: VHL-HIF oxygen sensing. Under normoxia, VHL degrades HIF1α. Under hypoxia, HIF1α accumulates and activates a transcriptional program including genes that further stabilize the hypoxic state. The circuit exhibits bistable behavior around the oxygen threshold. Benchmark accuracy: mean r = 0.921, the highest in Class III.
Class IV — Delayed negative feedback (circadian CLOCK–BMAL1 / PER–CRY). Class IV denotes mixed feedback with delay: typically a negative loop spanning three or more nodes so that transcription–translation delays convert a simple negative feedback into sustained oscillation rather than a static toggle. The CLOCK–BMAL1 heterodimer activates Per and Cry genes; the accumulating PER–CRY complex feeds back to repress CLOCK/BMAL1-driven transcription after a lag, closing the loop. The relevant hidden variable is often phase (where the cell sits in the cycle), not only which attractor basin is occupied. Only two benchmark genes fall in Class IV in our table (HDAC7, TFDP1); interpret aggregate Class IV statistics with caution.
Class V — Self-modifying chromatin (BRD4). BRD4 reads acetylated histones and recruits the transcriptional machinery, which includes histone acetyltransferases that create more acetylation marks — modifying the regulatory architecture itself. The circuit rewrites its own program. Benchmark accuracy: mean r = 0.876.
The original design called for a direct comparison between a grammar-blind model (STATE) and a grammar-aware model (CellOracle). In practice, we evaluated a broader set: 14 methods from the Nature Methods 27-method benchmark (2025) and Arc Institute’s STATE transformer, for a total of 16 grammar-blind evaluations spanning diverse architectures (STATE was scored under both its native HVG metric and the benchmark DE20 metric).
| Method | Architecture | Source |
|---|---|---|
| scGPT | Transformer (pre-trained) | Cui et al. 2024 |
| GEARS | Graph neural network | Roohani et al. 2023 |
| GeneCompass | Transformer (pre-trained) | Yang et al. 2023 |
| GenePert | MLP + gene embeddings | Gao et al. 2024 |
| AttentionPert | Attention-based | Dong et al. 2024 |
| CPA | Variational autoencoder | Lotfollahi et al. 2023 |
| scELMo | Language model | Liu et al. 2024 |
| bioLord | Disentangled representation | Lotfollahi et al. 2024 |
| scouter | Meta-learning | Ji et al. 2024 |
| linearModel | Linear regression baseline | Benchmark paper |
| trainMean | Training mean baseline | Benchmark paper |
| baseMLP | MLP baseline | Benchmark paper |
| baseReg | Regularized linear baseline | Benchmark paper |
| baseControl | Control baseline | Benchmark paper |
| STATE (ST-HVG-Parse) | Transformer (virtual cell) | Arc Institute, Hie et al. 2025 |
CellOracle (Morris Lab) was evaluated separately as the grammar-aware comparator, but its use was constrained: CellOracle can only simulate perturbations for genes that are transcription factors in its inferred GRN, limiting the direct comparison to a small subset. We supplemented this with a GRN topology analysis of the full CellOracle-inferred network (Section 8.4).
For the 14 benchmark methods, per-gene Pearson correlations between predicted and observed expression change vectors were obtained from the published benchmark results, evaluated on the top 20 differentially expressed genes per perturbation. For STATE, we ran inference independently using the ST-HVG-Parse model on Google Colab (T4 GPU) and computed Pearson correlations across 2,000 highly variable genes. This difference in evaluation scope is noted as a caveat in the interpretation.
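The DE20 scoring can be sketched as follows. The function name, the ranking by |observed change|, and the synthetic vectors are our assumptions for illustration, not the benchmark's exact code:

```python
import numpy as np
from scipy.stats import pearsonr

def de20_score(observed: np.ndarray, predicted: np.ndarray, k: int = 20) -> float:
    """Pearson r over the top-k genes ranked by |observed expression change|."""
    top = np.argsort(np.abs(observed))[::-1][:k]
    r, _ = pearsonr(observed[top], predicted[top])
    return float(r)

# Synthetic per-gene change vectors (2,000 genes, HVG-sized)
rng = np.random.default_rng(0)
obs = rng.normal(size=2000)
pred = obs + rng.normal(scale=0.5, size=2000)   # noisy but correlated prediction
print(round(de20_score(obs, pred), 3))
```

The HVG metric is the same correlation taken over all 2,000 genes rather than the top 20, which is why the two metrics can diverge sharply for the same predictions.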
For each model and each metric, we compute the mean accuracy per circuit class and test for a monotonic decrease. The primary test is whether Class III (bistable) genes show systematically lower prediction accuracy than Class I (feed-forward) genes across methods.
| Test | Hypothesis | What It Answers |
|---|---|---|
| One-sample t-test on C3−C1 differences | H1 | Is the mean accuracy deficit for Class III significant across methods? |
| Sign test (binomial) | H1 | Do more methods show C3 < C1 than expected by chance? |
| Mann-Whitney U (per method) | H1 | Within each method, do Class I and III distributions differ? |
| OLS regression with effect-size covariate | Confound | Does class predict accuracy after controlling for perturbation magnitude? |
| Effect-size matched subsampling | Confound | Is the class effect robust to matched controls? |
| Bootstrap CI | All | 95% confidence intervals on per-class accuracy means |
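The two H1 tests in the table reduce to a few scipy calls. The sketch below uses the per-method C3−C1 differences eventually reported in Section 8.1 as its input vector:

```python
import numpy as np
from scipy import stats

# Per-method C3 - C1 differences (16 rows, from the Section 8.1 table)
c3_minus_c1 = np.array([
    -0.086, -0.079, -0.073, -0.065, -0.048, -0.047, -0.045, -0.033,
    -0.031, -0.022, -0.016, -0.010, +0.004, +0.016, +0.017, +0.022,
])

# One-sample t-test, H1: mean(C3 - C1) < 0 (requires scipy >= 1.6)
t = stats.ttest_1samp(c3_minus_c1, 0.0, alternative="less")
# Sign test, H1: more than half the methods show C3 < C1 (scipy >= 1.7)
n_harder = int((c3_minus_c1 < 0).sum())
sign = stats.binomtest(n_harder, len(c3_minus_c1), 0.5, alternative="greater")

print(f"t = {t.statistic:.2f}, p = {t.pvalue:.4f}; "
      f"{n_harder}/16 harder, sign p = {sign.pvalue:.3f}")
```

Run on the reported values, this reproduces t = −3.55, p = 0.0015 and the sign-test p ≈ 0.038.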
We classified 826 genes from the K562 Perturb-seq dataset by GLMP complexity class. Of these, 780 overlapped with the set of genes evaluated in the 14-method benchmark, forming the analysis cohort. Classifications were derived from TRRUST v2 network topology (medium confidence) for genes with known regulatory interactions, supplemented by literature curation (high confidence) for well-characterized circuits, with the remainder assigned to Class I by default (low confidence).
| Class | N (classified) | N (in benchmark) | Representative Genes | Primary Evidence |
|---|---|---|---|---|
| I | 785 | 748 | ANAPC4, BCL2L1, JUN, FOSL1 | TRRUST DAG / default |
| II | 13 | 8 | ATF4, BRCA1, HDAC3, PIK3C3 | TRRUST negative cycle |
| III | 17 | 14 | MYC, VHL, GATA1, RAD51, MED1, CDC20, BUB1 | Literature + TRRUST positive cycle |
| IV | 2 | 2 | HDAC7, TFDP1 | TRRUST oscillatory motif |
| V | 9 | 8 | BRD4, SMARCB1, CBX1, INO80B, MIS18BP1 | Literature: self-modifying chromatin / self-templating epigenetic |
The class distribution is highly skewed: 96% of genes fall into Class I, reflecting both the genuine preponderance of feed-forward circuits and a conservative classification strategy where genes with no known feedback were assigned to Class I by default. Classes IV and V have very small sample sizes (N = 2 and N = 8 in the benchmark), precluding reliable statistical inference for these classes individually. Class III (N = 14) is the largest of the non–Class I classes and the only one with sufficient statistical power for a meaningful comparison against Class I; it is therefore the focus of the primary analysis.
The Class III set was expanded from an initial 5 genes (identified purely from TRRUST topology) to 14 by systematic literature curation of well-characterized bistable circuits (e.g., TP53-MDM2, STAT3 autoactivation, MYC positive feedback). All additions were made before examining their prediction scores.
Per-gene Pearson correlations for 14 methods were obtained from the published Nature Methods benchmark results (evaluated on top 20 differentially expressed genes per perturbation). Across all 780 genes, the mean accuracy ranged from 0.62 (baseControl) to 0.81 (trainMean), with most deep learning methods between 0.66 and 0.81.
STATE ST-HVG-Parse was run on the same K562 dataset using Google Colab (T4 GPU, 15.6 GB VRAM). The model was evaluated in zero-shot mode (no task-specific fine-tuning). Predictions were scored under two metrics: (a) Pearson correlation across 2,000 HVGs (the model’s native evaluation), and (b) Pearson correlation across the top 20 differentially expressed genes per perturbation (matching the 14 benchmark methods).
HVG metric (2,000 genes): Mean r = 0.088, median = 0.074.
DE20 metric (top 20 DE genes): Mean r = 0.003, median = 0.004.
Both metrics show dramatically lower accuracy than the 14 benchmark methods (r ≈ 0.7–0.8 on DE20). Notably, restricting to DE20 genes does not improve STATE’s performance — it worsens it, with many genes showing negative correlations. This indicates that STATE’s zero-shot predictions genuinely do not capture the strongest perturbation effects in K562, rather than merely being diluted by noise across 2,000 genes. For the class-stratified analysis, we use the DE20 metric to ensure comparability with the benchmark methods.
For each of the 16 methods (14 benchmark + STATE scored under two metrics), we computed the mean Pearson correlation for Class I genes and Class III genes, then took the difference (C3 − C1). A negative value means Class III is harder to predict. STATE is shown under both its original HVG metric and the DE20 metric that matches the benchmark evaluation; the DE20 metric is used in the primary analysis for comparability.
| Method | C3 − C1 | Direction |
|---|---|---|
| GeneCompass | −0.086 | C3 harder |
| CPA | −0.079 | C3 harder |
| trainMean | −0.073 | C3 harder |
| baseReg | −0.065 | C3 harder |
| linearModel | −0.048 | C3 harder |
| GEARS | −0.047 | C3 harder |
| scGPT | −0.045 | C3 harder |
| scELMo | −0.033 | C3 harder |
| STATE (DE20) | −0.031 | C3 harder |
| GenePert | −0.022 | C3 harder |
| AttentionPert | −0.016 | C3 harder |
| baseMLP | −0.010 | C3 harder |
| baseControl | +0.004 | ~equal |
| scouter | +0.016 | ~equal |
| bioLord | +0.017 | ~equal |
| STATE (HVG) † | +0.022 | ~equal |
† STATE (HVG) is shown for reference because it uses a different evaluation metric (2,000 HVGs vs. top-20 DE genes); the meta-analytic tests below nonetheless include all 16 rows. When re-scored on DE20, STATE shows C3 harder, consistent with the other methods.
12 of 16 methods show Class III genes as harder to predict than Class I.
Sign test (binomial, H1: proportion > 0.5): p = 0.038 (significant).
One-sample t-test (H1: mean C3−C1 < 0): t = −3.55, p = 0.0015.
The mean accuracy deficit for Class III across all 16 methods is −0.031 Pearson units.
Both tests are significant: the t-test at p < 0.002, and the sign test, which counts only directions, at p < 0.05. Re-scoring STATE on the DE20 metric resolved the metric mismatch that had made STATE the one apparent outlier in the original 15-method analysis: under a common metric, STATE shows the same Class III difficulty pattern as the other methods.
Figure 1. Class III accuracy deficit (C3 − C1) across 16 methods. Bars extending left indicate Class III is harder to predict. 12 of 16 methods show a negative difference.
The table below shows the mean Pearson correlation per class, averaged across the 14 benchmark methods and separately for STATE under both metrics.
| Class | N | Mean r (14 methods) | STATE (HVG) | STATE (DE20) |
|---|---|---|---|---|
| I | 748 | 0.780 | 0.097 | −0.003 |
| II | 8 | 0.856 | 0.032 | 0.088 |
| III | 14 | 0.745 | 0.119 | −0.034 |
| IV | 2 | 0.789 | 0.077 | 0.060 |
| V | 8 | 0.870 | 0.073 | 0.053 |
For the 14 benchmark methods, Class III (mean r = 0.745) is the lowest-accuracy class, consistent with H1. However, Classes II and V show higher accuracy than Class I, contrary to a simple monotonic gradient. This may reflect the small sample sizes for these classes or a genuine distinction: Class II (negative feedback) circuits may be inherently easier because negative feedback produces stable, self-correcting behavior that models can learn from training examples.
For STATE, the two metrics tell different stories. Under the HVG metric (2,000 genes), Class III appeared easiest — but this was an artifact of evaluating on a broad gene set where the signal-to-noise ratio is too low for reliable class-level distinctions. Under the DE20 metric (matching the benchmark evaluation), STATE shows Class III as the hardest class (r = −0.034 vs. Class I r = −0.003), consistent with the other methods. The metric mismatch, not the model, was responsible for the apparent discrepancy.
Figure 2. Per-gene mean prediction accuracy (14 benchmark methods) for Class I (blue, N = 748), Class III (orange, N = 14), and Class V (red, N = 8) genes. Each dot is one gene. Class III genes cluster in the lower half of the distribution. Dashed lines show class means.
A potential confound is that Class III genes might produce smaller perturbation effects (lower E-distance from control), making them inherently harder to detect regardless of circuit topology. We tested this for four representative methods (scGPT, GEARS, linearModel, trainMean) using the three confound-control approaches from the statistical test plan: OLS regression with an effect-size covariate, effect-size-matched subsampling, and bootstrap confidence intervals on the per-class means.
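The OLS control can be sketched on synthetic data. The variable names, effect sizes, and noise level below are illustrative assumptions, not the paper's fitted values; the point is the mechanics of asking whether the class indicator survives a magnitude covariate:

```python
import numpy as np

# Synthetic cohort: accuracy depends on both perturbation magnitude and class
rng = np.random.default_rng(1)
n = 2000
is_class3 = (rng.random(n) < 0.1).astype(float)   # ~10% Class III indicator
log_edist = rng.normal(size=n)                    # log E-distance from control
accuracy = (0.78 + 0.03 * log_edist - 0.03 * is_class3
            + rng.normal(scale=0.05, size=n))

# OLS: accuracy ~ 1 + Class III + log E-distance
X = np.column_stack([np.ones(n), is_class3, log_edist])
beta, *_ = np.linalg.lstsq(X, accuracy, rcond=None)
resid = accuracy - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_class = beta[1] / se[1]   # does class survive the magnitude control?
print(beta.round(3), round(float(t_class), 1))
```

A significantly negative class coefficient after controlling for log E-distance is the pattern that would argue against the pure effect-size confound.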
CellOracle was used to infer a gene regulatory network from the K562 control cells (scRNA-seq). The fitted GRN coefficients were extracted, and for each of the 780 benchmark genes, we computed topology features from the inferred network: in-degree, out-degree, betweenness centrality, PageRank, strongly connected component membership, and participation in feedback cycles of length ≤5.
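A sketch of the feature computation, assuming the fitted GRN coefficients have been exported as (regulator, target, coefficient) triples; feedback participation is approximated here by nontrivial strongly connected component membership plus self-loops, a superset of the length-≤5 cycle test:

```python
import networkx as nx

def grn_features(edges):
    """edges: iterable of (regulator, target, coef) from the fitted GRN."""
    g = nx.DiGraph((u, v) for u, v, c in edges if c != 0)
    pr = nx.pagerank(g)
    btw = nx.betweenness_centrality(g)
    scc_size = {n: len(c) for c in nx.strongly_connected_components(g) for n in c}
    return {
        n: dict(in_deg=g.in_degree(n), out_deg=g.out_degree(n),
                pagerank=pr[n], betweenness=btw[n], scc_size=scc_size[n],
                # in a nontrivial SCC, or carrying a direct autoregulatory edge
                in_feedback=scc_size[n] > 1 or g.has_edge(n, n))
        for n in g
    }

# Toy network: MYC autoactivation plus a TP53-MDM2 two-node loop
toy = [("MYC", "MYC", 0.4), ("TP53", "MDM2", 0.6), ("MDM2", "TP53", -0.5)]
feats = grn_features(toy)
print(feats["MYC"]["in_feedback"], feats["TP53"]["scc_size"])  # True 2
```

Edge signs are dropped in this sketch; the class analysis in Section 8.4 additionally distinguishes positive from negative feedback, which requires keeping the coefficient signs.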
The primary finding from this analysis was counter-intuitive: genes participating in feedback loops in CellOracle’s GRN were, on average, easier to predict by grammar-blind methods, not harder. This may reflect the fact that highly connected, feedback-rich genes (e.g., major transcription factors like MYC, TP53) tend to produce large, stereotyped perturbation effects that are well-represented in training data.
However, when the analysis was restricted to Class III (bistable) genes specifically, the difficulty signal re-emerged: these genes showed lower mean accuracy than Class I genes, consistent with the meta-analysis finding. This suggests that the theoretically predicted difficulty is specific to positive feedback / bistable circuits, not to feedback in general — a refinement of the original hypothesis.
A natural objection to the central finding is that the Class III set (N = 14) is small and might be driven by a few outlier genes. We therefore attempted to expand the set by systematically identifying additional genes in the K562 benchmark that participate in documented bistable circuits. A literature search identified eight candidates with published evidence of positive feedback or bistability:
| Gene | Bistable Circuit | Reference | Mean r |
|---|---|---|---|
| CYCS | MOMP all-or-none apoptotic switch | Bagci et al. 2006 | 0.788 |
| ESPL1 | Separase–securin mitotic exit toggle | Pomerening et al. 2005 | 0.811 |
| FBXO5 | Emi1–APC/C interphase/mitosis switch | Cappell et al. 2016 | 0.800 |
| INCENP | CPC tension-sensing error correction | Lampson & Cheeseman 2011 | 0.839 |
| PPP1CA | CDK1–PP1 mitotic exit switch | Mochida et al. 2010 | 0.855 |
| FOXL2 | FOXL2/SOX9 sex determination toggle | Uhlenhaut et al. 2009 | 0.902 |
| BORA | Plk1–Aurora A–CDK1 positive feedback | Macurek et al. 2008 | 0.839 |
| SRF | SRF–MRTF–actin positive feedback | Olson & Nordheim 2010 | 0.957 |
Adding these 8 genes to Class III (N = 14 → 22) eliminated the statistical signal. The expanded Class III mean accuracy rose to 0.783 — essentially identical to Class I (0.779). The t-test became non-significant (t = 0.41, p = 0.66). The 8 new genes had mean r = 0.849 — above Class I, not below.
This negative result is informative. The 8 candidate genes participate in two categories of bistable circuit that differ fundamentally from the original 14: transient cell-cycle switches (ESPL1, FBXO5, PPP1CA, BORA, INCENP) that flip once per division and are therefore abundantly sampled in any proliferating population, and triggered or context-restricted switches (the CYCS all-or-none apoptotic commitment; the FOXL2 and SRF developmental and mechanotransduction toggles) whose alternative attractor state is rarely or never occupied in steady-state K562 cells.
By contrast, the original 14 Class III genes include persistent cell-identity bistable circuits: GATA1/PU.1 hematopoietic fate commitment, MYC proliferative autoactivation, VHL–HIF oxygen-sensing bistability. These circuits lock cells into stable attractor states that persist across the cell cycle. A grammar-blind model observing a transcriptomic snapshot cannot determine which attractor the cell has “crystallized” into — and therefore cannot reliably predict the perturbation response, which depends on the current attractor state.
We therefore retained the original 14-gene Class III set. The expansion attempt was not wasted: it revealed that the prediction difficulty is not a generic property of positive feedback, but a specific property of persistent bistable circuits that create hidden attractor states. This distinction — between transient cycling bistability and persistent fate bistability — is itself a refinement of the GLMP framework.
If persistent bistable circuits create hidden attractor states (Section 10.5), those states should leave a distributional fingerprint: cells committed to different attractors should express the gene at different levels, producing a bimodal expression distribution in the single-cell data. We tested this by computing Sarle’s bimodality coefficient (BC) for each gene across 10,691 control cells, after library-size normalization and log-transformation. BC > 5/9 ≈ 0.555 is the standard threshold indicating bimodality. As a complementary measure, we computed Ashman’s D, where D > 2 indicates clean separation of two modes.
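Both measures are computable in a few lines. This sketch assumes library-size-normalized, log-transformed per-gene expression vectors; the use of scikit-learn's GaussianMixture for the two-component fit behind Ashman's D is our implementation choice:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.mixture import GaussianMixture

def sarle_bc(x: np.ndarray) -> float:
    """Sarle's bimodality coefficient; BC > 5/9 suggests bimodality."""
    n = len(x)
    g1, g2 = skew(x), kurtosis(x)   # g2 = excess kurtosis
    return (g1**2 + 1) / (g2 + 3 * (n - 1)**2 / ((n - 2) * (n - 3)))

def ashman_d(x: np.ndarray) -> float:
    """Ashman's D from a 2-component Gaussian fit; D > 2 = clean separation."""
    gm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
    (m1, m2), (v1, v2) = gm.means_.ravel(), gm.covariances_.ravel()
    return float(np.sqrt(2) * abs(m1 - m2) / np.sqrt(v1 + v2))

rng = np.random.default_rng(0)
unimodal = rng.normal(0.0, 1.0, 5000)
bimodal = np.concatenate([rng.normal(-2, 0.5, 2500), rng.normal(2, 0.5, 2500)])
print(sarle_bc(unimodal) < 5 / 9 < sarle_bc(bimodal))  # True
print(ashman_d(bimodal) > 2)                           # True
```

Note that a forced two-component fit can report moderate D even on unimodal data, which is one reason D > 2 is used as the separation threshold.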
| Class | N | Mean BC | Median BC | Frac BC > 5/9 | Mean Ashman D |
|---|---|---|---|---|---|
| I | 768 | 0.590 | 0.568 | 0.516 | 3.61 |
| II | 13 | 0.531 | 0.509 | 0.385 | 3.35 |
| III | 16 | 0.578 | 0.579 | 0.562 | 3.60 |
| IV | 2 | 0.627 | 0.627 | 0.500 | 3.70 |
| V | 9 | 0.675 | 0.579 | 0.667 | 4.10 |
Class III mean BC = 0.578 vs. Class I mean BC = 0.590. One-sided t-test (C3 > C1): t = −0.29, p = 0.61. Mann-Whitney: p = 0.55. Fisher exact on fraction above threshold: OR = 1.21, p = 0.45. Ashman’s D shows the same null result. Hypothesis 4 is not supported.
The null result is explained by a severe confound: bimodality coefficient is almost perfectly correlated with dropout sparsity (Pearson r = −0.983 between BC and fraction of nonzero cells). In scRNA-seq data, lowly expressed genes appear bimodal simply because many cells record zero counts (technical dropout) while others record nonzero counts. This creates a trivial zero/nonzero bimodality that dominates any biological signal.
The per-gene detail illustrates the problem. Among Class III genes, the highest bimodality coefficients belong to the most sparsely detected genes — RAD51 (BC = 0.82, only 25% of cells nonzero) and BUB1B (BC = 0.73, 35% nonzero). MYC, perhaps the most famous bistable circuit in cancer biology, has the lowest BC (0.42) because it is highly expressed (89% nonzero) and its dropout-driven bimodality is minimal. The bimodality coefficient is measuring detection probability, not biological bistability.
This does not mean Hypothesis 4 is wrong — it means that scRNA-seq mRNA snapshots are the wrong instrument for detecting the hidden attractor states postulated by the theory. The attractor states may be encoded in the chromatin accessibility landscape (measurable by scATAC-seq), in protein levels (not measurable by scRNA-seq), or in the combinatorial state of multiple genes rather than in any single gene’s marginal distribution. The fact that grammar-blind models do show a Class III accuracy deficit (Section 8.1) while the bimodality analysis does not detect bimodal distributions is itself consistent with the “hidden state” interpretation: the attractor states are hidden from simple distributional summaries just as they are hidden from the models.
CellOracle yields per-gene Pearson correlations on the K562 essential screen. Our CellOracle run scored 17 transcription factors; eight of these lie in the 780-gene benchmark intersection used for the 14 grammar-blind methods. This is too small for formal class-level hypothesis tests, but it supports a transparent case study of grammar advantage defined as CellOracle r minus the mean r across the 14 methods for the same gene (see run_celloracle_grammar_advantage.py and results/celloracle_grammar_advantage.tsv).
| Class | N (TFs in benchmark ∩ CellOracle) | Mean grammar advantage | Median grammar advantage |
|---|---|---|---|
| I | 5 | −0.796 | −0.856 |
| II | 2 | −0.880 | −0.880 |
| III | 1 (MYC) | −0.557 | −0.557 |
Advantage is negative throughout (CellOracle is far below the grammar-blind mean on this metric). For MYC, the only Class III overlap, CellOracle r = 0.260 vs. 14-method mean r = 0.817 (advantage −0.557), which is less negative than the Class I or II TF means — suggestive, but N = 1 for Class III and the comparison is TF-only.
We added a deliberately minimal grammar-aware baseline: signed linear diffusion on the TRRUST human network (Han et al. 2018). A perturbation is seeded as a negative impulse at the knocked-out gene; damped propagation along activation (+1) and repression (−1) edges yields a predicted shift vector, scored against observed shifts using the same top-20 DE genes per perturbation as the DE20 rescoring (see run_grn_propagation.py).
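The propagation step can be written in closed form. This is a minimal sketch assuming a signed adjacency matrix built from TRRUST edges; the geometric-series damping shown here is one standard formulation, offered as our reading of the baseline rather than a line-for-line copy of run_grn_propagation.py:

```python
import numpy as np

def propagate(W: np.ndarray, ko_index: int, alpha: float = 0.5) -> np.ndarray:
    """Damped signed diffusion: x = sum_k alpha^k (W^T)^k s = (I - alpha W^T)^-1 s.

    W[i, j] = +1 if gene i activates gene j, -1 if it represses it.
    The series converges when alpha * spectral_radius(W) < 1.
    """
    n = W.shape[0]
    s = np.zeros(n)
    s[ko_index] = -1.0                  # knockout seeded as a negative impulse
    return np.linalg.solve(np.eye(n) - alpha * W.T, s)

# Toy 3-gene chain: gene 0 activates gene 1, gene 1 represses gene 2
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 0.0, 0.0]])
shift = propagate(W, ko_index=0)
print(shift.round(3))   # knocking out gene 0 lowers gene 1 and raises gene 2
```

The predicted shift vector is then scored against the observed per-gene shifts over the same top-20 DE genes used in the DE20 rescoring.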
On the essential-screen object (310,385 cells × 8,563 genes), TRRUST contributed 2,134 nonzero entries in the full gene×gene adjacency; 242 perturbations met inclusion criteria (non-control, ≥5 cells, TRRUST connectivity, nonzero DE variance). Mean Pearson r across scored genes was 0.296 (median 0.290). Class-stratified means on scored genes were Class I 0.308 (N = 211), II 0.325 (N = 12), III 0.128 (N = 14), IV 0.178 (N = 2), V 0.196 (N = 3). On the 780-gene benchmark subset intersecting scored genes, Class I mean r = 0.294 (N = 81), Class III mean r = 0.133 (N = 12).
When GRN_Propagation is appended to the Section 8.1 meta-analysis alongside the 14 benchmark methods, STATE_HVG, and STATE_DE20 (merge_hypothesis2_meta.py), the Class I vs. Class III contrast is C3−C1 = −0.161 — a larger Class III deficit than for typical grammar-blind methods (e.g. GEARS C3−C1 = −0.047). Across 17 methods, 13 show C3 < C1 (sign test p = 0.0245; one-sample t = −3.44, p = 0.0017). Thus a topology-only TRRUST baseline does not rescue Class III under this metric; if anything, it underperforms the blind ensemble on the class gap. Per-gene scores: results/grn_propagation_scores.tsv.
| Hypothesis | Verdict | Summary |
|---|---|---|
| H1: Accuracy Gradient | Partially Supported | Class III (bistable) is harder than Class I across 12/16 methods (p = 0.0015). But the full monotonic gradient I → V does not appear — Classes II and V are not intermediate. |
| H2: Grammar-Aware Advantage | Tested; Not Supported | CellOracle case study (Section 8.7): grammar advantage negative on eight benchmark TFs. TRRUST propagation baseline (Section 8.8): Class III mean r below Class I; meta-analysis C3−C1 = −0.16 vs. ≈−0.05 for typical blind methods. Topology and literature GRN alone do not outperform grammar-blind scores here. |
| H3: Class I Ceiling | Supported | Class I mean accuracy = 0.78 across 14 benchmark methods, exceeding the 0.7 threshold. |
| H4: Bistability Signature | Not Supported (Confounded) | Bimodality analysis found no significant difference between Class III and Class I (BC: p = 0.61). However, scRNA-seq dropout sparsity dominates the bimodality coefficient (r = −0.98), rendering the test uninformative. Requires protein-level or chromatin-accessibility data. |
| H5: Class V Hard Ceiling | Not Supported | Class V genes (N = 8) show higher average accuracy (0.87) than Class I (0.78). Sample size is too small for strong conclusions. |
| H6: DEG Recovery Gradient | Not Tested | DEG recovery F1 scores were not available from the benchmark data format. |
| H7: Network Concordance | Partially Addressed | CellOracle GRN topology was analyzed but formal Cohen’s kappa concordance testing was not completed. |
The most robust finding is that Class III (bistable/positive feedback) genes are systematically harder for grammar-blind models to predict than Class I (feed-forward) genes. This effect is small in absolute terms (mean accuracy deficit ≈ 0.03 Pearson units) but highly consistent across architecturally diverse methods: transformers, graph neural networks, variational autoencoders, meta-learning models, virtual cell models, and simple linear baselines all show the same direction. The one-sample t-test (p = 0.0015) confirms that the mean effect is significantly different from zero.
This is the first empirical evidence that computational complexity class — as defined by regulatory circuit topology — correlates with the difficulty of predicting perturbation responses. It does not prove a causal relationship, but it is consistent with the theoretical prediction that bistable circuits contain hidden state (which attractor the cell is in) that grammar-blind models cannot directly observe.
The expansion analysis (Section 8.5) sharpens this interpretation. The difficulty is not a property of positive feedback per se, but of persistent bistable circuits where the attractor state constitutes a hidden variable. In these circuits, the product of the regulatory process sustains the conditions for its own continued production — once the circuit “nucleates” into one attractor state, it remains there independently of the initiating signal, like a crystal growing outward from a seed. A grammar-blind model observing a snapshot of the transcriptome sees the current expression state but cannot see which nucleation event has already occurred, which attractor basin the cell has committed to, or which direction the self-sustaining loop is running. This is the specific information that a grammar-aware model — one that represents the circuit topology explicitly — could in principle recover.
Hypothesis 1 predicted a monotonic accuracy decrease from Class I through Class V. This is not observed. Classes II (negative feedback) and V (self-modifying chromatin) show average accuracy higher than Class I, not lower.
We interpret this through the lens of the classification itself.
The framework may need revision on two fronts. First, rather than a monotonic five-class gradient, the operative distinction for prediction difficulty may be narrower: persistent bistable circuits (a subset of Class III) vs. everything else. Second, the expansion analysis (Section 8.5) demonstrates that positive feedback is necessary but not sufficient for prediction difficulty — the bistable circuit must create a persistent hidden state that is not well-sampled in the training data. Cell cycle bistable switches (mitotic entry/exit) are transient and well-sampled; cell fate bistable switches (hematopoietic commitment, proliferative lock-in) are persistent and poorly sampled. This is a refinement, not a rejection, of the original conjecture — and it generates a testable sub-prediction for future work.
STATE (Arc Institute) is the most architecturally advanced model in our analysis — a virtual cell model trained on over 100 million perturbed cells across 70 contexts. Its inclusion is important for covering the state of the art. The initial evaluation on 2,000 HVGs (mean r = 0.088) showed STATE as an apparent outlier in the class-stratified analysis, with Class III appearing easier rather than harder. This raised the question of whether STATE’s architecture genuinely handles bistable circuits better, or whether the discrepancy was an artifact of the different evaluation metric.
Re-scoring STATE on the top-20 DE genes per perturbation — the same metric used by the 14 benchmark methods — resolved this question definitively. Under the DE20 metric, STATE shows Class III as harder than Class I (C3−C1 = −0.031), consistent with every other method that uses the same metric. The metric mismatch, not the model architecture, was responsible for the discrepancy.
A secondary finding is that STATE’s absolute performance on DE20 (mean r = 0.003) is dramatically lower than the benchmark methods (mean r ≈ 0.78), and worse than its own HVG score. This suggests that STATE’s zero-shot predictions do not capture the specific genes with the largest perturbation effects — a finding that may be relevant to evaluating virtual cell models more broadly, but does not affect the class-stratified analysis, which compares within-method class differences.
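The DE20 re-scoring can be sketched as below. The actual pipeline lives in STATE_Rescore_DE20.ipynb; the arrays here are synthetic stand-ins, and the point is only the metric definition — Pearson r restricted to the 20 genes with the largest observed perturbation effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes = 2000
true_delta = rng.normal(0.0, 1.0, n_genes)                     # observed log-fold changes
pred_delta = 0.5 * true_delta + rng.normal(0.0, 1.0, n_genes)  # model predictions

# DE20: correlate only on the top-20 genes by observed effect size.
top20 = np.argsort(np.abs(true_delta))[-20:]
r_de20, _ = stats.pearsonr(true_delta[top20], pred_delta[top20])
print(f"DE20 Pearson r = {r_de20:.3f}")
```

Because the gene subset is chosen from the observed response, a model can score well on 2,000 HVGs while scoring poorly on DE20 if its errors concentrate on exactly the strongest responders — which is what the STATE result suggests.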
If Class III genes are harder for grammar-blind models because these models cannot represent bistable topology, then grammar-aware models should show a smaller (or absent) Class III penalty. Sections 8.7–8.8 add two direct probes. Neither supports the strong form of Hypothesis 2: CellOracle remains far below the grammar-blind mean on the overlapping TF set, and the TRRUST diffusion baseline shows a steeper Class I–III gap than most blind methods, not a shallower one. Interpreting this requires care: TRRUST is incomplete, undirected effects and context are omitted, and the baseline does not learn from expression. The fair conclusion is narrower: knowing a literature GRN and running a simple signed propagation is not sufficient to beat grammar-blind predictors on DE20, and Class III remains hard (or harder) for that baseline. Richer grammar-aware models (trained simulators, multi-omic state) remain untested at genome scale.
Published tooling gap. Few perturbation predictors take a curated directed GRN with signed edges as a first-class input for arbitrary gene knockdowns; CellOracle is GRN-based but TF-limited; GEARS uses a functional graph rather than regulatory edges. That gap delayed a clean Hypothesis 2 test; the propagation baseline is an attempt to close it with a transparent, reproducible reference implementation.
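The flavor of the propagation baseline can be conveyed with a toy version. This is a minimal sketch in the spirit of run_grn_propagation.py, not its actual implementation — the real script's damping factor, traversal depth, and tie-handling may differ, and the three-gene network here is invented for illustration:

```python
# Toy directed GRN: edges[src][tgt] = +1 (activation) or -1 (repression).
edges = {"TF_A": {"G1": +1, "G2": -1}, "G1": {"G3": +1}}

def propagate_knockdown(edges, perturbed, damping=0.5, depth=3):
    """Predict signed downstream effects of knocking down `perturbed`
    by diffusing a -1 signal along signed directed edges with damping."""
    effect = {perturbed: -1.0}            # knockdown lowers the perturbed gene
    frontier = {perturbed: -1.0}
    for _ in range(depth):
        nxt = {}
        for src, val in frontier.items():
            for tgt, sign in edges.get(src, {}).items():
                contrib = damping * val * sign
                nxt[tgt] = nxt.get(tgt, 0.0) + contrib
                effect[tgt] = effect.get(tgt, 0.0) + contrib
        frontier = nxt
    return effect

pred = propagate_knockdown(edges, "TF_A")
# Knocking down TF_A: G1 down (activated target), G2 up (repressed target),
# G3 down (two-step activation path).
print(pred)
```

Note what this baseline cannot do: a self-loop in this scheme just damps toward zero, so a persistent bistable attractor is invisible to it — consistent with the observation that Class III remains hard for the propagation baseline.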
The persistent bistability finding invites a deeper question about the nature of biological conditionals. In classical computation, a conditional P → Q is propulsive: the input P causes the output Q. Causation flows left to right, from signal to response, like a domino falling and pushing the next. But in many biological regulatory circuits, the direction of physical causation runs the other way: the product Q creates the conditions for its own continued production, pulling P into the relationship rather than being pushed by it. We call these attractive conditionals.
The distinction is precise. Consider a transcription factor that activates its own promoter (MYC autoactivation, GATA1 self-reinforcement). The formal logic reads: if MYC protein is present, then MYC gene is transcribed. But the physical reality is that MYC protein sustains the conditions for its own production — Q maintains P, not the other way around. The conditional is not a one-way gate but a self-sustaining loop with a preferred formal direction and an actual thermodynamic pull that runs against it.
This is the siphon problem: once started, the process sustains itself without the initiating signal. The human sucks on the siphon pipe to start the flow; once started, gravity maintains it. The initiating cause is no longer necessary. In biological terms, a developmental signal (the “sucking”) triggers GATA1 expression above a threshold; GATA1 then activates its own promoter and the circuit locks into the high-GATA1 attractor state independently of the original signal. The siphon is running.
The crystallization metaphor sharpens this further. The first crystal bond — the nucleation event — is the hardest step: it requires the most improbable conditions, the highest energy barrier, the most precise molecular alignment. Once that first bond forms, every subsequent bond is easier because the existing structure provides a template. The product scaffolds its own continuation. In the gene regulatory context, nucleation is the moment a cell crosses the threshold from one attractor basin to another: the first few molecules of GATA1 that tip the autoactivation loop into self-sustaining mode, committing the cell to erythroid fate.
This connects directly to the empirical finding. Grammar-blind models fail on persistent bistable circuits because the nucleation event has already happened and left no unambiguous trace in the transcriptomic snapshot. The model sees a cell expressing GATA1 at some level, but cannot determine whether the autoactivation loop has nucleated (the cell is committed to erythroid fate and the siphon is running) or whether the expression reflects a transient fluctuation that will decay. The two states produce different perturbation responses, but the snapshot alone cannot distinguish them. The hidden variable is not just which attractor the cell is in but whether the nucleation event has occurred — whether the crystal has started growing.
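The hidden-attractor argument can be made concrete with a minimal autoactivation model. This is a textbook Hill-function positive-feedback ODE with illustrative parameters, not a fitted model of GATA1 or MYC: two cells with the identical circuit but different histories settle into different steady states, which is precisely the state a single transcriptomic snapshot near the threshold cannot disambiguate.

```python
def simulate(x0, basal=0.05, beta=1.0, K=0.5, n=4, gamma=1.0,
             dt=0.01, steps=5000):
    """Euler integration of dx/dt = basal + beta*x^n/(K^n + x^n) - gamma*x,
    an autoactivation loop with leaky basal expression and linear decay."""
    x = x0
    for _ in range(steps):
        dx = basal + beta * x**n / (K**n + x**n) - gamma * x
        x += dt * dx
    return x

low = simulate(0.0)    # never nucleated: settles in the low attractor
high = simulate(1.0)   # past the threshold: locks into the high attractor
print(f"low-start steady state  = {low:.3f}")
print(f"high-start steady state = {high:.3f}")
```

With these parameters the unstable threshold sits near x ≈ 0.46; any initial condition below it decays to the low state and any above it is pulled into the self-sustaining high state, independently of whatever signal set the initial condition.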
By contrast, transient bistable circuits (the cell-cycle switches that the expansion analysis identified as easy to predict) do not pose this problem. Every K562 cell passes through both sides of the mitotic switch every division. There is no irreversible nucleation — the switch resets after each cycle. Both states are abundantly sampled in the training data, and a grammar-blind model has ample examples to learn the perturbation response from either state.
The attractive conditional framework suggests that the GLMP complexity classification may benefit from a finer distinction within Class III: circuits where the product merely amplifies its own production (transient positive feedback) vs. circuits where the product sustains the conditions for its own existence and the initiating signal is no longer necessary (persistent attractive conditionals). Only the latter create the hidden nucleation states that grammar-blind models cannot resolve. This distinction — between propulsive and attractive conditionals in biological regulation — is developed further in a companion theoretical treatment (in preparation).
If circuit class predicts accuracy, then aggregate benchmark scores are misleading. A model reporting r = 0.78 is averaging over a mixture: high accuracy on feed-forward genes and lower accuracy on bistable genes. Two models with identical aggregate scores could differ dramatically in their Class III performance — and the one that handles bistable circuits better is arguably the more biologically meaningful model.
We recommend that future perturbation prediction benchmarks report class-stratified accuracy alongside aggregate scores. This requires only a publicly released gene circuit classification table (which we provide as gene_circuit_classes.tsv) and a re-analysis of existing benchmark outputs.
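The re-analysis is a simple join-and-group. A minimal sketch follows; the column names "gene", "class", and "pearson_r" are illustrative and should be adapted to the actual schemas of gene_circuit_classes.tsv and a benchmark's per-gene score table:

```python
import pandas as pd

# Toy stand-ins for gene_circuit_classes.tsv and a per-gene score table.
classes = pd.DataFrame({
    "gene":  ["GATA1", "MYC", "TP53", "RPL3"],
    "class": ["III",   "III", "II",   "I"],
})
scores = pd.DataFrame({
    "gene":      ["GATA1", "MYC", "TP53", "RPL3"],
    "pearson_r": [0.61,    0.55,  0.80,   0.84],
})

# Join per-gene scores onto circuit classes, then report per-class means
# alongside the single aggregate mean that benchmarks currently publish.
merged = classes.merge(scores, on="gene", how="inner")
by_class = merged.groupby("class")["pearson_r"].agg(["mean", "count"])
print(by_class)
print(f"aggregate mean r = {merged['pearson_r'].mean():.3f}")
```

The aggregate line is what benchmarks report today; the grouped table is the stratified view this paper recommends adding.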
We performed the first empirical test of the conjecture that computational complexity class — determined by gene regulatory circuit topology — predicts the accuracy of AI models of cellular behavior. Analyzing 780 genes across 16 grammar-blind perturbation prediction methods (including Arc Institute’s STATE, re-scored on a common metric), we find a statistically significant effect: genes regulated by Class III (bistable/positive feedback) circuits are systematically harder to predict than genes regulated by Class I (feed-forward) circuits (t = −3.55, p = 0.0015). The effect is consistent across diverse model architectures, from simple linear baselines to state-of-the-art virtual cell models.
The full five-class monotonic gradient predicted by Hypothesis 1 does not appear: Classes II (negative feedback) and V (self-modifying) do not show the expected intermediate difficulty, though sample sizes for these classes are small. A robustness test that expanded Class III to include transient cell-cycle bistable genes (N = 14 → 22) abolished the signal, revealing that the difficulty effect is specific to persistent bistable circuits — those that lock cells into stable attractor states (cell fate commitments, proliferative lock-in) — rather than transient bistable switches that every cell cycles through. This refines the original conjecture: the theoretically predicted difficulty arises not from positive feedback per se, but from the hidden attractor states that persistent bistable circuits create and that grammar-blind models, observing a transcriptomic snapshot, cannot distinguish.
Hypothesis 2 was addressed with a TF-only CellOracle case study and a TRRUST propagation baseline (Sections 8.7–8.8). Neither shows the hypothesized grammar advantage; richer grammar-aware architectures at full-genome scale remain the most important open direction.
We recommend that future virtual cell model benchmarks report class-stratified accuracy alongside aggregate scores. The gene circuit classification table (gene_circuit_classes.tsv) released with this paper enables immediate adoption of this practice.
This work was conducted as an independent research project. STATE model inference was run on Google Colab Pro (Tesla T4 GPU). The 14-method benchmark scores are derived from the supplementary data of Kernfeld et al. (2025). The TRRUST regulatory network (Han et al. 2018) provided the foundation for gene circuit classification. We thank the authors of these resources for making their data publicly available.
All data and code necessary to reproduce the analyses in this paper are available in the public GitHub repository:
- k562-empirical-sequel/ — tabulated scores and figures in results/; gene_circuit_classes.tsv at repository root for script compatibility
- gene_circuit_classes.tsv (repository root) — GLMP complexity class (I–V) for each of the 780 benchmark genes, with classification rationale and literature references
- empirical_sequel_draft.html (GCS); repository copy k562-empirical-sequel/empirical_sequel_draft_v2.html
- Per-gene scores: results/merged_scores.tsv (14 benchmark methods), results/state_k562_per_gene_scores.tsv (STATE HVG), results/state_k562_de20_scores.tsv (STATE DE20), results/grn_propagation_scores.tsv (TRRUST signed propagation, DE20 metric), results/celloracle_grammar_advantage.tsv (CellOracle vs. 14-method mean)
- GRN topology analysis: results/grn_topology_features.tsv, results/grn_topology_vs_accuracy.tsv
- Bimodality analysis: results/bimodality_analysis.tsv, results/bimodality_report.txt
- Scripts: k562-empirical-sequel/scripts/ — merge_state_results.py, merge_state_de20_results.py, merge_hypothesis2_meta.py, run_grn_propagation.py, run_celloracle_grammar_advantage.py, run_bimodality_analysis.py, compute_de_genes.py (run from repository root; requires the Figshare .h5ad and regulatory_data/trrust_rawdata.human.tsv for propagation)
- Notebooks: k562-empirical-sequel/STATE_K562_Benchmark_v4.ipynb (inference), k562-empirical-sequel/STATE_Rescore_DE20.ipynb (DE20 re-scoring)

The K562 Perturb-seq benchmark data (Replogle et al. 2022) is available at Figshare. The GLMP flowchart database and companion theoretical papers are hosted at the GLMP HuggingFace Space and GLMP database.
This study can be reproduced with the following steps:
1. Download the benchmark data (ReplogleWeissman2022_K562_essential.h5ad) from Figshare.
2. Clone the repository: git clone https://github.com/garywelz/glmp
3. Run STATE_K562_Benchmark_v4.ipynb on Google Colab (requires GPU runtime, ~2 hours on T4).
4. Run STATE_Rescore_DE20.ipynb on Google Colab (~15 minutes).
5. Merge the STATE scores: python3 merge_state_results.py and python3 merge_state_de20_results.py
6. Hypothesis 2 analyses: python3 run_grn_propagation.py then python3 merge_hypothesis2_meta.py; CellOracle comparison: python3 run_celloracle_grammar_advantage.py (requires results/celloracle_k562_per_gene_scores.tsv)
7. Bimodality analysis: python3 run_bimodality_analysis.py

All analyses were performed using Python 3.12 with NumPy, SciPy, pandas, scanpy, and anndata. The STATE model (arcinstitute/ST-HVG-Parse) was accessed from HuggingFace Hub. A standard Google Colab Pro environment with T4 GPU and High-RAM runtime is sufficient for all computations.
Companion papers:
- Welz, G. Primitive Relations, Computational Complexity, and a Conjecture on the Genomic Computational Class. GLMP Working Paper, 2026. (Paper I)
- Welz, G. The Genome as Computer: Logical Primitives, Runtime States, and the Computational Limits of Biological Prediction. GLMP Working Paper, 2026. (Paper II)
- Welz, G. Circuit Class Predicts Virtual Cell Model Accuracy (this empirical sequel). (Paper III)