The Drug Discovery Pipeline

One possible pipeline, with the fragmented set of models that could help at each stage.

01 Target identification: Which gene drives disease?
02 Target structure: What does it look like?
03 Hit discovery: Find candidate molecules
04 Patient stratification: Who will respond?
05 Biomarker discovery: How do we measure response?
06 Tissue pathology: Validate in biopsy

01 Genomic / DNA
Target identification
Which gene drives disease?
Evo 2
Arc Institute
In: DNA sequence
Out: variant effects
40B params, 9T nucleotides. Genome-scale variant scoring across all domains of life.
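A common zero-shot recipe for variant scoring with a sequence model is to compare the model's log-likelihood of the alternate sequence against the reference. The sketch below illustrates only that idea. The function names are hypothetical, and `toy_log_likelihood` is a made-up stand-in for a real model call, not Evo 2's API.

```python
from typing import Callable


def apply_snv(ref_seq: str, pos: int, alt: str) -> str:
    """Return ref_seq with the base at 0-based index `pos` replaced by `alt`."""
    return ref_seq[:pos] + alt + ref_seq[pos + 1:]


def variant_score(ref_seq: str, pos: int, alt: str,
                  log_likelihood: Callable[[str], float]) -> float:
    """Log-likelihood of the alternate sequence minus that of the reference.

    Negative scores mean the model finds the variant less plausible.
    """
    return log_likelihood(apply_snv(ref_seq, pos, alt)) - log_likelihood(ref_seq)


def toy_log_likelihood(seq: str) -> float:
    # ASSUMPTION: placeholder scorer that penalizes A/T bases.
    # A real pipeline would call a trained genomic language model here.
    return sum(0.0 if base in "GC" else -1.0 for base in seq)


score = variant_score("ATGCGC", pos=0, alt="G", log_likelihood=toy_log_likelihood)
print(score)
```

In practice the sequence passed to the model would be a window of genomic context around the variant rather than a six-base toy string.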
Enformer
Google DeepMind
In: 200 kb DNA window
Out: gene expression, chromatin
Maps long-range regulatory sequences to expression. Captures distal enhancer effects.
Nucleotide Transformer
InstaDeep / NVIDIA / TUM
In: nucleotide k-mers
Out: genomic embeddings
Trained on 3,200 genomes. Strong on regulatory annotation and variant effect prediction.
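To make "nucleotide k-mers" concrete, here is a minimal sketch of non-overlapping k-mer tokenization. It assumes k = 6 and that leftover bases are emitted as single-nucleotide tokens. The function name is hypothetical, and this is not the model's actual tokenizer.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers.

    Bases left over after the last full k-mer become single-base tokens.
    """
    seq = seq.upper()
    n_full = (len(seq) // k) * k  # length covered by complete k-mers
    tokens = [seq[i:i + k] for i in range(0, n_full, k)]
    tokens.extend(seq[n_full:])  # trailing bases, one token each
    return tokens


print(kmer_tokenize("ATGCGTACGTTAGC"))
```

Non-overlapping k-mers shorten the token sequence by roughly a factor of k, which is one way to fit longer genomic windows into a fixed context length.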
DNABERT-2
Northwestern University
In: DNA tokens
Out: sequence embeddings
Multi-species model. Promoter detection, TF binding, splice site prediction.
02 Protein structure
Target structure & druggability
What does it look like?
AlphaFold 3
Google DeepMind · 2024 Nobel
In: sequence + ligands
Out: 3D complex structure
Diffusion-based. Models protein–DNA–RNA–small molecule complexes at atomic resolution.
ESM-2 / ESM-3
Meta AI (ESM-2) / EvolutionaryScale (ESM-3)
In: AA sequence
Out: embeddings, structure
ESM-2 scales up to 15B params. Zero-shot fitness prediction and single-sequence structure inference.
RoseTTAFold All-Atom
Baker Lab, UW · Science 2024
In: sequence + small molecules
Out: full biomolecular assembly
Three-track architecture. Models proteins, nucleic acids, metals, and small molecules simultaneously.
OpenFold3
OpenFold Consortium · Oct 2025
In: protein + ligand
Out: 3D structure
Fully open-source AF3-class model. 300K+ experimental + 13M synthetic structures. Community fine-tunable.
03 Small molecule
Hit discovery & lead optimization
Find candidate molecules
ChemBERTa-2
DeepChem / Reverie Labs
In: SMILES strings
Out: property predictions
77M SMILES from PubChem. Toxicity, solubility, and bioactivity classification.
RFdiffusion
Baker Lab, UW
In: target protein structure
Out: de novo protein binder candidates
Diffusion-based generative design of novel protein binders and scaffolds, conditioned on 3D target structure.
MolBERT
BenevolentAI
In: SMILES + physicochemical
Out: ADMET predictions
Combines sequence and property-based pretraining. Drug-likeness and ADMET scoring.
Tx-LLM
Google Research / Google DeepMind · arXiv 2024
In: molecule + disease text
Out: clinical outcome predictions
Fine-tuned from PaLM-2 on 709 datasets spanning 66 therapeutic tasks. Links chemical structure to clinical trial endpoints.
04 Single-cell
Patient stratification
Who will respond?
Cell2Sentence
Yale / Google Research · ICML 2024
In: scRNA-seq as gene-name text
Out: cell type, perturbation, text
Converts expression profiles to "cell sentences." Enables LLM reasoning over single-cell biology. Scales to 27B params.
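The "cell sentence" idea can be sketched as ranking a cell's expressed genes from highest to lowest expression and writing out the gene names in that order. The function name, the `top_k` cutoff, and the toy expression values below are illustrative assumptions, not the published preprocessing.

```python
def cell_to_sentence(expression: dict[str, float], top_k: int = 100) -> str:
    """Turn one cell's expression profile into a space-separated gene-name string.

    Genes with zero expression are dropped; ties are broken alphabetically
    so the output is deterministic.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    ranked = sorted(expressed, key=lambda gv: (-gv[1], gv[0]))
    return " ".join(gene for gene, _ in ranked[:top_k])


# Toy profile (made-up values) for a single cell.
cell = {"CD3E": 12.0, "ACTB": 40.0, "MS4A1": 0.0, "IL7R": 7.0}
print(cell_to_sentence(cell))
```

Because the result is ordinary text, it can be passed to a standard language-model tokenizer without a custom input layer.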
scGPT
University of Toronto
In: scRNA-seq counts
Out: cell embeddings, perturbation
Pretrained on 33M human cells across 51 organs. Reports strong results on rare cell type prediction and in silico perturbation.
Geneformer
Broad Institute
In: rank-ordered genes
Out: gene network inference
Rank-value encoding. Excels with limited labels. Used for disease modeling and network dosage sensitivity.
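Rank-value encoding can be sketched as follows: normalize each gene's count by the cell's total, divide by a corpus-wide reference level for that gene, then order genes by the result. The function name, the `max_len` cutoff, and the toy medians are assumptions for illustration, not the released pipeline.

```python
def rank_value_encode(counts: dict[str, float],
                      gene_medians: dict[str, float],
                      max_len: int = 2048) -> list[str]:
    """Order a cell's genes by depth-normalized expression relative to a
    per-gene corpus median, highest first."""
    total = sum(counts.values())
    if total <= 0:
        return []
    scored = []
    for gene, count in counts.items():
        median = gene_medians.get(gene, 0.0)
        if count > 0 and median > 0:
            scored.append((gene, (count / total) / median))
    scored.sort(key=lambda gs: (-gs[1], gs[0]))  # descending score, then name
    return [gene for gene, _ in scored[:max_len]]


# Toy data: ACTB has the highest raw count but is high in every cell,
# so its high corpus median pushes it down the ranking.
counts = {"ACTB": 50, "GATA4": 5, "TTN": 20, "MALAT1": 0}
medians = {"ACTB": 0.5, "GATA4": 0.01, "TTN": 0.1, "MALAT1": 0.4}
print(rank_value_encode(counts, medians))
```

Dividing by the corpus median deprioritizes ubiquitously expressed genes and moves genes that distinguish cell state toward the front of the sequence.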
scFoundation
Tsinghua University
In: full transcriptome
Out: cell-type classification
Read-depth-aware pre-training on 50M+ cells. Reports strong zero-shot cell type annotation performance.
05 Multimodal
Biomarker discovery
How do we measure response?
Nicheformer
Helmholtz Munich
In: scRNA + spatial coords
Out: spatial cell states, niches
Integrates single-cell and spatial omics. Models cell neighborhood context for tissue microenvironment.
LucaOne
Alibaba Cloud · Nature Machine Intelligence
In: nucleic acid + protein
Out: unified embeddings
Pre-trained on 169,861 species. Emergent understanding spanning the central dogma.
Evo (original)
Arc Institute
In: DNA → RNA → protein
Out: cross-modal generation
First model trained across the central dogma. StripedHyena architecture for genome-length context.
Bioptimus H-series
Bioptimus
In: histology + transcriptomics + clinical
Out: unified tissue representations
Multimodal "world model" roadmap. Jointly trains on pathology, spatial omics, and clinical metadata.
06 Pathology imaging
Tissue pathology & validation
Validate in biopsy
H-Optimus-0
Bioptimus · 1.1B params
In: histopathology slide tiles
Out: tissue embeddings
500K+ slides, 4,000 clinical centers. First in planned multimodal bio world model series.
UNI
Harvard / Mass General Brigham (Mahmood Lab)
In: whole-slide image tiles
Out: pathology embeddings
Self-supervised on 100K+ slides. Zero-shot cancer subtyping, mutation prediction, survival analysis.
Prov-GigaPath
Microsoft / Providence Health · Nature
In: gigapixel slide tiles
Out: slide-level representations
1.3B tissue image tiles. Long-range spatial attention across entire slides.
CONCH
Harvard / Mass General Brigham (Mahmood Lab)
In: pathology image + text
Out: vision-language embeddings
Contrastive vision-language model. Zero-shot cancer classification via text-prompted queries over tissue.
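Text-prompted zero-shot classification with a contrastive model generally works by embedding the image and one text prompt per class into a shared space, then picking the class whose prompt is most similar to the image. The sketch below shows only that final comparison step. The embeddings are made-up toy vectors and the function names are hypothetical; a real pipeline would obtain the vectors from the model's image and text encoders.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb: list[float],
                       prompt_embs: dict[str, list[float]]) -> tuple[str, dict[str, float]]:
    """Return the best-matching class label and the per-class similarity scores."""
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# ASSUMPTION: toy 3-d embeddings standing in for encoder outputs.
prompt_embs = {
    "invasive ductal carcinoma": [0.9, 0.1, 0.0],
    "invasive lobular carcinoma": [0.1, 0.9, 0.0],
}
tile_emb = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(tile_emb, prompt_embs)
print(label)
```

Adding a new class only requires writing a new text prompt; no labeled tiles or retraining are needed for this step.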