Research Folio · No. I · MMXXVI

Gabriel Abdelnoor

Mostly virology — by way of agents

I was never much of a coder. I let the agents write the code and I worry about the biology — which makes the loop from question to result a lot tighter.


About the Naturalist

A specimen cabinet of obsessions

five drawers, one through-line: do the models actually hold up?

Drawer I

Virology

RNA viruses of coral reefs — mostly dark matter: unclassified, eukaryote-host, waiting to be organized.

Drawer II

Foundation Models

ESM2, EVO2, and what their embeddings genuinely encode — versus what we wish they did.

Drawer III

Agentic Workflows

I don't write the code — agents do. I shape the question and check the biology; the trajectory is the lab notebook.

Drawer IV

Coral Holobionts

The reef as a single organism: host, symbiont, microbe, and virus, read together.

Drawer V

Statistical Rigor

Mantel tests and permutation nulls — and reporting the weak r honestly, including when I've been scooped.

Ask the biology
Build the agents
Test the stats

Bound Research Folio · Centerpiece

EVO2 on out-of-distribution coral RNA viruses

How much taxonomic structure survives in a nucleotide model on viruses it was built to be bad at?

What I actually found

Across families, EVO2's embeddings don't beat a ruler — the apparent taxonomy correlation is mostly a genome-length confound and collapses when you control for it (partial r ≈ +0.13, n.s.). But within the clade where these coral dark-matter viruses live, the geometry tracks taxonomy independent of length: Mantel r = +0.345, partial r = +0.285, both p = 0.0001.

Within the sobemo-like clade: real, length-independent signal
Across all of Riboviria: mostly a genome-length ruler

Caveats, up front: the taxonomy panels are genuinely small (43 cross-family / 63 within-clade taxa); EVO2 was separately benchmarked at ~noise for generating eukaryote-host viruses; and LucaProt (Hou et al. 2024, Cell) had already tackled this question with far more compute.

Panel A

Across families: it's a length confound

Cross-family panel — 43 taxa across 11 classes, full Riboviria breadth. EVO2-7B block-26 embeddings (4096-dim, cosine) vs a curated NCBI taxonomic-distance tree; Mantel (Spearman, 9999 perms), partial controlling log10(length).

Test                          Mantel r   p       Reading
full genome ~ taxonomy        +0.181     0.026   weak, significant on its own
homologous RdRp ~ taxonomy    +0.129     0.098   already marginal
length ~ taxonomy             +0.248             length is the strongest single predictor
taxonomy ~ EVO2 | length      +0.133     0.104   collapses to n.s. once length is controlled
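
For readers who want to see the shape of the test, here is a minimal NumPy sketch of a Spearman Mantel test and its partial variant, assuming one pooled embedding per taxon and a precomputed taxonomic-distance matrix. The function names, the one-sided p-value, and the choice to permute the embedding matrix are illustrative assumptions, not the project's actual code. Under the (hits + 1) / (perms + 1) convention used here, 9999 permutations cannot return a p below 1/10000 = 0.0001.

```python
import numpy as np

def cosine_distance_matrix(emb):
    """Pairwise cosine distance between row vectors (one pooled embedding per taxon)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, 0.0)
    return dist

def length_distance_matrix(lengths_nt):
    """Pairwise |log10(length_i) - log10(length_j)|."""
    logs = np.log10(np.asarray(lengths_nt, dtype=float))
    return np.abs(logs[:, None] - logs[None, :])

def _rank_matrix(dist):
    """Replace off-diagonal distances with average ranks, so Pearson on ranks = Spearman."""
    iu = np.triu_indices_from(dist, k=1)
    _, inverse, counts = np.unique(dist[iu], return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)
    ranks = ((2 * ends - counts + 1) / 2.0)[inverse]  # mean rank within each tie group
    out = np.zeros(dist.shape, dtype=float)
    out[iu] = ranks
    return out + out.T

def _pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def mantel(x, y, z=None, perms=9999, seed=0):
    """One-sided (partial) Mantel test; returns (r, p). x is permuted, y and z stay fixed."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(x, k=1)
    rx, ry = _rank_matrix(x), _rank_matrix(y)
    rz = None if z is None else _rank_matrix(z)

    def statistic(mx):
        r_xy = _pearson(mx[iu], ry[iu])
        if rz is None:
            return r_xy
        r_xz = _pearson(mx[iu], rz[iu])
        r_yz = _pearson(ry[iu], rz[iu])
        # First-order partial correlation of x and y given z.
        return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

    observed = statistic(rx)
    hits = 0
    for _ in range(perms):
        order = rng.permutation(x.shape[0])  # relabel taxa: rows and columns together
        if statistic(rx[np.ix_(order, order)]) >= observed:
            hits += 1
    return observed, (hits + 1) / (perms + 1)

# Synthetic stand-ins only; none of this is the project's data.
rng = np.random.default_rng(1)
emb = rng.normal(size=(20, 16))
lengths = rng.integers(2_000, 12_000, size=20)
taxonomy = cosine_distance_matrix(emb + 0.5 * rng.normal(size=emb.shape))
r, p = mantel(cosine_distance_matrix(emb), taxonomy, perms=199)
r_partial, p_partial = mantel(cosine_distance_matrix(emb), taxonomy,
                              z=length_distance_matrix(lengths), perms=199)
```

Ranking each matrix once and then permuting the ranked matrix is valid because relabeling taxa only reorders the same set of distances.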

A model that understood phylogeny should beat a ruler. Across families, it doesn't.

Panel B

Within families: is it generalization, or recognition?

Per-family Mantel r, with each bar tagged by whether that family is plausibly in- or out-of-distribution for EVO2's nucleotide training. The uncomfortable pattern: the signal piles up in the families EVO2 most likely already saw.

Figure 1. Family-level Mantel r, tagged in/out of distribution. The mitochondrial, heavily endogenized Mitoviridae tops the chart while its cytoplasmic sister Narnaviridae flatlines, a sign the within-clade signal may track training exposure rather than phylogenetic generalization. An asterisk (*) marks the coral ASVs' clade. In/out calls are reasoned, not taken from EVO2's training manifest.

Panel C

The headline, within the clade

Verified within-clade panel — 63 taxa (Solemoviridae, Narnaviridae, Tombusviridae and their sisters), homologous RdRp, identity-gated. This small, careful sample is what the whole claim rests on.

Test                                 Value           p        Reading
Mantel: taxonomy ~ EVO2              r = +0.345      0.0001   real correlation with curated taxonomy
partial: taxonomy ~ EVO2 | length    r = +0.285      0.0001   survives controlling for length; not a ruler
family LOO probe (EVO2)              0.889 (56/63)   0.0033   vs 0.238 chance; 3.7× above null
family LOO probe (ESM2)              0.984 (62/63)            protein model nails it; EVO2 is surprisingly close
genus LOO probe                      ≈ 0.73                   chance ≈ 0.18; genus-level resolution exists

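
The page does not say which classifier sits behind the leave-one-out (LOO) probes, so the following is only a sketch of one simple option: a nearest-centroid probe on cosine similarity, with a label-permutation null to estimate chance and a p-value. Every name here is hypothetical, and the data are synthetic.

```python
import numpy as np

def loo_centroid_accuracy(emb, labels):
    """Leave-one-out accuracy of a nearest-centroid probe (cosine similarity)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    correct = 0
    for i in range(len(labels)):
        keep = np.ones(len(labels), dtype=bool)
        keep[i] = False  # hold out taxon i
        best, best_sim = None, -np.inf
        for c in classes:
            members = keep & (labels == c)
            if not members.any():  # singleton class: nothing left to learn from
                continue
            centroid = unit[members].mean(axis=0)
            sim = unit[i] @ centroid / np.linalg.norm(centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        correct += int(best == labels[i])
    return correct / len(labels)

def label_permutation_null(emb, labels, perms=999, seed=0):
    """Return (observed accuracy, mean accuracy under shuffled labels, one-sided p)."""
    rng = np.random.default_rng(seed)
    observed = loo_centroid_accuracy(emb, labels)
    null = np.array([loo_centroid_accuracy(emb, rng.permutation(labels))
                     for _ in range(perms)])
    p = (np.sum(null >= observed) + 1) / (perms + 1)
    return observed, float(null.mean()), float(p)

# Synthetic example: three well-separated "families" of ten taxa each.
rng = np.random.default_rng(2)
toy_labels = np.repeat(["fam_a", "fam_b", "fam_c"], 10)
centers = np.eye(16)[:3]  # one axis per family
toy_emb = centers[np.repeat(np.arange(3), 10)] + 0.05 * rng.normal(size=(30, 16))
accuracy, chance, p_value = label_permutation_null(toy_emb, toy_labels, perms=99)
```

Shuffling labels keeps the class sizes fixed, so the null accounts for class imbalance in a way that a flat 1/k baseline would not.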
Panel D

The composition layer — what's actually driving the signal

It's not phylogeny alone. EVO2's cosine distance correlates with tetranucleotide composition (r ≈ 0.43) and length (r ≈ 0.41) — and so does ESM2's. This is a mean-pooling artifact across both models. But controlling for composition, a deeper residual remains: r ≈ 0.27, p = 1e-4.

Confounding layer                  EVO2 cosine r   ESM2 cosine r   What it means
~ sequence length                  ≈ 0.41          ≈ 0.41          mean-pooling artifact, not nucleotide-specific
~ tetranucleotide composition      ≈ 0.43          similar         composition tracks taxonomy too: confound or signal?
composition-controlled residual    ≈ 0.27                          deeper than composition; what's left after the confounds
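
To make the composition layer concrete, here is a small sketch of a tetranucleotide profile and a pairwise composition-distance matrix. The Euclidean metric and the handling of ambiguous bases are assumptions; the page does not state which were used. A matrix like this could serve as the controlled-for variable in a partial Mantel test.

```python
from itertools import product

import numpy as np

KMERS = ["".join(k) for k in product("ACGT", repeat=4)]  # all 256 tetranucleotides
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def tetranucleotide_profile(seq):
    """Relative frequency of each 4-mer, using overlapping windows."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])  # windows with N or other ambiguity codes are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts

def composition_distance_matrix(seqs):
    """Pairwise Euclidean distance between tetranucleotide profiles (metric is an assumption)."""
    profiles = np.array([tetranucleotide_profile(s) for s in seqs])
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy sequences, for illustration only.
toy = ["ACGTACGTAC", "ACGUACGUAC", "GGGGCCCCGG"]
dist = composition_distance_matrix(toy)
```

Because the profile is a frequency vector, it is length-normalized by construction, which is what lets composition and length be treated as separate confounds.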

Taxonomy ruler robustness: rank metric r = 0.345 · Kimura-corrected r = 0.317 · patristic r = 0.181. The signal stays positive across distance metrics, though it is roughly halved under patristic distance.

Panel E

Concatenation: EVO2 + ESM2

When you concatenate the nucleotide model with the protein model, taxonomy signal jumps and the length confound drops — they're complementary.

EVO2 alone, taxonomy Mantel r: 0.32
EVO2 + ESM2, taxonomy Mantel r: 0.47
EVO2 alone, length confound: 0.41
Concatenated, length confound: 0.26

Nucleotide + protein models are complementary — concatenation strengthens taxonomy and weakens the length confound.
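
One way the concatenation could be done is sketched below. It assumes each model's pooled embedding is L2-normalized before stacking, so that neither block dominates purely because of its dimensionality or scale; the page does not say whether any normalization was applied. Under that assumption, the cosine similarity of two concatenated vectors is exactly the average of the two per-model cosines.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def concatenate_embeddings(nuc_emb, prot_emb):
    """Stack per-taxon nucleotide and protein embeddings side by side.

    Rows must refer to the same taxa in the same order.
    """
    if nuc_emb.shape[0] != prot_emb.shape[0]:
        raise ValueError("both embedding sets need one row per taxon, in the same order")
    return np.hstack([l2_normalize(nuc_emb), l2_normalize(prot_emb)])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy dimensions for illustration; real embeddings would be much wider.
rng = np.random.default_rng(0)
nuc = rng.normal(size=(5, 8))
prot = rng.normal(size=(5, 6))
combined = concatenate_embeddings(nuc, prot)

# With unit-norm blocks, the combined cosine is the mean of the two cosines.
lhs = cosine(combined[0], combined[1])
rhs = 0.5 * (cosine(nuc[0], nuc[1]) + cosine(prot[0], prot[1]))
```

Seen this way, concatenation is an equal-weight vote between the two models' geometries, which is one plausible reading of why a confound carried mostly by one model is diluted.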

Panel F

"Dark matter" is database-relative — the type-II demonstration

The 801 coral RdRp ASVs are "dark" because they don't match curated references. But they're not unknowable — DIAMOND blastp against the full RefSeq-viral database tells a different story.

vs curated panel: 0 / 40 hits

Even with DIAMOND in ultra-sensitive mode, there are no confident hits against the curated reference panel. These proteins are genuinely absent from the taxonomy backbone.

vs RefSeq-viral: 40 / 40 hits

Against the full RefSeq-viral database, every ASV hits — best matches are Beihai sobemo-like virus, ~43% identity, e ≈ 1e-36.

↳ "dark" = absent from curated refs, not unknowable

The 471 bp ASVs are truncated RdRp fragments — the full-length protein exists in RefSeq but the amplicon captures only a sliver. The "darkness" is a database-boundary problem, not a biological one.
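
The hit counts above could be tallied from DIAMOND's default tabular output (`--outfmt 6`, twelve BLAST-style columns) with something like the sketch below. The e-value cutoff for a "confident" hit and all identifiers in the toy rows are assumptions for illustration, not the project's settings or results.

```python
import csv
import io

# Column order of DIAMOND / BLAST tabular format 6.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(handle, max_evalue=1e-5):
    """Best hit per query by bitscore, keeping only rows at or below max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        rec = dict(zip(FIELDS, row))
        evalue, bits = float(rec["evalue"]), float(rec["bitscore"])
        if evalue > max_evalue:
            continue
        prev = best.get(rec["qseqid"])
        if prev is None or bits > prev["bitscore"]:
            best[rec["qseqid"]] = {"sseqid": rec["sseqid"],
                                   "pident": float(rec["pident"]),
                                   "evalue": evalue,
                                   "bitscore": bits}
    return best

def hit_summary(query_ids, best):
    """Return (queries with a confident hit, total queries)."""
    return sum(1 for q in query_ids if q in best), len(query_ids)

# Toy rows with made-up identifiers and values; not real search output.
toy_tsv = (
    "toy_q1\ttoy_ref_A\t43.0\t150\t80\t3\t1\t150\t200\t349\t1e-36\t140.0\n"
    "toy_q1\ttoy_ref_B\t30.0\t120\t80\t4\t5\t124\t10\t129\t1e-3\t45.0\n"
    "toy_q2\ttoy_ref_C\t28.0\t90\t60\t2\t1\t90\t1\t90\t0.5\t30.0\n"
)
hits = best_hits(io.StringIO(toy_tsv))
n_hit, n_total = hit_summary(["toy_q1", "toy_q2", "toy_q3"], hits)
```

Running the same tally against two different reference databases is what turns "dark" into a statement about the database rather than the sequence.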

Panel G

What didn't work

Framed and spotlit, not hidden. The dead ends are the part that proves the method.

Humbling

LucaProt didn't come up

LucaProt — Hou et al. 2024, Cell — didn't surface in my literature search. I thought a solved problem was an open avenue, and was outclassed by teams with GPU clusters years ago.

Confounded

The cross-family signal

It looked like EVO2 had learned phylogeny across all of Riboviria. Controlling for genome length collapsed it to non-significant. It was mostly a ruler.

Made it fit

Stock runtime OOM'd on 12 GB

EVO2-7B wouldn't load on the 3060.

↳ custom StripedHyena CPU layer offload, checked vs the official runtime

An Intermediate Step

The 3D UMAP — artifact or signal?

I projected all 801 embeddings into 3D and flew through them.

Out of curiosity, I did a 3D UMAP projection of the EVO2 embeddings and spun through the point cloud. The pattern was immediate: dark matter clusters near known viruses, sometimes within the same family, sometimes threading between families. There were taxonomy-only clusters, dark-matter-only clusters, and dark matter co-clustering with taxonomically defined sequences. I thought that alone was significant.

But then the question hit: is this a UMAP artifact, or is EVO2-7B actually encoding that structure in its embeddings? UMAP is a non-linear projection; it can manufacture local neighborhoods that do not exist in the high-dimensional space. The tight clusters could be the model learning something real about viral structure, or they could be UMAP collapsing the manifold. I needed a null hypothesis.

The question that drove the rest of the project: does the geometry in this point cloud track phylogeny, or is it a projection artifact? The Mantel test answered it: r = 0.345 within the clade, surviving length control at r = 0.285, both p = 0.0001. The geometry tracks taxonomy in the embeddings themselves, so the clustering was not just a projection artifact.


Field Note

I was never much of a coder. I never got around to learning it, but I've found cautious success letting agents write the code while I worry about the biology. That teaming turns a hypothesis bugging me at one in the morning into an experiment by the next morning; the loop from question to result got a lot tighter.

This project is the clearest example I have. I heard about EVO2 a few months ago — a model that learned to represent raw nucleotides — and a question stuck in my head: how does that structure hold up on out-of-distribution eukaryote-host viruses, and could you train a classifier on top? That any taxonomic signal survives at all, in a model benchmarked at noise for these sequences, is a small reminder of how far these models generalize.

The Honest Read


LucaProt — Hou et al. 2024, Cell — didn't surface in my literature search. I had a hypothesis bugging me and jumped straight into building, only to find out halfway through that a team with GPU clusters had already solved this two years ago. I thought a solved problem was an open avenue.

I'm still glad I was curious enough to take it as far as I did. Across families the correlation is weak and mostly a length confound; the broad version of the question has a better answer elsewhere. But within the sobemo-like clade where these coral viruses live, a Mantel test found real, length-independent structure in EVO2's embeddings — r = 0.345, p = 0.0001 — on a model built to be bad at exactly these sequences. And when I concatenate EVO2 with ESM2, the taxonomy signal jumps to 0.47 while the length confound drops. The nucleotide model is learning something the protein model doesn't have, and vice versa. That's the part I keep coming back to.

— G.A.

From the Archive

What made it possible

A custom EVO2 runtime — StripedHyena with CPU offloading of layers, cross-checked against the official PyTorch runtime — so a genomic foundation model could run on a 12 GB RTX 3060, with the 2700X / 16 GB DDR4 rig doing the heavy lifting.
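
A back-of-the-envelope sketch of why offloading is needed at all. It assumes roughly 7e9 parameters stored as 16-bit weights, which comes to about 13 GiB before any activations, and it treats the layer count and the usable GPU budget as placeholders rather than measured values from this runtime.

```python
def weight_footprint_gib(n_params, bytes_per_param=2):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30

def layers_to_offload(n_layers, footprint_gib, gpu_budget_gib):
    """Fewest equally sized layers to park in CPU RAM so the rest fit the GPU budget."""
    per_layer = footprint_gib / n_layers
    on_gpu = min(n_layers, int(gpu_budget_gib // per_layer))
    return n_layers - on_gpu

# Assumptions, not measurements: 7e9 parameters, 16-bit weights,
# 32 equally sized layers, about 10 GiB of the 12 GB card left for weights.
footprint = weight_footprint_gib(7e9)
offloaded = layers_to_offload(n_layers=32, footprint_gib=footprint, gpu_budget_gib=10.0)
print(f"weights ~{footprint:.1f} GiB; offload at least {offloaded} of 32 layers")
```

Real layers are not equally sized and activations need headroom too, so this is a lower bound on how much has to live in CPU RAM.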