Virology
RNA viruses of coral reefs — mostly dark matter: unclassified, eukaryote-host, waiting to be organized.
Research Folio · No. I · MMXXVI
Mostly virology — by way of agents
I was never much of a coder. I let the agents write the code and I worry about the biology — which makes the loop from question to result a lot tighter.
About the Naturalist
ESM2, EVO2, and what their embeddings genuinely encode — versus what we wish they did.
I don't write the code — agents do. I shape the question and check the biology; the trajectory is the lab notebook.
The reef as a single organism: host, symbiont, microbe, and virus, read together.
Mantel tests and permutation nulls — and reporting the weak r honestly, including when I've been scooped.
Bound Research Folio · Centerpiece
Across families, EVO2's embeddings don't beat a ruler — the apparent taxonomy correlation is mostly a genome-length confound and collapses when you control for it (partial r ≈ +0.13, n.s.). But within the clade where these coral dark-matter viruses live, the geometry tracks taxonomy independent of length: Mantel r = +0.345, partial r = +0.285, both p = 0.0001.
Caveats, up front: the taxonomy panels are genuinely small (43 cross-family / 63 within-clade taxa); EVO2 was separately benchmarked at ~noise for generating eukaryote-host viruses; and LucaProt (Hou et al. 2024, Cell) had already tackled this question with far more compute.
Cross-family panel — 43 taxa across 11 classes, full Riboviria breadth. EVO2-7B block-26 embeddings (4096-dim, cosine) vs a curated NCBI taxonomic-distance tree; Mantel (Spearman, 9999 perms), partial controlling log10(length).
| Test | Mantel r | p | Reading |
|---|---|---|---|
| full genome ~ taxonomy | +0.181 | 0.026 | weak, significant on its own |
| homologous RdRp ~ taxonomy | +0.129 | 0.098 | already marginal |
| length ~ taxonomy | +0.248 | — | length is the strongest single predictor |
| taxonomy ~ EVO2 \| length | +0.133 | 0.104 | collapses to n.s. once length is controlled |
A model that understood phylogeny should beat a ruler. Across families, it doesn't.
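For readers who want the mechanics, the Mantel and partial Mantel tests in the table can be sketched as below. This is a minimal illustration under stated assumptions, not the code behind the reported numbers: the function names (`mantel`, `average_ranks`, `residuals`) are hypothetical, the p-value is one-sided, and the partial version permutes only the first matrix while holding the other two fixed.

```python
import numpy as np

def average_ranks(x):
    """Ranks with ties averaged, so Pearson on ranks equals Spearman."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)          # 1-based rank of the last member of each tie group
    first = last - counts + 1
    return ((first + last) / 2.0)[inverse]

def upper(d):
    """Flatten the strict upper triangle of a square distance matrix."""
    return d[np.triu_indices_from(d, k=1)]

def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def residuals(y, z):
    """Residuals of y after a least-squares fit on z with an intercept."""
    design = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ beta

def mantel(dx, dy, dz=None, n_perm=9999, seed=0):
    """Spearman Mantel r between dx and dy; partial on dz when dz is given.

    Returns (r, p), where p is the fraction of arrangements with r >= observed.
    """
    rng = np.random.default_rng(seed)
    y = average_ranks(upper(dy))
    z = average_ranks(upper(dz)) if dz is not None else None
    if z is not None:
        y = residuals(y, z)

    def stat(d):
        x = average_ranks(upper(d))
        if z is not None:
            x = residuals(x, z)
        return pearson(x, y)

    observed = stat(dx)
    hits = 1                          # the observed arrangement counts once
    for _ in range(n_perm):
        p = rng.permutation(dx.shape[0])
        # Permute rows and columns together so the matrix stays a valid distance matrix.
        if stat(dx[np.ix_(p, p)]) >= observed:
            hits += 1
    return observed, hits / (n_perm + 1)
```

Under this counting convention, 9999 permutations give a smallest attainable p of 1/10000 = 0.0001. For the length control, `dz` could be a matrix of absolute differences in log10(length) between taxa, which is an assumption about how that control matrix is built.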
Per-family Mantel r, with each bar tagged by whether that family is plausibly in- or out-of-distribution for EVO2's nucleotide training. The uncomfortable pattern: the signal piles up in the families EVO2 most likely already saw.
Figure 1. Family-level Mantel r, tagged in/out of distribution. The mitochondrial, heavily endogenized Mitoviridae tops the chart while its cytoplasmic sister Narnaviridae flatlines, a sign that the within-clade signal may track training exposure rather than phylogenetic generalization. * marks the coral ASVs' clade. In/out calls are reasoned, not taken from EVO2's training manifest.
Verified within-clade panel — 63 taxa (Solemoviridae, Narnaviridae, Tombusviridae and their sisters), homologous RdRp, identity-gated. This small, careful sample is what the whole claim rests on.
| Test | Value | p | Reading |
|---|---|---|---|
| Mantel: taxonomy ~ EVO2 | r = +0.345 | 0.0001 | real correlation with curated taxonomy |
| partial: taxonomy ~ EVO2 \| length | r = +0.285 | 0.0001 | survives controlling for length; not a ruler |
| family LOO probe (EVO2) | 0.889 (56/63) | 0.0033 | vs 0.238 chance — 3.7× above null |
| family LOO probe (ESM2) | 0.984 (62/63) | — | protein model nails it — EVO2 is surprisingly close |
| genus LOO probe | ≈ 0.73 | — | chance ≈ 0.18; genus-level resolution exists |
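A leave-one-out (LOO) probe of the kind listed in the table can be sketched as follows. The source does not say which classifier or which null produced 0.889 and 0.238, so both choices here are assumptions: a 1-nearest-neighbour vote on cosine distance, and a label-shuffling null. All function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(emb):
    """Pairwise cosine distance between rows of an (n_taxa, dim) matrix."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def loo_1nn_accuracy(dist, labels):
    """Leave-one-out accuracy: each taxon takes the label of its nearest other taxon."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)       # a taxon may not vote for itself
    nearest = d.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))

def label_shuffle_null(dist, labels, n_perm=999, seed=0):
    """Compare observed LOO accuracy with accuracies under shuffled labels.

    Returns (observed accuracy, mean null accuracy, one-sided p-value).
    """
    rng = np.random.default_rng(seed)
    observed = loo_1nn_accuracy(dist, labels)
    null = np.array([
        loo_1nn_accuracy(dist, rng.permutation(labels)) for _ in range(n_perm)
    ])
    p = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, float(null.mean()), float(p)
```

`labels` is expected to be a NumPy array of family (or genus) labels in the same row order as the embeddings. Shuffling labels keeps family sizes intact, so the null accounts for unbalanced panels, which matters when a few families dominate a 63-taxon sample.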
It's not phylogeny alone. EVO2's cosine distance correlates with tetranucleotide composition (r ≈ 0.43) and length (r ≈ 0.41) — and so does ESM2's. This is a mean-pooling artifact across both models. But controlling for composition, a deeper residual remains: r ≈ 0.27, p = 1e-4.
| Confounding layer | EVO2 cosine r | ESM2 cosine r | What it means |
|---|---|---|---|
| ~ sequence length | ≈ 0.41 | ≈ 0.41 | mean-pooling artifact, not nucleotide-specific |
| ~ tetranucleotide composition | ≈ 0.43 | similar | composition tracks taxonomy too — confound or signal? |
| composition-controlled residual | ≈ 0.27 | — | deeper than composition — what's left after the confounds |
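The composition confound in the table rests on a tetranucleotide profile per sequence. A minimal sketch of how such a profile and a composition distance matrix might be computed is below. The Euclidean distance, the handling of ambiguous bases, and the names `tetranucleotide_freqs` and `composition_distance` are assumptions for illustration; the source does not specify them.

```python
from itertools import product
import numpy as np

# All 256 tetranucleotides in a fixed order; RNA input is mapped to the DNA alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def tetranucleotide_freqs(seq):
    """Relative frequency of each 4-mer over all overlapping windows."""
    seq = seq.upper().replace("U", "T")
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])   # windows containing N or other codes are skipped
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts

def composition_distance(seqs):
    """Pairwise Euclidean distance between tetranucleotide profiles."""
    f = np.array([tetranucleotide_freqs(s) for s in seqs])
    diff = f[:, None, :] - f[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

A matrix built this way could serve as the control matrix in a partial Mantel test, which is one way to arrive at a composition-controlled residual.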
Taxonomy ruler robustness: rank metric r = 0.345 · Kimura-corrected r = 0.317 · patristic r = 0.181. The signal stays positive across distance metrics, though it is weakest under patristic distance.
When you concatenate the nucleotide model with the protein model, the taxonomy signal jumps and the length confound drops: the two models are complementary.
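One way to concatenate the two models' embeddings is sketched below. The source does not describe how the vectors were combined, so the per-model L2 normalisation is an assumption, chosen so that neither model dominates through dimensionality or scale. Function names are hypothetical.

```python
import numpy as np

def l2_normalise(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def concat_embeddings(nuc, prot):
    """Row-wise concatenation after L2-normalising each model's block.

    nuc:  (n_taxa, d_nuc) nucleotide-model embeddings
    prot: (n_taxa, d_prot) protein-model embeddings, same row order
    """
    if nuc.shape[0] != prot.shape[0]:
        raise ValueError("both matrices need one row per taxon, in the same order")
    return np.hstack([l2_normalise(nuc), l2_normalise(prot)])

def cosine_distance_matrix(emb):
    """Pairwise cosine distance between rows."""
    unit = l2_normalise(emb)
    return 1.0 - unit @ unit.T
```

With this normalisation, the cosine distance of the concatenated vectors is exactly the average of the two models' cosine distances, so the joint matrix weights the nucleotide and protein views equally. Other weightings are possible and would give different numbers.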
The 801 coral RdRp ASVs are "dark" because they don't match curated references. But they're not unknowable — DIAMOND blastp against the full RefSeq-viral database tells a different story.
Even ultra-sensitive DIAMOND: no confident hits against the curated reference panel. These proteins are genuinely absent from the taxonomy backbone.
Against the full RefSeq-viral database, every ASV hits — best matches are Beihai sobemo-like virus, ~43% identity, e ≈ 1e-36.
The 471 bp ASVs are truncated RdRp fragments — the full-length protein exists in RefSeq but the amplicon captures only a sliver. The "darkness" is a database-boundary problem, not a biological one.
Framed and spotlit, not hidden. The dead ends are the part that proves the method.
LucaProt (Hou et al. 2024, Cell) didn't surface in my literature search. I mistook a solved problem for an open avenue, one where teams with GPU clusters had outclassed me years earlier.
It looked like EVO2 had learned phylogeny across all of Riboviria. Controlling for genome length collapsed it to non-significant. It was mostly a ruler.
EVO2-7B wouldn't load on the 3060.
An Intermediate Step
Out of curiosity, I did a 3D UMAP projection of the EVO2 embeddings and spun through the point cloud. The pattern was immediate: dark matter clusters near known viruses — sometimes within the same family, sometimes threading between families. Taxonomy-only clusters, dark-matter-only clusters, and dark matter co-clustering with taxonomically defined sequences. I thought that alone was significant.
But then the question hit: is this a UMAP artifact, or is EVO2-7B actually representing this structure in its embeddings? UMAP is a non-linear projection, and it can manufacture local neighborhoods that do not exist in the high-dimensional space. The tight clusters could be the model learning something real about viral structure, or they could be UMAP collapsing the manifold. I needed a null hypothesis.
The question that drove the rest of the project: does the geometry in this point cloud track phylogeny, or is it a projection artifact? The Mantel test, run on embedding distances rather than UMAP coordinates, answered it: r = 0.345 within the clade, surviving length control at r = 0.285, both p = 0.0001. Within that clade, the structure was real.
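A complementary check on the projection itself, separate from the Mantel test and not something the folio reports running, is to ask how many of each point's nearest neighbours in the 3D layout are also its neighbours in the original cosine space. The sketch below takes any low-dimensional coordinates as input, so it does not depend on a particular UMAP implementation. The name `neighbour_overlap` and the default `k` are hypothetical.

```python
import numpy as np

def knn_indices(dist, k):
    """Indices of the k nearest other points for each row of a distance matrix."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)       # exclude each point from its own neighbour list
    return np.argsort(d, axis=1)[:, :k]

def neighbour_overlap(high, low, k=10):
    """Mean fraction of k nearest neighbours shared between two spaces.

    high: (n, d) embeddings, compared by cosine distance
    low:  (n, 2 or 3) projected coordinates, compared by Euclidean distance
    """
    unit = high / np.linalg.norm(high, axis=1, keepdims=True)
    d_high = 1.0 - unit @ unit.T
    d_low = np.linalg.norm(low[:, None, :] - low[None, :, :], axis=-1)
    nn_high = knn_indices(d_high, k)
    nn_low = knn_indices(d_low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k
```

An overlap near 1 means the projected neighbourhoods exist in the embedding space; an overlap near k / (n - 1) is what unrelated coordinates would give, and would suggest the visible clusters are mostly manufactured by the projection.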
Field Note
I was never much of a coder. I never got around to learning it, but I've found cautious success letting agents write the code while I worry about the biology. That teaming turns a hypothesis bugging me at one in the morning into an experiment by the next morning; the loop from question to result got a lot tighter.
This project is the clearest example I have. I heard about EVO2 a few months ago — a model that learned to represent raw nucleotides — and a question stuck in my head: how does that structure hold up on out-of-distribution eukaryote-host viruses, and could you train a classifier on top? That any taxonomic signal survives at all, in a model benchmarked at noise for these sequences, is a small reminder of how far these models generalize.
The Honest Read
LucaProt (Hou et al. 2024, Cell) didn't surface in my literature search. I had a hypothesis bugging me and jumped straight into building, only to find out halfway through that a team with GPU clusters had already solved this two years earlier. I had mistaken a solved problem for an open avenue.
I'm still glad I was curious enough to take it as far as I did. Across families the correlation is weak and mostly a length confound; the broad version of the question has a better answer elsewhere. But within the sobemo-like clade where these coral viruses live, a Mantel test found real, length-independent structure in EVO2's embeddings — r = 0.345, p = 0.0001 — on a model built to be bad at exactly these sequences. And when I concatenate EVO2 with ESM2, the taxonomy signal jumps to 0.47 while the length confound drops. The nucleotide model is learning something the protein model doesn't have, and vice versa. That's the part I keep coming back to.
— G.A.
From the Archive
A custom EVO2 runtime — StripedHyena with CPU offloading of layers, cross-checked against the official PyTorch runtime — so a genomic foundation model could run on a 12 GB RTX 3060, with the 2700X / 16 GB DDR4 rig doing the heavy lifting.