The pace of genomic sequencing has outstripped our ability to understand what all those letters actually mean. We’ve got hundreds of thousands of human genomes and countless more from other species, but biology still feels like reading a manuscript with most of the punctuation missing. The genome is not a neatly spaced text; it’s a palimpsest, a layered story written over millions of years of evolution where signals hide in long, context-rich passages. In this landscape, a new kind of AI is stepping in to translate the language of life with a more honest gaze at variation.
Researchers at IBM Research, with teams in Yorktown Heights and Haifa, have built a pair of DNA foundation models that explicitly learn from the types of sequence variation that actually drive biology. Their SNP-aware models, BMFM-DNA-REF and BMFM-DNA-SNP, tackle a problem that most previous models, trained on a fixed reference genome, miss: what happens when a single nucleotide changes, or when insertions and deletions pop up across the population? The work, led by Hongyang Li and Sanjoy Dey with colleagues, shows that teaching a model about real-world DNA variation can improve its grasp of regulatory function across tasks from promoter detection to splicing. It also marks a bold move toward codifying how genetic diversity shapes biology using modern AI tooling.
In a field where it’s easy to overhype a new model, this effort sticks to a crisp thesis: if a model learns not just the reference sequence but the actual variant landscape that humans carry, it becomes better at predicting how those variants influence biology. The team frames the project not as just another genome-language model but as a SNP-aware DNA foundation model. They pre-train two versions of the model, one on the standard reference genome and one on a variant-encoded genome that explicitly represents SNPs and other variations. The result is a computational probe that can, in silico, peek at how natural genetic variation may shift regulatory logic across long stretches of DNA. It’s a translation tool for a genome that is, in many places, a moving target rather than a fixed script.
These efforts come from IBM Research, a reminder that big science isn’t only about the lab with glass walls. The study credits the Yorktown Heights and Haifa groups and foregrounds two deployable models that mirror a familiar AI playbook: build a strong reference system, then add a variant-aware companion to see how much the extra knowledge helps. And it’s not just a technical curiosity. The authors explicitly connect their work to tasks that matter for human biology: predicting where promoters sit, where transcription factors bind, how splicing might unfold, and even linking SNPs to diseases in GWAS catalogs. They also publish their software and models openly, inviting the broader community to test, critique, and extend what a SNP-aware model can do.
What makes a SNP-aware foundation model
At the core is a transformer engine trained to read DNA as a language, but with a twist that respects biology’s peculiar grammar. The team uses ModernBERT, a modern bidirectional encoder designed for long DNA sequences and efficient training, and trains two versions of BMFM-DNA. The first, BMFM-DNA-REF, is trained on the reference genome and its reverse complements: think of it as a traditional textbook approach in which the model learns from a single, canonical script. The second, BMFM-DNA-SNP, goes further: it samples from a database of real human variation to create variant-encoded inputs. In practice, this means the model never sees only a single path through the genome; it also encounters the branches where life has diverged across people and time.
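To make the "reference plus reverse complements" idea concrete, here is a minimal sketch of that kind of corpus augmentation in Python. The function names are illustrative, not taken from the BMFM-DNA codebase.

```python
# Minimal sketch of reverse-complement augmentation for a reference corpus.
# Names are illustrative; the actual BMFM-DNA pipeline may differ.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def augment(sequences):
    """Yield each reference sequence and its reverse complement."""
    for seq in sequences:
        yield seq
        yield reverse_complement(seq)

print(reverse_complement("GATTACA"))  # -> TGTAATC
```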
To prepare data, the researchers pulled 20 million genetic variants from dbSNP and built a genome-wide variation frequency matrix. For each position with variants, the matrix encodes the probabilities of alternative nucleotides, insertions, or deletions based on population frequencies. When generating variant-encoded sequences, they sample from that matrix and map the results to a compact representation. Rather than spelling out both alleles at every variable site, they assign a single special symbol drawn from a set of 121 characters that includes a nod to Li Sao, a classical Chinese poem. In effect, a single token in the variant-encoded model may stand in for multiple possible nucleotides, insertions, or deletions at a given position. This clever encoding keeps the input length manageable while preserving the biological ambiguity that comes with real-world variation.
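As a rough sketch, the sampling step could look like the following Python. The per-position dictionary of allele frequencies and the toy values are stand-ins for the paper’s actual genome-wide variation frequency matrix; the sampled alternative alleles would subsequently be collapsed into single variant symbols.

```python
import random

# Hypothetical slice of a variation frequency matrix:
# position -> {allele or indel: population frequency}, including the
# reference allele, so each entry sums to 1. Values here are invented.
freq_matrix = {
    1042: {"A": 0.92, "G": 0.08},   # common SNP (ref A, alt G)
    1057: {"C": 0.97, "-": 0.03},   # rare single-base deletion
}

def sample_window(ref: str, start: int, matrix: dict) -> str:
    """Sample one variant-encoded pass over a reference window."""
    out = []
    for offset, ref_base in enumerate(ref):
        alleles = matrix.get(start + offset)
        if alleles is None:
            out.append(ref_base)  # invariant site: keep the reference base
        else:
            choices, weights = zip(*alleles.items())
            out.append(random.choices(choices, weights=weights, k=1)[0])
    return "".join(out)

print(sample_window("ACGTACGTACGTACGTACGT", start=1040, matrix=freq_matrix))
```

Each call yields a different plausible walk through the population’s sequence space, which is exactly the kind of branching the SNP model is meant to absorb.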
On the representation side, they rely on a Byte Pair Encoding (BPE) style tokenizer to build a 4,096-token vocabulary, with separate tokenizers for the reference genome and the variant-encoded genome. This dual-tokenizer setup is more than a bookkeeping detail. It produces two compatible but distinct views of the genome’s language, enabling the model to reason about when a variation matters and how it shifts the surrounding syntax. The authors’ analyses reveal that while most tokens are shared between the two vocabularies, roughly 6.8% of tokens in the variant-encoded vocabulary carry variants, and these tokens tend to be shorter. That pattern reflects the genome’s underlying redundancy: a small number of flexible tokens can encode a surprising amount of contextual variation when used in long sequences. It also hints at how much of biology can ride on a handful of high-leverage variants, if you know where to look.
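The dual-tokenizer setup can be approximated with the Hugging Face tokenizers library. This sketch assumes character-level BPE over nucleotide text (and, for the second tokenizer, variant-symbol text); only the 4,096 vocabulary size comes from the paper.

```python
from tokenizers import Tokenizer, models, trainers

def train_dna_bpe(corpus, vocab_size: int = 4096) -> Tokenizer:
    """Train a BPE tokenizer over DNA strings (reference or variant-encoded)."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    )
    # No whitespace pre-tokenizer: DNA is one unbroken run of symbols.
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

# Two tokenizers, two views of the genome's language (toy corpora here;
# the Greek letters stand in for the paper's 121 variant symbols).
ref_tok = train_dna_bpe(["ACGTACGGTTACGCATTACG"] * 200)
snp_tok = train_dna_bpe(["ACGTΔCGGTTACGCβTTACG"] * 200)
print(ref_tok.encode("ACGTACGT").tokens)
```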
The architecture itself is a modern DNA workhorse. BMFM-DNA-REF uses a 22-layer transformer with a hidden size of 768 and a 2,048-token context window, combining global and local attention patterns to handle long sequences efficiently. The SNP version shares the same backbone but with a training regime designed to fuse sequence variation into the learning process. In both cases, training uses a masked language modeling objective in which the model learns to predict masked tokens from their context. The pre-training is substantial but not absurd: 150,000 steps over a million-plus sequences, running on high-end GPUs for about 10 days. The result is a pair of foundation models that stand on the same scaffold but differ in how they treat genetic variation.
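Since ModernBERT ships in the Hugging Face transformers library, the backbone can be sketched roughly as below. The layer count, hidden size, context length, and vocabulary size come from the paper; every other hyperparameter falls back to library defaults and should be read as an approximation, not the authors’ exact configuration.

```python
from transformers import ModernBertConfig, ModernBertForMaskedLM

# Approximate BMFM-DNA-REF backbone: 22 layers, hidden size 768,
# 2,048-token context, 4,096-token DNA vocabulary (figures from the paper).
config = ModernBertConfig(
    vocab_size=4096,
    hidden_size=768,
    num_hidden_layers=22,
    max_position_embeddings=2048,
)
model = ModernBertForMaskedLM(config)  # masked-token prediction head
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")

# For pre-training, transformers' DataCollatorForLanguageModeling with
# mlm=True would mask a fraction of tokens for the model to reconstruct.
```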
One of the more striking technical moves is how the SNP version handles variant encoding. Rather than adding a second character to represent each allele, the team maps the two alleles, and any associated insertions or deletions, to a single variant token. This compact approach keeps the model lean and lets it absorb a wide swath of variant possibilities without exploding the input length. The end effect is a system that can, in principle, reason about how a site that carries a SNP in the population might shift the local regulatory logic, rather than treating that site as a fixed anchor in space and time.
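One way to picture this compact encoding is as an extended ambiguity alphabet, in the spirit of the standard IUPAC codes (where, for example, R means A-or-G), stretched to cover deletions as well. The codebook below is invented for illustration; the paper’s actual 121-symbol assignment is not reproduced here.

```python
from itertools import combinations

# Illustrative codebook: each site state (plain base, biallelic SNP, or
# base-versus-deletion) maps to exactly one symbol, mimicking the paper's
# one-token-per-variant-site idea. The symbols are arbitrary stand-ins.
BASES = "ACGT"
states = [frozenset(b) for b in BASES]                    # invariant sites
states += [frozenset(p) for p in combinations(BASES, 2)]  # biallelic SNPs
states += [frozenset({b, "-"}) for b in BASES]            # base vs deletion

symbols = list(BASES) + [chr(0x0391 + i) for i in range(len(states) - 4)]
CODEBOOK = dict(zip(states, symbols))

def encode_site(alleles: set) -> str:
    """Collapse the alleles observed at one position into a single symbol."""
    return CODEBOOK[frozenset(alleles)]

print(encode_site({"A"}))       # invariant site -> 'A'
print(encode_site({"A", "G"}))  # SNP site -> one ambiguity symbol
print(encode_site({"C", "-"}))  # deletion site -> another single symbol
```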
Why encoding SNPs matters
The proof is in the experiments. The authors fine-tune both BMFM-DNA-REF and BMFM-DNA-SNP on six tasks well aligned with what genomic models should do: promoter detection, core promoter detection, transcription factor binding site prediction, splice site prediction, a large-scale promoter activity readout (massively parallel reporter assays), and a disease association task drawn from GWAS catalogs. They pit their models against a strong baseline, DNABERT-2, and a no-pretraining option. Across most tasks, the SNP-aware model holds its own against the multi-species DNABERT-2 baseline and, most importantly, edges out the reference model on several core regulatory tasks.
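Fine-tuning on a task like promoter detection follows the standard sequence-classification recipe. The sketch below uses the Hugging Face Trainer; the checkpoint path is a placeholder for the released weights, and `train_ds`/`eval_ds` stand for tokenized promoter/non-promoter datasets that a reader would have to supply.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/bmfm-dna-snp"  # placeholder, not the published model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2  # promoter vs. non-promoter
)

def compute_metrics(eval_pred):
    """F1, the headline metric in the paper's benchmark tables."""
    logits, labels = eval_pred
    return {"f1": f1_score(labels, np.argmax(logits, axis=-1))}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="promoter-ft", num_train_epochs=3),
    train_dataset=train_ds,  # assumed: tokenized training examples
    eval_dataset=eval_ds,    # assumed: tokenized held-out examples
    compute_metrics=compute_metrics,
)
trainer.train()
```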
In a clean summary, BMFM-DNA-SNP and BMFM-DNA-REF show comparable performance to DNABERT-2 on many benchmarks, despite being trained on a dataset focused on the human genome rather than a broad zoo of species. The real upshot is that the SNP-aware variant tends to do better on tasks that hinge on regulatory logic and variant effects. In promoter and core promoter detection, TF binding site prediction, and splicing, the SNP-aware model often lands higher F1 scores than its reference-only sibling. The SNP-to-disease task, while still competitive, shows little improvement over the reference model, suggesting that predicting disease associations may rely on different signals or require more data and context than local sequence features alone.
The team also took a deeper look at variant encoding strategies by tweaking how they generate negative samples for the promoter task. They found that forming negatives by inserting or reshuffling SNPs in a way that preserves the SNP distribution (their Class 3 approach) can yield meaningful gains, sometimes surpassing the plain reference-based baseline in the variant-encoded setting. This isn’t just a toy result: it shows that the way we simulate biology in training tasks shapes what the model learns about variant effects, and it doubles as a cautionary note that the construction of tasks and negative samples can subtly tilt what an AI learns about the genome’s language.
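The paper’s description of the Class 3 negatives is high-level, but one plausible reading is: build negative examples whose variant-token content matches the positives, so the classifier cannot win simply by counting SNP symbols. A loose sketch of that idea, not the authors’ exact procedure:

```python
import random

VARIANT_SYMBOLS = set("ΔβΩ")  # stand-ins for the 121-character variant alphabet

def make_negative(positive: str) -> str:
    """Build a negative by shuffling background bases while re-inserting the
    same variant tokens at random positions, preserving the SNP count.
    A loose interpretation of distribution-preserving negatives, not the
    paper's exact Class 3 recipe."""
    background = [c for c in positive if c not in VARIANT_SYMBOLS]
    variants = [c for c in positive if c in VARIANT_SYMBOLS]
    random.shuffle(background)
    for v in variants:
        background.insert(random.randrange(len(background) + 1), v)
    return "".join(background)

pos = "ACGTΔACGGTTβACGT"
neg = make_negative(pos)
assert sum(c in VARIANT_SYMBOLS for c in neg) == 2  # variant count preserved
print(neg)
```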
Beyond the numbers, the authors’ decision to release the models and code is its own kind of proof of concept. They want the research community to push on open datasets, test across longer contexts, and extend the work with epigenomic and multiomic information. The release makes clear they’re not finished with this language; they’re inviting others to help write the next chapters. It’s a sign of a field moving toward shared benchmarks and collective scrutiny, which in science can be as important as any single model improvement.
What this means for biology and medicine
If you squint at the broader arc, the SNP-aware approach represents a shift in how we think about predictive biology. The genome is not a static instruction set; it is a living document whose meaning shifts with context, cell type, developmental stage, and the surrounding regulatory signals. By explicitly encoding sequence variation, these models start to model that fluidity. They attempt to connect a dot in the genetic alphabet with a potential shift in regulatory output, which is where much of human disease risk is thought to live. In practical terms, SNP-aware foundation models could help annotate noncoding variants, prioritize which SNPs to study in the lab, and suggest hypotheses about how a variant might alter a promoter or an enhancer to tweak gene expression.
The implications for personalized medicine are tantalizing, albeit still distant. If an AI can robustly interpret how a given SNP sits inside a promoter or a splicing motif, clinicians and researchers could gain a tool for triaging variants found in patients. This could, in time, complement experimental assays and GWAS results to sharpen our understanding of an individual’s regulatory risk landscape. But the road from model prediction to clinical decision is long and careful. The authors themselves note that while the SNP-aware models push the envelope on several regulatory tasks, comprehensive validation, especially for disease prediction, will require broader datasets, longer sequence contexts, and integrations with epigenomic data and three-dimensional genome architecture.
There are also methodological takeaways. The drop-in use of a variant frequency matrix, the mapping of variants to compact tokens, and the creation of a variant-aware tokenizer all suggest a blueprint for how to inject real-world variation into other foundation models. It isn’t just about adding more data; it’s about changing the way the model conceptualizes a site in a genome by including what can happen there, not just what does happen in a reference frame. The result is a more flexible, arguably more honest representation of biological sequence space, one that recognizes that life’s grammar often hinges on a handful of flexible tokens rather than a fixed sentence structure.
From a broader science policy and community angle, the work signals a healthy shift toward openness and collaboration. The authors stress that the benchmarks are not yet exhaustive and call for community contributions to expand, test, and refine datasets, especially those that probe longer contexts and epigenomic layers. It’s a reminder that biology is not a solved puzzle but a moving target, and the most powerful tools will come from shared effort as much as from clever algorithms.
In the end, BMFM-DNA and its SNP-aware cousin offer a new lens on a familiar problem: how to read the genome in a way that respects human diversity. The work shows that including natural variation in training data can strengthen a model’s intuition about regulatory logic. It doesn’t declare victory over biology, but it does promise a more nuanced, context-aware AI that can help scientists map the causal paths from sequence to function. If you think of modern AI as a toolbox, this project adds a new implement, a variant-aware chisel, that can carve sharper, more context-sensitive inferences from the genome’s long, complicated text.
The study is a collaboration anchored in IBM Research, with authors including Hongyang Li and Sanjoy Dey as equal contributors and Bharath Dandala and Pablo Meyer providing leadership and correspondence. They are explicit about the work being a step toward a common trajectory: to extend foundation models so they can ingest not just reference sequences but the rich tapestry of human genetic variation that makes each genome unique. And they have opened the doors: the BMFM-DNA models and the reproduction code are available for the research community to test, critique, and improve. If you want to see what happens when a model learns to read life with its natural variability spoken aloud, this is a story worth following—and perhaps a future toolkit for discovering where biology still hides its secrets.
Lead institutions and authors: This work comes from IBM Research with teams in Yorktown Heights, New York, and Haifa, Israel. The lead authors are Hongyang Li and Sanjoy Dey, who contributed equally. The study also acknowledges key roles from Bum Chul Kwon, Michael Danziger, Michal Rosen-Zvi, Jianying Hu, James Kozloski, Ching-Huei Tsou, Bharath Dandala, and Pablo Meyer, with Dandala and Meyer listed as correspondence contacts. The project situates itself in the growing frontier of genomic foundation models that seek to fuse the raw grammar of DNA with the real world of genetic variation, a fusion that could accelerate biology by teaching machines to read not just the canonical script but the living edition that millions of people carry in their genomes.