Gonzalo Benegas
@gonzalobenegas.bsky.social
260 followers 810 following 17 posts
Comp Bio Postdoc @ UC Berkeley https://gonzalobenegas.github.io/
Reposted by Gonzalo Benegas
yun-s-song.bsky.social
We are excited to share GPN-Star, a cost-effective, biologically grounded genomic language modeling framework that achieves state-of-the-art performance across a wide range of variant effect prediction tasks relevant to human genetics.
www.biorxiv.org/content/10.1...
(1/n)
Reposted by Gonzalo Benegas
joanocha.bsky.social
I am thrilled to announce that in January 2026 I will be starting my own lab at NYU Biology! Soon enough I will be recruiting postdocs and students! Please reach out if you are interested with a CV and description of your research interests, or if you know of people who could be interested! 🧬🗽 🦊
Reposted by Gonzalo Benegas
yun-s-song.bsky.social
How can one efficiently simulate phylodynamics for populations with billions of individuals, as is typical in many applications, e.g., viral evolution and cancer genomics? In this work with M. Celentano, @wsdewitt.github.io , & S. Prillo, we provide a solution. doi.org/10.1073/pnas...
1/n
Reposted by Gonzalo Benegas
yun-s-song.bsky.social
Thrilled to see my digital art on the cover of Trends Genet. The two binary strings represent reverse-complementary DNA sequences (00=A, 01=C, 10=G, 11=T) and the connecting rectangles represent “embeddings” learned by DNA language models. Pls check out our article as well: doi.org/10.1016/j.ti...
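The 2-bit encoding in the post above (00=A, 01=C, 10=G, 11=T) has a convenient property: complementary bases have bitwise-complementary codes, so reverse-complementing a sequence amounts to reversing the order of the 2-bit codes and flipping every bit. A minimal sketch of that idea follows; the function names are illustrative and are not taken from the article or the cover art.

```python
# Encoding stated in the post: 00=A, 01=C, 10=G, 11=T.
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(seq):
    """Encode a DNA string as a binary string, two bits per base."""
    return "".join(ENCODING[base] for base in seq)


def reverse_complement(seq):
    """Reverse-complement a DNA string at the base level."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def reverse_complement_bits(bits):
    """Reverse-complement directly on the binary string.

    Reverse the order of the 2-bit codes, then flip every bit.
    This works because A (00) pairs with T (11) and C (01) pairs with G (10).
    """
    pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    reversed_bits = "".join(reversed(pairs))
    return "".join("1" if b == "0" else "0" for b in reversed_bits)


seq = "AACG"
bits = encode(seq)
# Both routes give the same binary string.
same = encode(reverse_complement(seq)) == reverse_complement_bits(bits)
```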
Reposted by Gonzalo Benegas
yun-s-song.bsky.social
In our updated TraitGym preprint (w/ @gonzalobenegas.bsky.social & Gökcen Eraslan), we evaluate Evo 2 on regulatory variants associated with human traits. We see marked performance gains with scale on Mendelian traits, although still a bit behind alignment-based methods.
doi.org/10.1101/2025...
1/n
gonzalobenegas.bsky.social
Thank you for contributing to bioicons! Sorry I forgot to add this to the acknowledgements; I will in the final version!
gonzalobenegas.bsky.social
Scaling is probably part of the solution, but data curation might be the major bottleneck. The vast majority of bases in mammalian genomes lack evolutionary constraint, which is precisely the signal leveraged by self-supervision.
gonzalobenegas.bsky.social
Alignment-free DNA language models are not yet competitive. The best among them, our GPN-Promoter and SpeciesLM from @gagneurlab.bsky.social , are not the largest in number of parameters or context. Their key feature is having been trained only on functional regions of the genome.
gonzalobenegas.bsky.social
Conservation-aware CADD and GPN-MSA do better on Mendelian trait variants, which are expected to be under strong purifying selection. On complex trait variants, especially for non-disease traits, functional-genomics models Enformer and Borzoi tend to do better. However, ensembling helps:
gonzalobenegas.bsky.social
We evaluate models zero-shot (unsupervised) and with linear probing (logistic regression on top of extracted features):
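Linear probing, as described in the post above, means keeping the model frozen and fitting only a logistic regression on features extracted from it. The following is a minimal sketch using plain gradient descent on made-up data. The feature values, labels, learning rate, and function names are all hypothetical, and this is not the TraitGym pipeline.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_linear_probe(features, labels, lr=0.5, steps=2000):
    """Fit logistic regression by batch gradient descent on frozen features."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability that a variant is causal under the fitted probe."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Hypothetical features extracted from a frozen model, one row per variant.
features = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
# Hypothetical labels: 1 = putative causal variant, 0 = matched control.
labels = [0, 0, 1, 1]

weights, bias = fit_linear_probe(features, labels)
predictions = [int(predict_proba(weights, bias, x) > 0.5) for x in features]
```

In the zero-shot setting no such fitting step exists: a score produced directly by the model is used to rank variants.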
gonzalobenegas.bsky.social
We evaluate a wide range of models with up to 7B parameters and 500K context size. Do these numbers matter? 🤔
gonzalobenegas.bsky.social
We collect putative causal variants from OMIM and UKBB with carefully matched controls.
gonzalobenegas.bsky.social
Can DNA sequence models predict mutations affecting human traits?

We introduce TraitGym, a curated benchmark of causal regulatory variants for 113 Mendelian & 83 complex traits, and evaluate functional genomics and DNA language models. Joint work w/ Gökcen Eraslan and @yun-s-song.bsky.social 🧵👇
Reposted by Gonzalo Benegas
biorxiv-genetic.bsky.social
Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics https://www.biorxiv.org/content/10.1101/2025.02.11.637758v1
gonzalobenegas.bsky.social
I still believe in alignment-free gLMs with better data curation and loss functions. I've been seeing advances, but it's still tough.
gonzalobenegas.bsky.social
*An exception is alignment-based gLMs, which do improve (non-trivially) over conservation scores.
gonzalobenegas.bsky.social
A simple bar is: do you surpass conservation scores in identifying functional mutations? This bar was easily passed by pLMs and plant gLMs but not yet by human gLMs* even after 5 years.
Reposted by Gonzalo Benegas
yun-s-song.bsky.social
Coincidentally, another article from my lab on DNA language models got published on the same day as GPN-MSA. It's freely available for 50 days from this link:

authors.elsevier.com/a/1kNCscQbJB...
Genomic language models: opportunities and challenges

Please share with your colleagues.