Yun S. Song
@yun-s-song.bsky.social
740 followers 120 following 59 posts
Professor of EECS and Statistics at UC Berkeley. Mathematical and computational biologist.
yun-s-song.bsky.social
Not yet, but we certainly plan to generate bp-resolution, genome-wide scores for all six species studied in the paper and make them publicly available. For now, we have predictions for the ~10M variants used in the S-LDSC analysis in humans.
Reposted by Yun S. Song
anshulkundaje.bsky.social
This is truly an incredible breakthrough IMO. Really exemplifies what you get when deep domain expertise (popgen/evolution/disease genetics in this case) fuses with cleverly crafted ML. What you get are sleek, well-thought-out architectures that absolutely destroy the behemoths. Wow!! 1/
yun-s-song.bsky.social
All in all, we believe that GPN-Star offers a scalable & flexible approach for training effective gLMs.

This work was led by my talented students @czye.bsky.social and @gonzalobenegas.bsky.social, with contributions from other lab members, @peterdfields.bsky.social at JAX, & B. Clarke at DKFZ.
(n/n)
yun-s-song.bsky.social
Upon publication, we will release base-resolution predictions for the human genome and the five model organisms.
Code to train the model, run inference, and reproduce the analyses is available on GitHub (github.com/songlab-cal/...) and Hugging Face (tinyurl.com/nhhcppvm); see the sketch below.
(9/n)
GitHub - songlab-cal/gpn: Genomic Pre-trained Network
Genomic Pre-trained Network. Contribute to songlab-cal/gpn development by creating an account on GitHub.
github.com
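As a rough, hypothetical sketch of what gLM-based variant scoring typically looks like with a masked language model on Hugging Face (the checkpoint ID and tokenizer conventions below are placeholders, not taken from the repo; see the GitHub README for actual GPN-Star usage):

```python
# Illustrative sketch only: the checkpoint ID is a placeholder, and the
# tokenizer details (one token per nucleotide, lowercase bases, no special
# tokens) are assumptions; consult songlab-cal/gpn for real usage.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "songlab/gpn-star-example"  # hypothetical model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
model.eval()

def variant_llr(seq: str, pos: int, ref: str, alt: str) -> float:
    """Masked-marginal score: log P(alt) - log P(ref) at the variant site."""
    assert seq[pos].upper() == ref.upper()
    ids = tokenizer(seq, return_tensors="pt")["input_ids"][0]
    # Assumes one token per nucleotide and no prepended special tokens;
    # adjust the offset if the tokenizer adds e.g. a CLS token.
    ids[pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(input_ids=ids.unsqueeze(0)).logits[0, pos]
    logp = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref.lower())
    alt_id = tokenizer.convert_tokens_to_ids(alt.lower())
    return (logp[alt_id] - logp[ref_id]).item()
```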
yun-s-song.bsky.social
To show that GPN-Star is a robust and generalizable framework that can advance biology beyond human genetics, we apply it to train gLMs for five well-studied model organisms and demonstrate their effectiveness in assessing variant effects in these species.
(8/n)
yun-s-song.bsky.social
In addition, GPN-Star exhibits meaningful nucleotide dependencies that align with known functional interactions, indicating its potential to help elucidate genomic syntax. This represents a notable advance over traditional conservation scores.
(7/n)
yun-s-song.bsky.social
By training GPN-Star on vertebrate, mammal, and primate alignments, we reveal task-dependent advantages of modeling deeper versus more recent evolution. These findings offer new biological insights and practical guidance for developing future gLMs and evolutionary models.
(6/n)
yun-s-song.bsky.social
GPN-Star achieves unprecedented SNP-heritability enrichments across more than 100 human complex traits. Moreover, we devise a simple approach to incorporate tissue specificity into the model's predictions and show that it further improves heritability enrichment.
(5/n)
yun-s-song.bsky.social
We compare GPN-Star with several models, including the recent AlphaGenome and Evo2 models (up to 1Mb context and 40B parameters), and observe that GPN-Star consistently ranks at the top across a wide range of human variant effect prediction tasks.
(4/n)
yun-s-song.bsky.social
We also introduce a calibration method that, for the first time, removes the confounding effect of mutation-rate variation from gLM predictions. This improves downstream performance and enables model scores to be interpreted more directly as estimates of selective constraint.
(3/n)
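The preprint describes the actual calibration procedure; purely to illustrate the general idea (this is not the paper's method), one could residualize raw gLM scores against a per-site mutation-rate covariate so that the adjusted score reflects selection rather than mutability:

```python
import numpy as np

# Toy illustration of mutation-rate calibration (not the paper's method):
# regress raw gLM variant scores on a mutation-rate covariate and keep
# the residual as a mutation-rate-adjusted constraint score.
rng = np.random.default_rng(0)
mu = rng.uniform(0.5, 2.0, size=10_000)                   # hypothetical per-site mutation rates
raw = -1.5 * np.log(mu) + rng.normal(0.0, 1.0, mu.shape)  # raw scores confounded by mu

X = np.column_stack([np.ones_like(mu), np.log(mu)])  # intercept + log mutation rate
beta, *_ = np.linalg.lstsq(X, raw, rcond=None)
calibrated = raw - X @ beta                          # residual after removing the mu trend

print(np.corrcoef(np.log(mu), raw)[0, 1])         # strongly confounded before
print(np.corrcoef(np.log(mu), calibrated)[0, 1])  # ~0 after calibration
```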
yun-s-song.bsky.social
GPN-Star features a novel phylogeny-aware architecture that enables the model to explicitly capture evolutionary relationships encoded in whole-genome alignments and overcomes the key limitations of our earlier model GPN-MSA (doi.org/10.1038/s415...).
(2/n)
yun-s-song.bsky.social
We are excited to share GPN-Star, a cost-effective, biologically grounded genomic language modeling framework that achieves state-of-the-art performance across a wide range of variant effect prediction tasks relevant to human genetics.
www.biorxiv.org/content/10.1...
(1/n)
yun-s-song.bsky.social
Thanks, Josh. I wish you had been one of our reviewers—life would’ve been so much easier.
yun-s-song.bsky.social
SINGER, our ARG inference method, is finally published and freely available online:

doi.org/10.1038/s415...

It was a long journey – 16 months from initial submission to acceptance. Is it just me, or has peer review gotten more arduous lately? 4+ rounds of review aren't so unusual these days...
Robust and accurate Bayesian inference of genome-wide genealogies for hundreds of genomes - Nature Genetics
SINGER is a method for creating ancestral recombination graphs to understand the genealogical history of genomes. The method has increased speed, and thus scalability, without sacrificing accuracy.
doi.org
Reposted by Yun S. Song
alan-aw.bsky.social
Hi Bluesky — Dedicating my first post to this work and software, led by the incredibly meticulous and capable @fandingzhou.bsky.social! An earlier version of this was shared at the 2022 Bioconductor Conference (bioc2022.bioconductor.org/schedule/).
fandingzhou.bsky.social
Gene expression changes aren’t just about mean shifts — variability shifts matter too, especially for aging. We're thrilled to introduce QRscore, a flexible non-parametric framework for detecting shifts in mean and variance across conditions. doi.org/10.1016/j.cr...
yun-s-song.bsky.social
This work was led by my talented student Milind Jagota @milindjagota.bsky.social in collaboration with colleagues at UC Berkeley, UCSF (the Ye Lab @yimmieg.bsky.social), and Fred Hutch (the Matsen Lab @matsen.bsky.social). We are grateful to all co-authors for their enthusiasm and hard work. (n/n)
yun-s-song.bsky.social
From a machine learning perspective, this work illustrates the value of high-quality negative examples. The paper is mostly focused on BCR light chains, but we are excited about extensions. (10/n)
yun-s-song.bsky.social
We interpret which sequence features the model associates with dysfunction. One example is shown below: for a specific light-chain V and J gene pair, we observe sharp selection on CDRL3 length and on certain amino acids. (9/n)
yun-s-song.bsky.social
In new data, we find that very low scores are associated with reduced surface expression in naive B cells. To our knowledge, this is the first time expression variation in naive B cells has been linked to the light chain. (8/n)
yun-s-song.bsky.social
B cells can further mutate antibodies to improve binding. We compare observed mutations to random control sets of mutations. Mutations that significantly decrease model scores appear to be selected out. However, this signal is detectable at only a few positions. (7/n)
yun-s-song.bsky.social
Models trained on allelic inclusion generalize to predict antibody properties they were never directly trained on. Here we apply the models to independent data measuring polyreactivity of human antibodies and find that model scores correlate with measured polyreactivity. Baselines don't capture this signal. (6/n)
yun-s-song.bsky.social
We don’t know which sequence in each double-light B cell is “bad”, but we develop a training framework that doesn’t need this information. We compare with baseline approaches that don’t use the new allelic inclusion data. (5/n)
yun-s-song.bsky.social
We propose using double-light B cells as negative examples for antibody machine learning. Double-light B cells can be observed at scale in some recent datasets of human antibodies. Each such cell has one “bad” sequence, whereas other cells all have functional antibodies. (4/n)
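The paper develops its own training framework for learning from double-light cells; as one hypothetical formulation of the idea in the two posts above (at least one chain in each pair is dysfunctional, so the model should assign low probability to both being functional), a multiple-instance-style loss could look like this:

```python
import torch

def pair_loss(p_a: torch.Tensor, p_b: torch.Tensor) -> torch.Tensor:
    """Double-light cells: at least one light chain is assumed dysfunctional,
    so P(both functional) = p_a * p_b should be driven toward zero.
    p_a, p_b are model-predicted probabilities that each chain is functional."""
    return -torch.log(1.0 - p_a * p_b + 1e-8).mean()

def positive_loss(p: torch.Tensor) -> torch.Tensor:
    """Normal cells: the single expressed light chain is functional."""
    return -torch.log(p + 1e-8).mean()

# Toy usage with stand-ins for the model's predicted probabilities:
p_a = torch.rand(32, requires_grad=True)
p_b = torch.rand(32, requires_grad=True)
p_pos = torch.rand(64, requires_grad=True)
loss = positive_loss(p_pos) + pair_loss(p_a, p_b)
loss.backward()  # would backpropagate into the real model in practice
```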