Lightnews — Scholar-powered news

Reposted by Pooja Kathail

Liana Lareau @lianafaye.bsky.social · Aug 7

This preprint from Helen Sakharova is one of the coolest things to come out of my lab: “Protein language models reveal evolutionary constraints on synonymous codon choice.” Codon choice is a big puzzle in how information is encoded in genomes, and we have a new angle. www.biorxiv.org/content/10.1...

Protein language models reveal evolutionary constraints on synonymous codon choice

Evolution has shaped the genetic code, with subtle pressures leading to preferences for some synonymous codons over others. Codons are translated at different speeds by the ribosome, imposing constrai...

www.biorxiv.org


Reposted by Pooja Kathail

Anshul Kundaje @anshulkundaje.bsky.social · Jun 9

Congratulations to incoming postdoc @rrastogi.bsky.social for being awarded the Warren Alpert Postdoctoral Scholarship! Looking forward to having him join us soon!


Reposted by Pooja Kathail

David A Knowles @davidaknowles.bsky.social · May 31

We had a bunch of requests, so we're extending the #MLCB2025 deadline to June 3rd (anywhere on earth)! Submit at cmt3.research.microsoft.com/MLCB2025.


Reposted by Pooja Kathail

Sara Mostafavi @saramostafavi.bsky.social · Apr 16

Some encouraging news for cross-gene generalization of allele effects in S2F models. www.biorxiv.org/content/10.1...

Deep genomic models of allele-specific measurements

Allele-specific quantification of sequencing data, such as gene expression, allows for a causal investigation of how DNA sequence variations influence cis gene regulation. Current methods for analyzin...

www.biorxiv.org


Reposted by Pooja Kathail

Sara Mostafavi @saramostafavi.bsky.social · Mar 15

Our new pre-print, investigating a few important questions when we train S2F models on different types of MPRA datasets. Congrats to Yilun and @xinmingtu.bsky.social www.biorxiv.org/content/10.1...

Investigating Data Size, Sequence Diversity, and Model Complexity in MPRA-based Sequence-to-Function Prediction

We created the MPRA Dataset Collection (MDC), a curated resource of MPRA data from 12 studies comprising over 150 million labeled DNA subsequences. These datasets include both random and natural genom...

www.biorxiv.org


Reposted by Pooja Kathail

Jeremy Berg @jeremymberg.bsky.social · Mar 11

I have confirmation from several sources now that all T32s, many F30s and F31s, and most or all Center awards (P30, P50) have been terminated at Columbia.

This is quite damaging to research and to individuals.

This is pure terrorism and cannot be legal. But litigation will take time...


Reposted by Pooja Kathail

David A Knowles @davidaknowles.bsky.social · Mar 11

Wow. "NIH" canceled my co-mentored (with Dave Sulzer) PhD student's F31 funding. His work is on understanding the genetics and neuroscience of language learning disorders. F31 provides no indirect $ to Columbia, just pays his salary. Not that it should matter, but he's an American citizen. W.T.F.


Reposted by Pooja Kathail

Fernando Pérez @fernandoperez.org · Mar 7

It's today, T-3h! If you're in the East Bay and care about science or education (i.e. if you care about living on this planet in any form 😃), join us, 11:45 at Upper Sproul!

And if you're elsewhere, look up a local event in your area, there's a LOT happening today!

www.standup4scienceberkeley.com

Map of Northern hemisphere with many blue place markers.


Reposted by Pooja Kathail

Ya'el Courtney, PhD @scienceyael.bsky.social · Feb 27

NEXT FRIDAY! San Francisco. I'll be there.

@standupforscience.bsky.social #StandUpforScience #SciComm #Science

San Francisco
Stand Up for Science 2025
March 7, 2025
Civic Center Plaza
1-3pm
science is for everyone
find your local rally site and other ways to get involved
standupforscience2025.org


Reposted by Pooja Kathail

Sara Mostafavi @saramostafavi.bsky.social · Feb 23

Our new paper describing a scalable approach for training sequence-to-function models on personal genomes ("personal genome training"), includes our observations on when this works and its limitations. www.biorxiv.org/content/10.1...
Congrats: Anna, @xinmingtu.bsky.social, @lxsasse.bsky.social

A scalable approach to investigating sequence-to-expression prediction from personal genomes

A key promise of sequence-to-function (S2F) models is their ability to evaluate arbitrary sequence inputs, providing a robust framework for understanding genotype-phenotype relationships. However, despite strong performance across genomic loci, S2F models struggle with inter-individual variation. Training a model to make genotype-dependent predictions at a single locus, an approach we call personal genome training, offers a potential solution. We introduce SAGE-net, a scalable framework and software package for training and evaluating S2F models using personal genomes. Leveraging its scalability, we conduct extensive experiments on model and training hyperparameters, demonstrating that training on personal genomes improves predictions for held-out individuals. However, the model achieves this by identifying predictive variants rather than learning a cis-regulatory grammar that generalizes across loci. This failure to generalize persists across a range of hyperparameter settings. These findings highlight the need for further exploration to unlock the full potential of S2F models in decoding the regulatory grammar of personal genomes. Scalable software and infrastructure development will be critical to this progress.

www.biorxiv.org


Reposted by Pooja Kathail

Andrew Marderstein @amarderstein.bsky.social · Feb 19

New preprint w/ @soumyakundu.bsky.social @sbmontgom.bsky.social @anshulkundaje.bsky.social!

Using deep learning & scATAC-seq, we studied context-specific variants in disease & evolution, and introduce FLARE for de novo mutations—w/ application to autism-affected families.

doi.org/10.1101/2025...

Mapping the regulatory effects of common and rare non-coding variants across cellular and developmental contexts in the brain and heart

Whole genome sequencing has identified over a billion non-coding variants in humans, while GWAS has revealed the non-coding genome as a significant contributor to disease. However, prioritizing causal...

www.biorxiv.org


Reposted by Pooja Kathail

Saori Sakaue @saorisakaue.bsky.social · Feb 27

📣 Excited to share my last postdoc paper with @soumya-boston.bsky.social on eQTL mechanisms depending on where the RNA is in the cell! @broadinstitute.org @harvardmed.bsky.social
TL;DR: Early RNA eQTL variants in the nucleus and late RNA eQTL variants in the cytosol have distinct molecular mechanisms 🧵


Reposted by Pooja Kathail

Peter Koo @pkoo562.bsky.social · Feb 5

[SAVE THE DATE] MLCB 2025 is happening Sept 10-11 at the NY Genome Center in NYC!

Attend the premier conference at the intersection of ML & Bio, share your research and make lasting connections!

Submission deadline: June 1
More details: mlcb.github.io

Help spread the word—please RT! #MLCB2025


Reposted by Pooja Kathail

David A Knowles @davidaknowles.bsky.social · Jan 27

#MLCB2025 will be Sept 10-11 at @nygenome.org in NYC! Paper deadline June 1st & in-person registration will open in May. Please sign up for our mailing list groups.google.com/g/mlcb/ for future announcements. More details at mlcb.github.io. Please RP!


Reposted by Pooja Kathail

Austin Wang @austintwang.bsky.social · Dec 11

(1/10) Excited to announce our latest work! @arpita-s.bsky.social, @amanpatel100.bsky.social, and I will be presenting DART-Eval, a rigorous suite of evals for DNA Language Models on transcriptional regulatory DNA at #NeurIPS2024. Check it out! arxiv.org/abs/2412.05430

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn gen...

arxiv.org


Reposted by Pooja Kathail

Amy Lu @amyxlu.bsky.social · Dec 6

1/🧬 Excited to share PLAID, our new approach for co-generating sequence and all-atom protein structures by sampling from the latent space of ESMFold. This requires only sequences during training, which unlocks more data and annotations:

bit.ly/plaid-proteins
🧵


Pooja Kathail @poojakathail.bsky.social · Nov 20

Finally, we discuss downstream applications of models to understand disease-relevant non-coding variants, such as functionally informed fine-mapping and de novo variant prioritization. 4/4


Pooja Kathail @poojakathail.bsky.social · Nov 20

We also review variant effect prediction evaluations that have been performed to date on genomic deep learning models, highlighting strengths and limitations of current models and the need for more comprehensive evaluation. 3/4

Overview of variant effect prediction evaluations that have been performed to date using current genomic deep learning models.


Pooja Kathail @poojakathail.bsky.social · Nov 20

We cover two popular genomic deep learning modeling paradigms — supervised sequence-to-activity models and self-supervised genomic language models — and describe practical considerations for using both types of models to make variant effect predictions. 2/4

Schematic overview of two popular genomic deep learning modeling paradigms.

Constructing variant effect predictions using genomic deep learning models


Pooja Kathail @poojakathail.bsky.social · Nov 20

Super excited to share our review on genomic deep learning models for non-coding variant effect prediction, with Ayesha Bajwa and Nilah Ioannidis. We’d like this review to be a useful resource, and welcome any feedback, comments, or questions! 1/4

arxiv.org/abs/2411.11158

Leveraging genomic deep learning models for non-coding variant effect prediction

The majority of genetic variants identified in genome-wide association studies of complex traits are non-coding, and characterizing their function remains an important challenge in human genetics. Gen...

arxiv.org
