Peter Koo
@pkoo562.bsky.social
AI4Science researcher. Associate Professor @CSHL. My lab advances AI for genomics and healthcare! http://koo-lab.github.io
Pinned
pkoo562.bsky.social
2025 Machine Learning in Computational Biology (#MLCB) meeting starts TODAY (9/10) at 9:30a (ET) at the NY Genome Center in NYC!

We have a great lineup of keynotes, contributed talks, and posters today and tomorrow

Schedule: mlcb.org/schedule

Join for free via livestream: m.youtube.com/@mlcbconf
MLCB - Schedule
The in-person component will be held at the New York Genome Center, 101 6th Ave, New York, NY 10013. All times below are Eastern Time.
mlcb.org
pkoo562.bsky.social
Congratulations to John Clarke, Michel Devoret and John Martinis on receiving the 2025 Nobel Prize in Physics!
www.nobelprize.org/prizes/physi...

I have fond memories of my time in the Clarke lab, where I did my Honors Thesis on ultra low-field MRI w/ SQUIDs as an undergrad at UC Berkeley!
pkoo562.bsky.social
Check out the Research Highlight on our work in @naturemethods by Lin Tang!

www.nature.com/articles/s41...
pkoo562.bsky.social
Richard Bonneau giving the last keynote on navigating the complexity of drug discovery and their lab-in-the-loop for molecule design! #MLCB
pkoo562.bsky.social
First talk is a (surprise) keynote by Jacob Schreiber from UMass Medical, talking about fruit-themed AI tools for understanding and designing regulatory DNA!
pkoo562.bsky.social
Now Barbara Engelhardt giving a keynote on characterizing behaviors of modified T cells in live cell imaging data using machine learning!
pkoo562.bsky.social
Next talk by Courtney Shearer who is talking about genomic language models for zero shot promoter indel effects!
pkoo562.bsky.social
Next talk by Alan Murphy and Masayuki (Moon) Nagai (from my lab!), who are talking about how naive fine-tuning of genomic DNNs leads to catastrophic forgetting. They propose *iterative causal refinement* to move from learned associations toward a causal understanding of cis-regulatory biology!
pkoo562.bsky.social
Next talk by Johannes Linder at Calico. Talking about expanding genomic seq2fun DNNs with RBP binding and RNA processing data to consider post-transcriptional regulation.
pkoo562.bsky.social
Some technical delays but we are all set!

First talk by Alexis Battle! @alexisbattle.bsky.social
pkoo562.bsky.social
Here's another unpublished result:

We compared probing strategies to assess how informative the pretrained representations are, benchmarking Evo 2 vs D3 on Drosophila enhancer activity measured via STARR-seq.

Again, D3 outperforms Evo 2 (and one-hot) across all probing methods!
pkoo562.bsky.social
But when we trained D3 (score-entropy discrete diffusion for regulatory DNA) in an unsupervised manner on the genomic sequences, probing the representations of D3 was comparable to supervised SOTA (even with a basic CNN)! (D3: 100M parameters vs Evo 2: 40B parameters)
pkoo562.bsky.social
*Easter egg alert* NOT in the published paper. We also benchmarked Evo 2, and while it did better than other gLMs (consistent with the idea that scale can improve gLMs), it still falls short of a basic CNN trained on one-hot sequences and far short of supervised SOTA.
pkoo562.bsky.social
Also, my perspective is coming from gLMs applied to human genomes. I think they have a lot of potential for small compact genomes that don't have as layered regulation as higher-order eukaryotes.
pkoo562.bsky.social
gLMs show promise in learning structure in the genome, but we need to rethink how we tokenize the genome (and no, byte pair encoding isn't the answer either) or come up with a better masking strategy for the non-coding genome that is different from other regions (e.g. coding).
pkoo562.bsky.social
Tokenizing nucleotides/k-mers and treating each token equally is like injecting lots of random words between every word in a sentence and hoping that an LLM will learn the structure of the English language.
pkoo562.bsky.social
It's unclear whether standard NLP-based objectives (MLM or CLM) will bring us to the promised land.

Unlike proteins, which have conservation at the sequence and covariation levels, the non-coding genome is conserved at the functional level -- lots of drift and uninformative positions!
pkoo562.bsky.social
There are many great applications for gLMs -- I'm not just a hater. The central dogma (or whatever is being sold) is not one of them.

In terms of non-coding genome regulation (outside of splice sites) in humans, there is a huge uphill battle.
pkoo562.bsky.social
The constant propagation of pointless gLM benchmarks in the ML field (disconnected from how biologists will use them) is what is giving gLMs unwarranted hype, and it needs to be broken. The field must rally around useful applications of gLMs.
pkoo562.bsky.social
Our benchmark is far from complete! It shows how current gLMs struggle with zero-shot prediction of cell-type-specific regulation. Think about all the differential regulation across cell types being projected onto a single genome -- this is hard to learn w/o functional data!
pkoo562.bsky.social
This went through 3 rounds of review at another journal, but 1 reviewer was adamant that this type of benchmark might be harmful to the burgeoning gLM field, which currently only benchmarks relative performance on (nearly) useless benchmarks in the non-coding regions. It was rejected!