Lightnews — Scholar-powered news

Peter Koo @pkoo562.bsky.social · 1d

Congratulations to John Clarke, Michel Devoret and John Martinis on receiving the 2025 Nobel Prize in Physics!
www.nobelprize.org/prizes/physi...

I have fond memories of my time in the Clarke lab, where I did my Honors Thesis on ultra low-field MRI w/ SQUIDs as an undergrad at UC Berkeley!


Peter Koo @pkoo562.bsky.social · 19d

Check out the Research Highlight on our work in @naturemethods by Lin Tang!

www.nature.com/articles/s41...


Peter Koo @pkoo562.bsky.social · 27d

Richard Bonneau giving the last keynote on navigating the complexity of drug discovery and their lab-in-the-loop for molecule design! #MLCB


Peter Koo @pkoo562.bsky.social · 27d

First talk is a (surprise) keynote by Jacob Schreiber from UMass Medical, talking about fruit-themed AI tools for understanding and designing regulatory DNA!


Peter Koo @pkoo562.bsky.social · 27d

2025 MLCB day 2 is starting now!

Streaming live now!
m.youtube.com/watch?v=PxlXNb…



Peter Koo @pkoo562.bsky.social · 28d

Now Barbara Engelhardt giving a keynote on characterizing behaviors of modified T cells in live cell imaging data using machine learning!


Peter Koo @pkoo562.bsky.social · 28d

Next talk by Courtney Shearer, who is talking about genomic language models for zero-shot prediction of promoter indel effects!


Peter Koo @pkoo562.bsky.social · 28d

Next talk by Alan Murphy and Masayuki (Moon) Nagai (from my lab!), who are talking about how naively fine-tuning genomic DNNs leads to catastrophic forgetting and proposing *iterative causal refinement* to move learned associations toward a causal understanding of cis-regulatory biology!


Peter Koo @pkoo562.bsky.social · 28d

Next talk by Johannes Linder at Calico, talking about expanding genomic seq2fun DNNs with RBP binding and RNA processing data to account for post-transcriptional regulation.


Peter Koo @pkoo562.bsky.social · 28d

Some technical delays but we are all set!

First talk by Alexis Battle! @alexisbattle.bsky.social


Peter Koo @pkoo562.bsky.social · 28d

Here is the YouTube live link:

www.youtube.com/live/19I7xTh...

Starts at 9:30a!

Machine Learning in Computational Biology 2025

YouTube video by Machine Learning in Computational Biology

www.youtube.com


Peter Koo @pkoo562.bsky.social · 28d

2025 Machine Learning in Computational Biology (#MLCB) meeting starts TODAY (9/10) at 9:30a (ET) at the NY Genome Center in NYC!

We have a great lineup of keynotes, contributed talks, and posters today and tomorrow

Schedule: mlcb.org/schedule

Join for free via livestream: m.youtube.com/@mlcbconf

MLCB - Schedule

The in-person component will be held at the New York Genome Center, 101 6th Ave, New York, NY 10013. All times below are Eastern Time.

mlcb.org


Peter Koo @pkoo562.bsky.social · Jul 16

Here's another unpublished result:

We compared probing strategies to assess how informative the pretrained representations are, benchmarking Evo 2 vs D3 on Drosophila enhancer activity measured via STARR-seq.

Again, D3 outperforms Evo 2 (and one-hot) across all probing methods!


Peter Koo @pkoo562.bsky.social · Jul 16

But, when we trained D3 (score-entropy discrete diffusion for regulatory DNA) in an unsupervised manner on the genomic sequences, probing the representations of D3 was comparable to supervised SOTA (even with a basic CNN)! (100M parameters vs 40B parameters)


Peter Koo @pkoo562.bsky.social · Jul 16

*Easter egg alert* NOT in the published paper. We also benchmarked Evo 2, and while it did better than other gLMs (consistent with the idea that scale can improve gLMs), it still falls short of a basic CNN trained on one-hot sequences and far short of supervised SOTA.

Peter Koo @pkoo562.bsky.social · Jul 16

Our work on "Evaluating the representational power of pre-trained DNA language models for regulatory genomics" led by @AmberZqt with help from @NiraliSomia & @stevenyuyy is finally published in Genome Biology! Check it out!

genomebiology.biomedcentral.com/articles/10....

Evaluating the representational power of pre-trained DNA language models for regulatory genomics - Genome Biology

Background The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of ...

genomebiology.biomedcentral.com


Peter Koo @pkoo562.bsky.social · Jul 16

Also, my perspective is coming from gLMs applied to human genomes. I think they have a lot of potential for small compact genomes that don't have as layered regulation as higher-order eukaryotes.


Peter Koo @pkoo562.bsky.social · Jul 16

gLMs show promise in learning structure in the genome, but we need to rethink how we either tokenize the genome (and no, byte-pair encoding isn't the answer either) or come up with a better masking strategy for the non-coding genome that is different from other regions (e.g., coding).


Peter Koo @pkoo562.bsky.social · Jul 16

Tokenizing nucleotides/k-mers and treating each token equally is like injecting lots of random words between every word in a sentence and hoping that an LLM will learn the structure of the English language.


Peter Koo @pkoo562.bsky.social · Jul 16

It's unclear whether standard NLP-based objectives (MLM or CLM) will bring us to the promised land.

Unlike proteins, which have conservation at the sequence and covariation levels, the non-coding genome is conserved at the functional level -- lots of drift and uninformative positions!


Peter Koo @pkoo562.bsky.social · Jul 16

There are many great applications for gLMs -- I'm not just a hater. The central dogma (or whatever is being sold) is not one of them.

In terms of non-coding genome regulation (outside of splice sites) in humans, there is a huge uphill battle.


Peter Koo @pkoo562.bsky.social · Jul 16

The constant propagation of pointless gLM benchmarks in the ML field (benchmarks that are disconnected from how biologists will use the models) is what is giving gLMs unwarranted hype. The field must rally around useful applications of gLMs.


Peter Koo @pkoo562.bsky.social · Jul 16

Our benchmark is far from complete! It shows how current gLMs struggle with zero-shot prediction of cell-type-specific regulation. Think about all the differential regulation across cell types being projected onto a single genome -- this is hard to learn w/o functional data!


Peter Koo @pkoo562.bsky.social · Jul 16

This went through 3 rounds of review at another journal, but 1 reviewer was adamant that this type of benchmark might be harmful to the burgeoning gLM field, which currently only benchmarks relative performance on (nearly) useless benchmarks in the non-coding regions. It was rejected!

