Lightnews — Scholar-powered news

Reposted by Austin Wang

Anshul Kundaje @anshulkundaje.bsky.social · Jun 19

@saramostafavi.bsky.social (@Genentech) & I (@Stanford) r excited to announce co-advised postdoc positions for candidates with deep expertise in ML for bio (especially sequence to function models, causal perturbational models & single cell models). See details below. Pls RT 1/

1 40 55

Reposted by Austin Wang

Anshul Kundaje @anshulkundaje.bsky.social · May 15

Today was a big day for the lab. We had two back to back thesis defenses and the defenders defended with great science and character.

Congrats to DR. Kelly Cochran & DR. @soumyakundu.bsky.social on this momentous achievement.

Brilliant scientists with brilliant futures ahead. 🎉🎉🎉

2 7 77

Reposted by Austin Wang

Selin Jessa @selinjessa.bsky.social · May 3

Delighted to share our latest work deciphering the landscape of chromatin accessibility and modeling the DNA sequence syntax rules underlying gene regulation during human fetal development! www.biorxiv.org/content/10.1... Read on for more: 🧵 1/16 #GeneReg 🧬🖥️

Dissecting regulatory syntax in human development with scalable multiomics and deep learning

Transcription factors (TFs) establish cell identity during development by binding regulatory DNA in a sequence-specific manner, often promoting local chromatin accessibility, and regulating gene expre...

www.biorxiv.org

2 59 130

Reposted by Austin Wang

Jacob Schreiber @jmschreiber91.bsky.social · Apr 24

Our preprint on designing and editing cis-regulatory elements using Ledidi is out! Ledidi turns *any* ML model (or set of models) into a designer of edits to DNA sequences that induce desired characteristics.

Preprint: www.biorxiv.org/content/10.1...
GitHub: github.com/jmschrei/led...

Programmatic design and editing of cis-regulatory elements

The development of modern genome editing tools has enabled researchers to make such edits with high precision but has left unsolved the problem of designing these edits. As a solution, we propose Ledi...

www.biorxiv.org

2 37 120

Reposted by Austin Wang

Anshul Kundaje @anshulkundaje.bsky.social · Jan 7

Very excited to announce that the single cell/nuc. RNA/ATAC/multi-ome resource from ENCODE4 is now officially public. This includes raw data, processed data, annotations and pseudobulk products. Covers many human & mouse tissues. 1/

www.encodeproject.org/single-cell/...

Single cell – ENCODEHomo sapiens clickable body map

www.encodeproject.org

6 86 290

Reposted by Austin Wang

Anshul Kundaje @anshulkundaje.bsky.social · Dec 25

Our ChromBPNet preprint out!

www.biorxiv.org/content/10.1...

Huge congrats to Anusri! This was quite a slog (for both of us) but we r very proud of this one! It is a long read but worth it IMHO. Methods r in the supp. materials. Bluetorial coming soon below 1/

8 89 230

Austin Wang @austintwang.bsky.social · Dec 14

I think that’ll be interesting to look more into! The profile information does not convey overall accessibility since it’s normalized, but maybe this sort of multitasking could help.

1

Austin Wang @austintwang.bsky.social · Dec 11

Thank you for the kind words! Yes, ChromBPNet uses unmodified models, which includes profile data and a bias model. However these evaluations use only the count head.

1 1

Reposted by Austin Wang

Arpita Singhal @arpita-s.bsky.social · Dec 11

Excited to announce DART-Eval, our latest work on benchmarking DNALMs! Catch us at #NeurIPS!

Austin Wang @austintwang.bsky.social · Dec 11

(1/10) Excited to announce our latest work! @arpita-s.bsky.social, @amanpatel100.bsky.social , and I will be presenting DART-Eval, a rigorous suite of evals for DNA Language Models on transcriptional regulatory DNA at #NeurIPS2024. Check it out! arxiv.org/abs/2412.05430

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn gen...

arxiv.org

5 8

Reposted by Austin Wang

amanpatel100.bsky.social @amanpatel100.bsky.social · Dec 11

New work! Come check out our poster tomorrow and take a look at the paper!

Austin Wang @austintwang.bsky.social · Dec 11

(1/10) Excited to announce our latest work! @arpita-s.bsky.social, @amanpatel100.bsky.social , and I will be presenting DART-Eval, a rigorous suite of evals for DNA Language Models on transcriptional regulatory DNA at #NeurIPS2024. Check it out! arxiv.org/abs/2412.05430

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn gen...

arxiv.org

3 5

Austin Wang @austintwang.bsky.social · Dec 11

(10/10) Come check out our poster (tomorrow Dec 11 at 11 AM) or read the paper for more details!

arxiv.org/abs/2412.05430

github.com/kundajelab/D...

neurips.cc/virtual/2024...

#machinelearning #NeurIPS2024 #genomics

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn gen...

arxiv.org

1 8

Austin Wang @austintwang.bsky.social · Dec 11

(9/10) How do we train more effective DNALMs? Use better data and objectives:
• Nailing short-context tasks before long-context
• Data sampling to account for class imbalance
• Conditioning on cell type context
These strategies use external annotations, which are plentiful!

1 1 7

Austin Wang @austintwang.bsky.social · Dec 11

(8/10) This indicates that DNALMs inconsistently learn functional DNA. We believe that the culprit is not architecture, but rather the sparse and imbalanced distribution of functional DNA elements.

Given their resource requirements, current DNALMs are a hard sell.

1 1 7

Austin Wang @austintwang.bsky.social · Dec 11

(7/10) DNALMs struggle with more difficult tasks.
Furthermore, small models trained from scratch (<10M params) routinely outperform much larger DNALMs (>1B params), even after LoRA fine-tuning!
Our results on the hardest task - counterfactual variant effect prediction.

3 1 6

Austin Wang @austintwang.bsky.social · Dec 11

(6/10) We introduce DART-Eval, a suite of five biologically informed DNALM evaluations focusing on transcriptional regulatory DNA ordered by increasing difficulty.

1 1 4

Austin Wang @austintwang.bsky.social · Dec 11

(5/10) Rigorous evaluations of DNALMs, though critical, are lacking. Existing benchmarks:
• Focus on surrogate tasks tenuously related to practical use cases
• Suffer from inadequate controls and other dataset design flaws
• Compare against outdated or inappropriate baselines

1 2

Austin Wang @austintwang.bsky.social · Dec 11

(4/10) An effective DNALM should:
• Learn representations that can accurately distinguish different types of functional DNA elements
• Serve as a foundation for downstream supervised models
• Outperform models trained from scratch

1 2

Austin Wang @austintwang.bsky.social · Dec 11

(3/10) However, DNA is vastly different from text, being much more heterogeneous, imbalanced, and sparse. Imagine a blend of several different languages interspersed with a load of gibberish.

1 3 8

Austin Wang @austintwang.bsky.social · Dec 11

(2/10) DNALMs are a new class of self-supervised models for DNA, inspired by the success of LLMs. These DNALMs are often pre-trained solely on genomic DNA without considering any external annotations.

1 3

Austin Wang @austintwang.bsky.social · Dec 11

(1/10) Excited to announce our latest work! @arpita-s.bsky.social, @amanpatel100.bsky.social , and I will be presenting DART-Eval, a rigorous suite of evals for DNA Language Models on transcriptional regulatory DNA at #NeurIPS2024. Check it out! arxiv.org/abs/2412.05430

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Recent advances in self-supervised models for natural language, vision, and protein sequences have inspired the development of large genomic DNA language models (DNALMs). These models aim to learn gen...

arxiv.org

1 27 70