Ian Shi
@heyitsmeianshi.bsky.social
PhD Student @ University of Toronto. Building foundation models for genomics!
Pinned
heyitsmeianshi.bsky.social
We're excited to release 𝐦𝐑𝐍𝐀𝐁𝐞𝐧𝐜𝐡, a new benchmark suite for mRNA biology containing 10 diverse datasets with 59 prediction tasks, evaluating 18 foundation model families.

Paper: biorxiv.org/content/10.1...
GitHub: github.com/morrislab/mR...
Blog: blank.bio/post/mrnabench
heyitsmeianshi.bsky.social
And also to our collaborators who provided datasets and thoughtful advice including Simran, Andrew, Cyrus, Defne, Jessica, Kaitlin, Ilyes, @bowang87.bsky.social and @quaidmorris.bsky.social! (+ MSK HPC for making this all possible)
heyitsmeianshi.bsky.social
A huge thanks to @taykhoomdalal.bsky.social, @phil-fradkin.bsky.social, and Divya for pushing this work across the finish line!
heyitsmeianshi.bsky.social
As our results show, most models struggle to compositionally generalize, revealing a significant gap in their ability to truly understand regulation.

We hope that this experimental setup and others like it can inform new directions for nucleotide foundation model development.
heyitsmeianshi.bsky.social
(4) Together with Divya Koyyalagunta, we further assess the ability of foundation models to compositionally generalize from learned motifs.

Models are exposed to either of two sequence elements that promote translation, but never to both, and we task them with predicting the unseen combination.
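Below is a toy sketch of how such a compositional split could be constructed (my illustration of the setup described above, with hypothetical stand-in motifs, not the paper's exact procedure): sequences carrying element A or element B go to training, while the A+B combination is held out for evaluation.

```python
# Toy compositional split: single elements in training, the combination held out.
# MOTIF_A / MOTIF_B are hypothetical stand-ins for translation-promoting elements.
MOTIF_A, MOTIF_B = "GCCACC", "AATAAA"

def compositional_split(records):
    """records: list of (sequence, label) pairs."""
    train, test = [], []
    for seq, label in records:
        has_a, has_b = MOTIF_A in seq, MOTIF_B in seq
        if has_a and has_b:
            test.append((seq, label))    # unseen combination, evaluation only
        elif has_a or has_b:
            train.append((seq, label))   # single elements seen during training
        # sequences with neither element could be assigned to either side
    return train, test

records = [
    ("GCCACCATGAAACCC", 0.7),
    ("TTTAAATAAATTTAA", 0.6),
    ("GCCACCAAATAAACC", 0.95),  # carries both elements: held out
]
print(compositional_split(records))
```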
heyitsmeianshi.bsky.social
(3) Finally, we assess the limitations of current benchmarking and modelling efforts.

A common source of data leakage is sequence homology, leading to overestimation of performance without careful data splits. We demonstrate the impact of improper splitting in our tasks.
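For intuition, here is a minimal homology-aware split in the spirit described above (a toy stand-in using k-mer Jaccard similarity; real pipelines typically cluster with tools like MMseqs2 or CD-HIT): similar sequences are grouped first, and whole clusters are assigned to train or test so homologs never straddle the split.

```python
import random

def kmer_set(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / max(1, len(a | b))

def greedy_cluster(seqs, k=5, thresh=0.5):
    # Greedily assign each sequence to the first cluster whose representative
    # it resembles; otherwise start a new cluster.
    reps, clusters = [], []
    for idx, s in enumerate(seqs):
        ks = kmer_set(s, k)
        for ci, rep in enumerate(reps):
            if jaccard(ks, rep) >= thresh:
                clusters[ci].append(idx)
                break
        else:
            reps.append(ks)
            clusters.append([idx])
    return clusters

def cluster_split(seqs, test_frac=0.2, seed=0):
    # Assign entire clusters to the test set until the target fraction is met.
    clusters = greedy_cluster(seqs)
    random.Random(seed).shuffle(clusters)
    n_test = int(len(seqs) * test_frac)
    train, test = [], []
    for c in clusters:
        (test if len(test) < n_test else train).extend(c)
    return train, test

seqs = ["ACGTACGTACGTACGT", "ACGTACGTACGTACGA",
        "TTTTGGGGCCCCAAAA", "GGGGCCCCAAAATTTT"]
print(cluster_split(seqs, test_frac=0.5))  # indices split by cluster, not by sequence
```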
heyitsmeianshi.bsky.social
@taykhoomdalal.bsky.social further explored this phenomenon and developed a joint CL + MLM objective, demonstrating that the joint loss yields superior downstream performance. Remarkably, adding an MLM objective to Orthrus achieves SOTA results with 700x fewer parameters.
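Conceptually, the joint objective is just a weighted sum of the two losses. A minimal PyTorch sketch (illustrative shapes and weighting only, not Taykhoom's actual implementation):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two views of the same transcript.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def joint_loss(z1, z2, mlm_logits, mlm_targets, mlm_weight=1.0):
    # mlm_logits: (batch, seq_len, vocab); mlm_targets uses -100 at unmasked positions.
    cl = info_nce(z1, z2)
    mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_targets.flatten(), ignore_index=-100)
    return cl + mlm_weight * mlm

# Dummy tensors just to show the expected shapes.
B, L, D, V = 8, 128, 256, 6
mlm_targets = torch.full((B, L), -100)
mlm_targets.scatter_(1, torch.randint(0, L, (B, 16)), 2)  # pretend ~16 masked positions per sequence
print(joint_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, L, V), mlm_targets).item())
```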
heyitsmeianshi.bsky.social
(2) Choice of pre-training objective has a noticeable impact on downstream performance.

Orthrus, trained using contrastive learning (CL), performs better on "global" sequence-level property prediction compared to finer-resolution tasks, consistent with known CL limitations.
heyitsmeianshi.bsky.social
In one of the coolest analyses of this paper, @phil-fradkin.bsky.social quantified the distributional differences between mRNA, ncRNA, and genomic regions through their cross-compressibility under a Huffman encoding scheme, reinforcing that they follow distinct regulatory grammars.
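The core idea: fit a Huffman code on the k-mer statistics of one sequence class, then ask how many bits per k-mer it takes to encode another class. A minimal sketch of that cross-compressibility measure (my own illustration, not the paper's exact procedure):

```python
import heapq
from collections import Counter
from itertools import count

def kmer_counts(seqs, k=3):
    c = Counter()
    for s in seqs:
        c.update(s[i:i + k] for i in range(len(s) - k + 1))
    return c

def huffman_code_lengths(freqs):
    # Returns {symbol: code length in bits} for a Huffman code built on freqs.
    tiebreak = count()
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: 1 for sym in freqs}
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def bits_per_kmer(code_lengths, seqs, k=3, unseen_cost=16):
    # Average encoding cost of seqs under a code fit elsewhere; unseen k-mers pay a fixed penalty.
    counts = kmer_counts(seqs, k)
    total = sum(counts.values())
    return sum(code_lengths.get(s, unseen_cost) * n for s, n in counts.items()) / total

mrna  = ["AUGGCCAUGGCGCCCAGAACUGAG", "AUGCUGACGGAGCUCGAGGAGAAA"]
ncrna = ["GGCGGCGGCGGCUUAGCGCGCGCG", "UUAGGGUUAGGGUUAGGGUUAGGG"]

code = huffman_code_lengths(kmer_counts(mrna))
print("mRNA code on mRNA :", bits_per_kmer(code, mrna))
print("mRNA code on ncRNA:", bits_per_kmer(code, ncrna))
```

A code fit on mRNA that compresses mRNA well but ncRNA poorly (and vice versa) is one way to see that the classes follow different sequence statistics.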
heyitsmeianshi.bsky.social
(1) Unsurprisingly, we find that models pre-trained on mRNA sequences perform better on downstream mRNA tasks.

While that makes biological sense, the result might seem counterintuitive at first: since mRNAs arise from the genome, shouldn't genomic models be able to model mRNA?
heyitsmeianshi.bsky.social
On these datasets, we assess the embedding quality of almost all existing nucleotide foundation models, including Evo2, RiNALMo, AIDO.RNA, Orthrus, SpliceBERT, and others.

Using linear probing, we conduct over 100K experiments, revealing several insights:
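For reference, the linear-probing recipe itself is simple: freeze the foundation model, embed each sequence once, and fit a linear model on top. A minimal sketch with placeholder data (illustrative only, not the mRNABench evaluation code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))    # stand-in for frozen per-sequence embeddings
y = rng.integers(0, 2, size=1000)   # stand-in for a binary task label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# The "probe" is just a linear classifier on frozen embeddings; embedding
# quality is read off from held-out performance on the downstream task.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```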
heyitsmeianshi.bsky.social
In contrast to existing benchmarks, 𝐦𝐑𝐍𝐀𝐁𝐞𝐧𝐜𝐡 focuses on mRNA biology, assessing prediction of:

- mRNA stability
- Mean ribosome loading
- mRNA sub-cellular localization
- RNA-Protein interaction
- Pathogenicity of variants