Lightnews — Scholar-powered news

Charlie Pugh

@cwjpugh.bsky.social

250 followers 1.2K following 10 posts

PhD candidate - Machine Learning and Genomics @CRG.eu with @jonnyfrazer.bsky.social and @MafaldaFigDias


Charlie Pugh @cwjpugh.bsky.social · May 26

We also made some improvements with the genomic language model Evo 2, but in this case the interpretation was less clear. See the preprint for more details. Code for using LFB will be made available shortly. 10/10

Charlie Pugh @cwjpugh.bsky.social · May 26

This provides evidence that better fitness estimation can be achieved at negligible computational cost by bridging the gap between likelihood and fitness at inference time. 9/n


Charlie Pugh @cwjpugh.bsky.social · May 26

This trend held across DMS assay types and mutational depths, and also for the prediction of clinical variants. 8/n

We show a scatterplot of ROC-AUCs for each gene, calculated by separating benign- and pathogenic-labelled variants with either the usual or the LFB fitness estimate.
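
As a rough illustration of the per-gene metric described in this alt text, the sketch below computes a ROC-AUC for each gene from variant scores and benign/pathogenic labels. It is not the preprint's evaluation code: the function names, the data layout, and the assumption that benign variants should score higher (be predicted fitter) than pathogenic ones are all illustrative choices.

```python
from typing import Dict, List, Tuple


def roc_auc(benign_scores: List[float], pathogenic_scores: List[float]) -> float:
    """Probability that a random benign variant outscores a random pathogenic one.

    Ties count as half a win. Assumes higher score = predicted fitter.
    """
    wins = 0.0
    for b in benign_scores:
        for p in pathogenic_scores:
            if b > p:
                wins += 1.0
            elif b == p:
                wins += 0.5
    return wins / (len(benign_scores) * len(pathogenic_scores))


def per_gene_auc(variants: Dict[str, List[Tuple[float, str]]]) -> Dict[str, float]:
    """Compute one ROC-AUC per gene from (score, label) pairs."""
    result = {}
    for gene, scored in variants.items():
        benign = [s for s, label in scored if label == "benign"]
        pathogenic = [s for s, label in scored if label == "pathogenic"]
        if benign and pathogenic:  # AUC is undefined without both classes
            result[gene] = roc_auc(benign, pathogenic)
    return result


# Hypothetical toy data: gene names and scores are made up for illustration.
toy_variants = {
    "GENE_A": [(0.1, "benign"), (-0.2, "benign"),
               (-1.0, "pathogenic"), (-0.1, "pathogenic")],
    "GENE_B": [(0.5, "benign"), (-0.5, "pathogenic")],
}
aucs = per_gene_auc(toy_variants)
```

Running this once with standard scores and once with LFB scores would give the two coordinates of each point in a scatterplot like the one described.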


Charlie Pugh @cwjpugh.bsky.social · May 26

On ProteinGym, LFB provided significant improvements across model classes and sizes, and we saw that larger, better-fit models generally gave better predictions.
proteingym.org 7/n

We show a plot of model size vs mean Spearman correlation across the DMS datasets from ProteinGym for the ESM-2 and ProGen2 model families, both with and without LFB estimation.
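
For readers unfamiliar with the metric on the plot's vertical axis, here is a minimal sketch of a mean Spearman correlation across several assays. The dataset names and numbers are invented, and the helper functions are illustrative rather than taken from ProteinGym or the preprint.

```python
from math import sqrt
from statistics import mean
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if an input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman(datasets: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """Average the per-assay Spearman correlation over all assays."""
    return mean(spearman(pred, meas) for pred, meas in datasets.values())


# Hypothetical assays: (model scores, measured fitness).
toy_assays = {
    "assay_1": ([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]),  # perfectly concordant
    "assay_2": ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]),     # perfectly discordant
}
summary = mean_spearman(toy_assays)
```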


Charlie Pugh @cwjpugh.bsky.social · May 26

We found that, under an Ornstein–Uhlenbeck model of evolution, our LFB estimate should have lower variance than the standard estimate because it marginalises out the effect of drift. 6/n
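
The general intuition for why averaging can lower variance can be written down with a generic calculation. This is not the preprint's Ornstein–Uhlenbeck derivation: the symbols below (a true effect $s$, per-background noise $\varepsilon_i$ with variance $\sigma^2$ and a common pairwise correlation $\rho$, and $n$ backgrounds) are assumptions introduced only for illustration.

```latex
\hat{s}_i = s + \varepsilon_i, \qquad
\mathbb{E}[\varepsilon_i] = 0, \quad
\operatorname{Var}(\varepsilon_i) = \sigma^2, \quad
\operatorname{Corr}(\varepsilon_i, \varepsilon_j) = \rho \quad (i \neq j)

\operatorname{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} \hat{s}_i \right)
  = \frac{\sigma^2}{n} \bigl( 1 + (n - 1)\rho \bigr) \;\le\; \sigma^2
```

With uncorrelated noise ($\rho = 0$) the variance falls as $\sigma^2 / n$; the more correlated the backgrounds are, for example through shared ancestry, the smaller the gain.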


Charlie Pugh @cwjpugh.bsky.social · May 26

We tried a simple strategy, likelihood fitness bridging (LFB): averaging predictions over sequences under similar selective pressures to reduce the impact of unwanted, non-fitness-related correlations. 5/n

We show a schematic of the LFB estimate in which, by averaging over predictions for a variant applied to other related sequences, we produce a score which should be closer to the true change in fitness.
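
The averaging idea in this schematic can be sketched in a few lines. This is a minimal illustration, not the authors' implementation (the thread says the LFB code is yet to be released). It assumes the related sequences are ungapped, of equal length and position-aligned, that all backgrounds are weighted equally, and that the model is exposed as a hypothetical `log_likelihood(sequence)` callable; a toy function stands in for a real language model.

```python
from statistics import mean
from typing import Callable, Sequence


def variant_score(log_likelihood: Callable[[str], float],
                  background: str, position: int, mutant_residue: str) -> float:
    """Standard estimate: log-likelihood ratio of mutated vs unmutated background."""
    mutated = background[:position] + mutant_residue + background[position + 1:]
    return log_likelihood(mutated) - log_likelihood(background)


def lfb_score(log_likelihood: Callable[[str], float],
              related_sequences: Sequence[str],
              position: int, mutant_residue: str) -> float:
    """Apply the same variant to each related background and average the scores.

    Assumption: unweighted mean over aligned, equal-length sequences.
    """
    return mean(
        variant_score(log_likelihood, background, position, mutant_residue)
        for background in related_sequences
    )


def toy_log_likelihood(sequence: str) -> float:
    """Stand-in for a real model: counts adjacent identical residues."""
    return float(sum(a == b for a, b in zip(sequence, sequence[1:])))


# Hypothetical query sequence followed by two related sequences.
related = ["AAAA", "AGCA", "CAAC"]
standard = variant_score(toy_log_likelihood, related[0], 1, "C")
bridged = lfb_score(toy_log_likelihood, related, 1, "C")
```

In this toy case the single-background score is -2.0 while the averaged score is about -0.33, showing how much a score can depend on the background it is computed in.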


Charlie Pugh @cwjpugh.bsky.social · May 26

We wondered whether we might be able to improve predictions from existing models without any further training. 4/n


Charlie Pugh @cwjpugh.bsky.social · May 26

Weinstein et al. show that better-fit sequence models can perform worse at fitness estimation due to phylogenetic structure:
openreview.net/forum?id=CwG...
And in practice we are seeing that pLMs don’t improve with lower perplexities:
openreview.net/forum?id=UvP... www.biorxiv.org/content/10.1... 3/n

Non-identifiability and the Blessings of Misspecification in Models...

Misspecification is a blessing, not a curse, when estimating protein fitness from evolutionary sequence data using generative models.

openreview.net


Charlie Pugh @cwjpugh.bsky.social · May 26

Protein language models are showing promise in variant effect prediction, but there’s emerging evidence that likelihood-based zero-shot fitness estimation is breaking down. See this excellent summary from @pascalnotin.bsky.social: pascalnotin.substack.com/p/have-we-hi... 2/n

Have We Hit the Scaling Wall for Protein Language Models?

Beyond Scaling: What Truly Works in Protein Fitness Prediction

pascalnotin.substack.com


Charlie Pugh @cwjpugh.bsky.social · May 26

New preprint in collaboration with @paulinanunezv.bsky.social supervised by @jonnyfrazer.bsky.social and Mafalda Dias – we propose a simple approach to improving zero-shot variant effect prediction in pre-existing protein and genome language models: 🧶 1/n

www.biorxiv.org/content/10.1...

From Likelihood to Fitness: Improving Variant Effect Prediction in Protein and Genome Language Models

Generative models trained on natural sequences are increasingly used to predict the effects of genetic variation, enabling progress in therapeutic design, disease risk prediction, and synthetic biolog...

www.biorxiv.org


Reposted by Charlie Pugh

Isabelle Zane @isabellease.bsky.social · May 22

@cwjpugh.bsky.social at #VariantEffect25


Reposted by Charlie Pugh

Kevin K. Yang 楊凱筌 @kevinkaichuang.bsky.social · Dec 3

Three BioML starter packs now!

Pack 1: go.bsky.app/2VWBcCd
Pack 2: go.bsky.app/Bw84Hmc
Pack 3: go.bsky.app/NAKYUok

DM if you want to be included (or nominate people who should be!)


Reposted by Charlie Pugh

iseultleahy.bsky.social @iseultleahy.bsky.social · Nov 28

Thanks Charlie for opening the PhD Symposium! Many thanks to everyone involved in its organisation. #CRGPhDSymp2024
