Amy Lu
@amyxlu.bsky.social
870 followers 200 following 10 posts
CS PhD Student at UC Berkeley & AI for drug discovery at Prescient Design 🇨🇦
Reposted by Amy Lu
martinsteinegger.bsky.social
Just coincidentally found GenBank Release 84.0 from 1994 in the neighboring lab. Anyone out there with an even older version?
amyxlu.bsky.social
In case you missed our ML for proteins seminar on CHEAP compression for protein embeddings back in October, here it is -- thanks @megthescientist.bsky.social for doing so much for the MLxProteins community 🫶
Reposted by Amy Lu
megthescientist.bsky.social
•introduced “zero shot prediction” as a question of guessing a bioassay’s outcome by likelihoods of pLMs
•commented on biases in evolutionary signals from Tree of life used to train pLMs (a favorite paper I read in 2024: shorturl.at/fbC7g)
amyxlu.bsky.social
Thanks @workshopmlsb.bsky.social for letting us share our work!

🔗📄 bit.ly/plaid-proteins
amyxlu.bsky.social
Another straightforward application is generation, either by next-token sampling or MaskGIT style denoising. We made the tokenized version of CHEAP to do generation, and decided to go with diffusion on continuous embeddings instead — but I think either would’ve worked
Reposted by Amy Lu
kevinkaichuang.bsky.social
We trained a model to co-generate protein sequence and structure by working in the ESMFold latent space, which encodes both. PLAID only requires sequences for training but generates all-atom structures!

Really proud of @amyxlu.bsky.social's effort leading this project end-to-end!
generations from PLAID; the PLAID model architecture; conditional generations from PLAID
amyxlu.bsky.social
immensely grateful for awesome collaborators on this work: Wilson Yan, Sarah Robinson, @kevinkaichuang.bsky.social, Vladimir Gligorijevic, @kyunghyuncho.bsky.social, Rich Bonneau, Pieter Abbeel, @ncfrey.bsky.social 🫶
amyxlu.bsky.social
6/ We'll get to share PLAID as an oral presentation at MLSB next week 🥳 In the meantime, check out:

📄Preprint: biorxiv.org/content/10.1...
👩‍💻Code: github.com/amyxlu/plaid
🏋️Weights: huggingface.co/amyxlu/plaid...
🌐Website: amyxlu.github.io/plaid/
🍦Server: coming soon!
amyxlu.bsky.social
5/🚀 ...and when prompted by function, PLAID learns sequence motifs at active sites & directly outputs sidechain positions, which backbone-only methods such as RFdiffusion can't do out-of-the-box.

The residues aren't directly adjacent, suggesting that the model isn't simply memorizing training data:
conditioning on organism and function shows that PLAID has learned active site residues and sidechain positions!
amyxlu.bsky.social
4/ On unconditional generation, PLAID generates high quality and diverse structures, especially at longer sequence lengths where previous methods underperform...
unconditional generations from PLAID
amyxlu.bsky.social
3/ I was pretty stuck until building out the CHEAP (bit.ly/cheap-proteins) autoencoders that compressed & smoothed out the latent space: interestingly, gradual noise added to the ESMFold latent space doesn't actually corrupt the sequence and structure until the final forward diffusion timesteps 🤔
noising by a diffusion schedule in the latent space doesn't always correspond to the same corruption in the sequence and structure space...
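For readers less familiar with forward diffusion, here is a minimal sketch of what "gradual noise added to the latent space" means numerically. It is not PLAID's code: the cosine schedule, the 1000 timesteps, and the random vector standing in for an ESMFold latent embedding are all assumptions for illustration. It only shows how much of the original latent survives at each timestep; it does not reproduce the observation about the decoded sequence and structure, which needs the actual decoders.

```python
import math
import random


def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction alpha_bar(t) under a cosine schedule (assumed, not from the thread)."""
    def f(u):
        return math.cos(((u / T) + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def forward_noise(x0, t, T, rng):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a) * x0, (1 - a) * I), with a = alpha_bar(t)."""
    a = cosine_alpha_bar(t, T)
    return [math.sqrt(a) * x + math.sqrt(1 - a) * rng.gauss(0, 1) for x in x0]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)
T = 1000  # hypothetical number of diffusion timesteps
x0 = [rng.gauss(0, 1) for _ in range(1024)]  # stand-in for a latent embedding

for t in (0, 250, 500, 750, 1000):
    xt = forward_noise(x0, t, T, rng)
    # Print timestep, remaining signal fraction, and similarity to the clean latent
    print(t, round(cosine_alpha_bar(t, T), 3), round(cosine_similarity(x0, xt), 3))
```

To probe the effect described in the post, one would decode each `xt` with the frozen sequence and structure decoders and compare against the decoding of `x0`, rather than comparing latents directly as done here.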
amyxlu.bsky.social
2/💡Co-generating sequence and structure is hard. A key insight is that to get embeddings of the ESMFold latent space during training, we only need sequence inputs.

For inference, we can sample latent embeddings & use frozen sequence/structure decoders to get all-atom structure:
how does the PLAID approach work?
amyxlu.bsky.social
1/🧬 Excited to share PLAID, our new approach for co-generating sequence and all-atom protein structures by sampling from the latent space of ESMFold. This requires only sequences during training, which unlocks more data and annotations:

bit.ly/plaid-proteins
🧵
overview of results for PLAID!