Maciek Wiatrak
@macwiatrak.bsky.social
PhD student @ Cambridge Centre for AI in Medicine (CCAIM). I do 💻 🧬 💊 and love ⛰️.
macwiatrak.bsky.social
If you'd like to collaborate, get in touch at [email protected], or DM me directly.

🧵 16/n
macwiatrak.bsky.social
Most importantly, huge shoutout to everyone who made this possible! Thanks to them, the project was first and foremost immense fun!

🧵 15/n
macwiatrak.bsky.social
We hope that Bacformer will accelerate microbial discovery, and we are excited about future work!

That’s it! Thanks for sticking with me through this thread!

🧵 14/n
macwiatrak.bsky.social
Finally, we used Bacformer to generate sequences of protein families given a prompt. The generated sequences span essential functions and resemble real genomes. We also used Bacformer to generate sequences for desired traits, such as oxygen requirement.

🧵 13/n
macwiatrak.bsky.social
Accurately predicting phenotypic traits opens up the possibility of discovering the genes that are likely causally associated with specific traits. We used gradient-based attribution to identify the genes involved in sporulation and host adaptation.
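
To make the idea of gradient-based attribution concrete, here is a minimal sketch of one common variant, gradient × input, on a toy model. Everything in it is an assumption for illustration: the thread does not say which attribution method was used, and the toy classifier (mean-pooled gene embeddings fed to a logistic layer), the function names, and the numbers are hypothetical stand-ins, not Bacformer itself.

```python
import math


def predict(gene_embeddings, weights, bias):
    """Toy phenotype classifier (hypothetical): mean-pool gene embeddings,
    then apply a logistic layer. Returns a probability in (0, 1)."""
    n = len(gene_embeddings)
    pooled = [sum(g[d] for g in gene_embeddings) / n for d in range(len(weights))]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def gradient_x_input(gene_embeddings, weights, bias):
    """Per-gene attribution: (d prediction / d embedding) dotted with the embedding.

    For this toy model the gradient is analytic:
    d p / d x_i = p * (1 - p) * w / n for every gene i.
    """
    p = predict(gene_embeddings, weights, bias)
    scale = p * (1.0 - p) / len(gene_embeddings)
    return [scale * sum(w * x for w, x in zip(weights, g)) for g in gene_embeddings]


# Three toy "genes" with 2-dimensional embeddings (made-up values).
toy_genes = [(1.0, 0.0), (0.0, 1.0), (0.1, 0.1)]
toy_weights = (2.0, -1.0)
scores = gradient_x_input(toy_genes, toy_weights, bias=0.0)
top_gene = scores.index(max(scores))
```

Genes with large positive scores push the prediction towards the trait, and genes with negative scores push against it. With a deep model the gradient would come from automatic differentiation rather than a closed-form expression.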

🧵 12/n
macwiatrak.bsky.social
Bacformer can also predict diverse phenotypic traits! We trained Bacformer to predict 139 phenotypes from the genome alone.

We then used the high-performing phenotypes to annotate our corpus of >1.3M genomes with 32 diverse phenotypic traits.

🧵 11/n
macwiatrak.bsky.social
By leveraging the whole-genome context, Bacformer boosts performance on gene essentiality and protein function annotation tasks.

To us, this makes sense, as a gene's essentiality and function are often tied to its genomic neighborhood in bacteria.

🧵 10/n
macwiatrak.bsky.social
It can also be used to predict the protein-protein interactions across diverse bacteria.

To do this, we fine-tuned the model on STRING DB and used it to predict the interactome of P. aeruginosa, with the top-scoring pairs showing high-confidence interfaces based on AF3.

🧵 9/n
macwiatrak.bsky.social
It also nails operon detection! We validated our predictions with long-read RNA sequencing, showing how Bacformer can be used for operon identification even in a zero-shot setup.

🧵 8/n
macwiatrak.bsky.social
Bacformer can be used for a range of bacterial genomics tasks, either zero-shot or fine-tuned.

We tested whether Bacformer can uncover evolutionary relationships by checking whether its genome embeddings can be used for clustering without any "species" token.

🧵 7/n
macwiatrak.bsky.social
If each token is a protein, how do we do pretraining if the space of possible proteins is effectively unbounded? We leverage the similarities between proteins and create a discrete vocabulary of 50,000 “protein family” clusters using protein embeddings from a pLM.
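
As a rough illustration of how a continuous protein embedding can become a discrete token, the sketch below assigns each embedding to its nearest cluster centroid. It rests on assumptions: the thread does not name the clustering algorithm or the distance metric, so the centroids are taken as given (for example, from k-means), Euclidean distance is used, and the function names and toy values are hypothetical. The real vocabulary has 50,000 clusters over high-dimensional pLM embeddings.

```python
import math


def nearest_cluster(embedding, centroids):
    """Return the index of the centroid closest to `embedding` (Euclidean distance)."""
    distances = [math.dist(embedding, c) for c in centroids]
    return distances.index(min(distances))


def tokenize_genome(protein_embeddings, centroids):
    """Map each protein embedding to a discrete 'protein family' token ID."""
    return [nearest_cluster(e, centroids) for e in protein_embeddings]


# Toy example: 3 centroids and 3 proteins in a 2-dimensional embedding space.
toy_centroids = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
toy_embeddings = [(0.1, -0.2), (4.8, 5.1), (0.9, 1.2)]
tokens = tokenize_genome(toy_embeddings, toy_centroids)
```

Once every protein has a token ID, the vocabulary is finite, so a model can be trained with a standard classification loss over the cluster IDs.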

🧵 6/n
macwiatrak.bsky.social
To capture patterns across diverse bacteria, we pretrained Bacformer on a curated corpus of over 1.3M metagenomes spanning over 25,000 species across diverse environments and containing almost 3B proteins.

🧵 5/n
macwiatrak.bsky.social
Bacformer represents each genome as a sequence of proteins, ordered by their location on the genome, providing a unified representation across bacterial species, and allowing us to learn evolutionary patterns across all bacteria, rather than a single species.
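
The representation above can be sketched in a few lines: take the annotated proteins of a genome and order them by where they sit. This is only an illustration under stated assumptions. The `Protein` fields, the function name, and the choice to sort by contig name and then start coordinate are hypothetical, and Bacformer's actual preprocessing may handle contigs and strands differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    protein_id: str  # hypothetical identifier for illustration
    contig: str      # contig or chromosome the gene sits on
    start: int       # start coordinate of the coding sequence


def genome_as_protein_sequence(proteins):
    """Return protein IDs ordered by contig, then by start coordinate."""
    ordered = sorted(proteins, key=lambda p: (p.contig, p.start))
    return [p.protein_id for p in ordered]


# Toy genome with proteins listed out of order.
toy_genome = [
    Protein("recA", "contig_1", 5200),
    Protein("dnaA", "contig_1", 100),
    Protein("gyrB", "contig_1", 2300),
    Protein("repA", "contig_2", 50),
]
ordered_ids = genome_as_protein_sequence(toy_genome)
```

Because every genome reduces to the same kind of ordered list, genomes from different species can be fed to one model in the same format.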

🧵 4/n
macwiatrak.bsky.social
Why make ML models for bacteria 🦠? 1️⃣ Bacteria shape every ecosystem, and our own health. 2️⃣ They are easier to model than mammalian cells: their genomes are small, compact and mostly coding; they have no real epigenome. 3️⃣ We have a lot of data (thanks to metagenomics)!

🧵 3/n
macwiatrak.bsky.social
💥 Excited to introduce Bacformer 🦠 - the first foundation model for bacterial genomics. Bacformer represents genomes as sequences of ordered proteins, learning the “grammar” of how genes are arranged, interact and evolve.

Preprint 📝: biorxiv.org/content/10.1...

🧵 1/n