Anthony Gitter
@anthonygitter.bsky.social
69 followers 30 following 25 posts
Computational biologist; Associate Prof. at University of Wisconsin-Madison; Jeanne M. Rowe Chair at Morgridge Institute
anthonygitter.bsky.social
MPAC uses PARADIGM as the probabilistic model but makes many improvements:
- data-driven omic data discretization
- permutation testing to eliminate spurious predictions
- full workflow and downstream analyses in an R package
- Shiny app for interactive visualization
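The permutation-testing idea above can be sketched in a few lines. This is a minimal illustration, not MPAC's implementation (MPAC is an R package); the function name, the quantile cutoff, and the toy data are all assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def filter_ipls(real_ipls, permuted_ipls, alpha=0.05):
    """Keep real inferred pathway levels (IPLs) whose magnitude exceeds the
    (1 - alpha) quantile of a permutation null; zero out the rest as spurious."""
    cutoff = np.quantile(np.abs(permuted_ipls), 1 - alpha, axis=0)
    return np.where(np.abs(real_ipls) > cutoff, real_ipls, 0.0)

# toy example: one sample with 5 pathway entities, 100 permuted datasets
real = np.array([2.5, 0.1, -1.8, 0.05, 0.3])
perm = rng.normal(0, 0.5, size=(100, 5))
print(filter_ipls(real, perm))
```

The large-magnitude IPLs (2.5, -1.8) survive the permutation null while the small ones near zero are filtered out.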
anthonygitter.bsky.social
The journal version of our Multi-omic Pathway Analysis of Cells (MPAC) software is now out: doi.org/10.1093/bioi...

MPAC uses biological pathway graphs to model DNA copy number and gene expression changes and infer activity states of all pathway members.
Overview of the MPAC workflow. MPAC calculates inferred pathway levels (IPLs) from real and permuted CNA and RNA data. It filters real IPLs using the permuted IPLs to remove spurious IPLs. Then, MPAC focuses on the largest pathway subset network with filtered IPLs to compute GO term enrichment, predict patient groups, and identify key group-specific proteins.
Reposted by Anthony Gitter
lindorfflarsen.bsky.social
Does anyone know whether there's a functioning API to ESMfold?

(api.esmatlas.com/foldSequence... gives me Service Temporarily Unavailable)
anthonygitter.bsky.social
We can use METL for low-N protein design. We trained METL on Rosetta simulations of GFP biophysical attributes and only 64 experimental examples of GFP brightness. It designed fluorescent 5- and 10-mutant variants, including some whose mutations were entirely absent from the training set. 7/
Fig. 6: Low-N GFP design.
anthonygitter.bsky.social
A powerful aspect of pretraining on biophysical simulations is that the simulations can be customized to match the protein function and experimental assay. Our expanded simulations of the GB1-IgG complex with Rosetta InterfaceAnalyzer improve METL predictions of GB1 binding. 6/
Fig. 5: Function-specific simulations improve METL pretraining for GB1.
anthonygitter.bsky.social
We also benchmark METL on four types of difficult extrapolation. For instance, positional extrapolation provides training data from some sequence positions and tests predictions at different sequence positions. Linear regression completely fails in this setting. 5/
Fig. 3: Comparative performance across extrapolation tasks.
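The positional extrapolation setup described above can be made concrete with a toy split. This is an illustrative sketch under assumed data structures (mutation lists as `(position, amino_acid)` pairs), not the paper's benchmark code:

```python
def positional_split(variants, train_positions):
    """Split variants so training variants mutate only positions in
    `train_positions` and test variants mutate only positions outside it."""
    train, test = [], []
    for muts, score in variants:
        positions = {p for p, _ in muts}
        if positions <= train_positions:
            train.append((muts, score))
        elif positions.isdisjoint(train_positions):
            test.append((muts, score))
        # variants mixing seen and unseen positions are dropped
    return train, test

# toy deep mutational scanning data: ([(position, new_aa), ...], fitness)
variants = [
    ([(3, "A")], 0.8),
    ([(3, "V"), (7, "G")], 0.5),
    ([(12, "L")], 0.2),
    ([(3, "A"), (12, "L")], 0.4),
]
train, test = positional_split(variants, train_positions={3, 7})
print(len(train), len(test))  # → 2 1
```

A model trained on `train` never sees any mutation at position 12, so scoring the test variant requires extrapolating to an unseen position, which is where linear regression breaks down.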
anthonygitter.bsky.social
We compare these approaches on deep mutational scanning datasets with increasing training set sizes. Biophysical pretraining helps METL generalize well with small training sets. However, augmented linear regression with EVE scores is great on some of these assays. 4/
Fig. 2: Comparative performance of Linear, Rosetta total score, EVE, RaSP, Linear-EVE, ESM-2, ProteinNPT, METL-Global and METL-Local across different training set sizes.
anthonygitter.bsky.social
METL models pretrained on Rosetta biophysical attributes learn different protein representations than general protein language models like ESM-2 or protein family-specific models like EVE. These new representations are valuable for machine learning-guided protein engineering. 3/
anthonygitter.bsky.social
Most protein language models train on natural protein sequence data and use the underlying evolutionary signals to score sequence variants. Instead, METL trains on @rosettacommons.bsky.social data, learning from simulated biophysical attributes of the sequence variants we select. 2/
anthonygitter.bsky.social
The journal version of our paper 'Chemical Language Model Linker: Blending Text and Molecules with Modular Adapters' is out doi.org/10.1021/acs....

ChemLML is a method for text-based conditional molecule generation that uses pretrained text models like SciBERT, Galactica, or T5.
Chemical Language Model Linker: Blending Text and Molecules with Modular Adapters
The development of large language models and multimodal models has enabled the appealing idea of generating novel molecules from text descriptions. Generative modeling would shift the paradigm from relying on large-scale chemical screening to find molecules with desired properties to directly generating those molecules. However, multimodal models combining text and molecules are often trained from scratch, without leveraging existing high-quality pretrained models. Training from scratch consumes more computational resources and prohibits model scaling. In contrast, we propose a lightweight adapter-based strategy named Chemical Language Model Linker (ChemLML). ChemLML blends the two single domain models and obtains conditional molecular generation from text descriptions while still operating in the specialized embedding spaces of the molecular domain. ChemLML can tailor diverse pretrained text models for molecule generation by training relatively few adapter parameters. We find that the choice of molecular representation used within ChemLML, SMILES versus SELFIES, has a strong influence on conditional molecular generation performance. SMILES is often preferable despite not guaranteeing valid molecules. We raise issues in using the entire PubChem data set of molecules and their associated descriptions for evaluating molecule generation and provide a filtered version of the data set as a generation test set. To demonstrate how ChemLML could be used in practice, we generate candidate protein inhibitors and use docking to assess their quality and also generate candidate membrane permeable molecules.
Reposted by Anthony Gitter
sarahgurev.bsky.social
🚨New paper 🚨

Can protein language models help us fight viral outbreaks? Not yet. Here’s why 🧵👇
1/12
anthonygitter.bsky.social
There are many more results and controls in the paper. Here's how the best (most negative) docking scores change when we use relevant assays, irrelevant assays, or no assays as context for generation with GPT-4o. In the majority of cases, but not all, relevant context helps. 6/
Distribution of the top 10 docking scores from molecules with high- and low-relevance BioAssays as context for different proteins.
anthonygitter.bsky.social
This generally has the desired effects across multiple LLMs and queried protein targets, with the caveat that our core results are based on AutoDock Vina scores. Assessing generated molecules with docking is admittedly frustrating. 5/
anthonygitter.bsky.social
We embed the BioAssay data into a vector database, retrieve initial candidate assays, and do further LLM-based filtering and summarization. We select some active and inactive molecules from the BioAssay data table. This is all used for in-context learning and molecule generation. 4/
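The retrieval step above can be sketched with a toy similarity search. This is a minimal stand-in, not the Assay2Mol code: it uses a bag-of-words embedding and cosine similarity where the real system would use a neural text encoder over a pre-built vector database, and the assay records are hypothetical:

```python
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words embedding (stand-in for a neural text encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, assays, k=2):
    """Rank assay descriptions by similarity to the target description."""
    q = embed(query)
    ranked = sorted(assays, key=lambda a: cosine(q, embed(a["description"])),
                    reverse=True)
    return ranked[:k]

# hypothetical records standing in for PubChem BioAssay descriptions
assays = [
    {"id": 1, "description": "AlphaScreen assay for SSB-PriA antibiotic target"},
    {"id": 2, "description": "kinase inhibition fluorescence assay"},
    {"id": 3, "description": "cell viability counter screen"},
]
hits = retrieve("SSB-PriA antibiotic resistance target assay", assays, k=1)
print(hits[0]["id"])  # → 1
```

The top-ranked assay's description, protocol summary, and active/inactive molecules would then be placed in the generation prompt.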
anthonygitter.bsky.social
PubChem BioAssays can contain a lot of information about why and how an assay was run. Here's an example from our collaborators. pubchem.ncbi.nlm.nih.gov/bioassay/127...

There are now 1.7M PubChem BioAssays ranging in scale from a few tested molecules to high-throughput screens. 2/
SSB-PriA antibiotic resistant target AlphaScreen
anthonygitter.bsky.social
Our Assay2Mol preprint introduces using PubChem chemical screening data as context when generating molecules with large language models. It uses assay descriptions and protocols to find relevant assays, then provides that text plus active/inactive molecules as context for generation. 1/
The Assay2Mol workflow. A chemist provides a target description, which is used to retrieve BioAssays from the pre-embedded vector database. After filtering for relevance, the BioAssays are summarized by an LLM. The BioAssay ID is then used to retrieve experimental tables. The final molecule generation prompt is formed by combining the description, summarization, and selected test molecules with associated test outcomes, enabling the LLM to generate relevant active molecules.
Reposted by Anthony Gitter
delalamo.xyz
Nobody is commenting on this little nugget from Fig 1?