Jacob Schreiber
@jmschreiber91.bsky.social
Studying genomics, machine learning, and fruit. My code is like our genomes -- most of it is junk. Assistant Professor UMass Chan, Board of Directors NumFOCUS Previously IMP Vienna, Stanford Genetics, UW CSE.
Pinned
jmschreiber91.bsky.social
As a field, I believe we must move towards COMIC SANS logo plots immediately.

nbviewer.org/github/saket...
jmschreiber91.bsky.social
I figure I can spend the time until I get tenure on answering the question, and then the time after tenure arguing about what "regulatory" means
jmschreiber91.bsky.social
If you're interested, please reach out with your CV and which topics you'd be interested in working on!
jmschreiber91.bsky.social
- Genomics Software Ecosystem: A major obstacle to our goal is the lack of simple+scalable software that everyone can use. Come build this with me. Training a lightweight deep learning model and using it for design/interpretability/VE prediction should be no more challenging than mapping reads.
jmschreiber91.bsky.social
- Foundation Models: As someone involved in ML, I am legally required to be working on this topic.
jmschreiber91.bsky.social
We have an array of ML-based projects for going after this, focusing on the following topics:

- DNA Design ( 🧬 ) We have shown that Ledidi (www.biorxiv.org/content/10.1...) can precisely design DNA, and now it's time to push the boundaries in several directions w/ some very cool collaborations.
Programmatic design and editing of cis-regulatory elements
The development of modern genome editing tools has enabled researchers to make such edits with high precision but has left unsolved the problem of designing these edits. As a solution, we propose Ledi...
www.biorxiv.org
jmschreiber91.bsky.social
Now that I'm settled in at @umasschan.bsky.social, I'm hiring at all levels: grad students, post-docs, and software engineers/bioinformaticians!

The goal of my lab is to understand the regulatory role of every nucleotide in our genomes and how this changes across every cell in our bodies.
jmschreiber91.bsky.social
It was suggested that the audience may not appreciate/understand :(
jmschreiber91.bsky.social
is it a good idea to wear a "join, or die!" hat to a big talk in europe? please say yes
jmschreiber91.bsky.social
the greatest productivity hack is having a grant deadline. there's so much other stuff you can do when you're supposed to be working on a grant.
jmschreiber91.bsky.social
I was delighted to have the unexpected opportunity to give a keynote at MLCB 2025 in NYC last week. I used it to explain how I view deep learning models in genomics not as "uninterpretable black boxes" but as indispensable tools for understanding genomics + designing the next gen of synthetic DNA.
jmschreiber91.bsky.social
for some reason i thought being a professor would involve more mentoring and research and less filling out disclosures concerning whether plants and seeds were used in my computational study
jmschreiber91.bsky.social
stocking up the new apartment with essentials
jmschreiber91.bsky.social
In the genomics community, we have focused pretty heavily on achieving state-of-the-art predictive performance.

While undoubtedly important, how we *use* these models after training is potentially even more important.

tangermeme v1.0.0 is out now. Hope you find it useful!
jmschreiber91.bsky.social
For some reason, hitting "comment" on GitHub is significantly more responsive than a month ago and it freaks me out. Surely there are some important calculations that need to be done before letting my thoughts into the wild?
jmschreiber91.bsky.social
Thanks! Let me know if you want me to stop in virtually, we can try to figure out a time.
jmschreiber91.bsky.social
Hope you find tangermeme helpful in your work! Please reach out if you have any comments + questions.
jmschreiber91.bsky.social
Because everything is automatic, we can probe models.

What motifs are driving model predictions? Calculate attributions, call + annotate seqlets, and count the annotations!

BPNet is relying on MYC, whereas Beluga is relying on many more TFs. Easy comparison now.
jmschreiber91.bsky.social
Frequently, people manually annotate seqlets and draw bars or boxes around these high-attribution characters themselves. This is not really a problem, but it's just slow and does not scale genome-wide.

In the above picture, everything is automatically done.
jmschreiber91.bsky.social
People *talk* about seqlets a lot, but tangermeme is the first package to offer complete functionality for working with them.

Here is a complete example of using tangermeme for attributions, seqlet calling + annotation, and plotting, to visualize what five models think of the same locus.
jmschreiber91.bsky.social
Expanding past these implementations, tangermeme has a large focus on automatic seqlet calling and usage. Seqlets are short contiguous spans of high-attribution characters that usually correspond to the binding of a TF.
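
To make "short contiguous spans of high-attribution characters" concrete, here is a minimal sketch of the idea in plain Python. It is not tangermeme's seqlet caller: the function name `call_seqlets`, the fixed `threshold`, and the `min_length` cutoff are all assumptions made up for illustration, and real callers use more careful statistics than a hard threshold.

```python
def call_seqlets(attributions, threshold, min_length=3):
    """Return half-open (start, end) spans where every score >= threshold.

    Hypothetical helper for illustration only; not tangermeme's API.
    """
    seqlets = []
    start = None
    for position, score in enumerate(attributions):
        if score >= threshold:
            # Open a new span if we are not already inside one.
            if start is None:
                start = position
        elif start is not None:
            # The span just ended; keep it only if it is long enough.
            if position - start >= min_length:
                seqlets.append((start, position))
            start = None
    # Close a span that runs to the end of the sequence.
    if start is not None and len(attributions) - start >= min_length:
        seqlets.append((start, len(attributions)))
    return seqlets


# Made-up per-position attribution scores for a 10 bp toy sequence.
scores = [0.0, 0.1, 0.9, 0.8, 0.7, 0.1, 0.0, 0.6, 0.9, 0.0]
spans = call_seqlets(scores, threshold=0.5)
```

In this toy input, the run at positions 2 to 4 is kept, while the two-position run at 7 to 8 is dropped for being shorter than `min_length`.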
jmschreiber91.bsky.social
By considering attributions you can see how variants disrupt or change usage of motifs. Maybe you'll even find that a variant causes alternative binding by inducing a new motif or slightly changing competition! That would be challenging to see from the predictions alone.
jmschreiber91.bsky.social
Past simply re-implementing algorithms people use (in a convenient repo), tangermeme offers flexibility not usually offered in other implementations.

As an example, instead of calculating variant effect as predictions before/after a substitution, why not look at attributions?
jmschreiber91.bsky.social
This care extends to each of our operations. For example, one-hot encoding the entirety of chr1 takes <2s on a single thread. This is significantly faster than other one-hot encoding methods out there, and is fast enough to enable real-time batch generation from FASTAs.
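
For anyone unfamiliar with the operation itself, here is a rough sketch of what one-hot encoding a DNA string means. This is a naive pure-Python version written for clarity, not tangermeme's implementation, and it will be nowhere near the speed quoted above. The function name `one_hot_encode`, the `ACGT` row order, and the choice to leave unknown characters such as `N` as all-zero columns are assumptions for this example.

```python
def one_hot_encode(sequence, alphabet="ACGT"):
    """Return a len(alphabet) x len(sequence) matrix as a list of lists.

    Hypothetical helper for illustration only; not tangermeme's API.
    """
    index = {char: row for row, char in enumerate(alphabet)}
    encoded = [[0] * len(sequence) for _ in alphabet]
    for position, char in enumerate(sequence.upper()):
        row = index.get(char)
        # Characters outside the alphabet (e.g. N) stay as all-zero columns.
        if row is not None:
            encoded[row][position] = 1
    return encoded


# Rows are A, C, G, T; columns are positions in the sequence.
example = one_hot_encode("ACGTN")
```

Each column has a single 1 in the row of the observed nucleotide, except the final `N` column, which is all zeros under the assumption above.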
jmschreiber91.bsky.social
By focusing in this manner, we can "delve" deeply into these downstream algorithms. For instance, we found a bug in many DeepLIFT/SHAP implementations that will cause them to silently fail when you don't register your operations. Didn't know you needed to do that? Same!