Jacob Schreiber

@jmschreiber91.bsky.social

6.6K followers 1.4K following 810 posts

Studying genomics, machine learning, and fruit. My code is like our genomes -- most of it is junk. Assistant Professor UMass Chan, Board of Directors NumFOCUS Previously IMP Vienna, Stanford Genetics, UW CSE.

Pinned

Jacob Schreiber @jmschreiber91.bsky.social · Mar 14

As a field, I believe we must move towards COMIC SANS logo plots immediately.

nbviewer.org/github/saket...

Jacob Schreiber @jmschreiber91.bsky.social · 1d

I figure I can spend the time until I get tenure on answering the question, and then the time after tenure arguing about what "regulatory" means

Jacob Schreiber @jmschreiber91.bsky.social · 1d

If you're interested, please reach out with your CV and which topics you'd be interested in working on!

Jacob Schreiber @jmschreiber91.bsky.social · 1d

- Genomics Software Ecosystem: A major obstacle to our goal is the lack of simple+scalable software that everyone can use. Come build this with me. Training a lightweight deep learning model and using it for design/interpretability/VE prediction should be no more challenging than mapping reads.

Jacob Schreiber @jmschreiber91.bsky.social · 1d

- Foundation Models: As someone involved in ML, I am legally required to be working on this topic.

Jacob Schreiber @jmschreiber91.bsky.social · 1d

We have an array of ML-based projects for going after this, focusing on the following topics:

- DNA Design ( 🧬 ) We have shown that Ledidi (www.biorxiv.org/content/10.1...) can precisely design DNA, and now it's time to push the boundaries in several directions w/ some very cool collaborations.

Programmatic design and editing of cis-regulatory elements

The development of modern genome editing tools has enabled researchers to make such edits with high precision but has left unsolved the problem of designing these edits. As a solution, we propose Ledi...

Jacob Schreiber @jmschreiber91.bsky.social · 1d

Now that I'm settled in at @umasschan.bsky.social, I'm hiring at all levels: grad students, post-docs, and software engineers/bioinformaticians!

The goal of my lab is to understand the regulatory role of every nucleotide in our genomes and how this changes across every cell in our bodies.

Jacob Schreiber @jmschreiber91.bsky.social · 6d

It was suggested that the audience may not appreciate/understand :(

Jacob Schreiber @jmschreiber91.bsky.social · 8d

is it a good idea to wear a "join, or die!" hat to a big talk in europe? please say yes

Jacob Schreiber @jmschreiber91.bsky.social · 16d

the greatest productivity hack is having a grant deadline. there's so much other stuff you can do when you're supposed to be working on a grant.

Jacob Schreiber @jmschreiber91.bsky.social · 19d

I was delighted to have the unexpected opportunity to give a keynote at MLCB 2025 in NYC last week. I used it to explain how I view deep learning models in genomics not as "uninterpretable black boxes" but as indispensable tools for understanding genomics + designing the next gen of synthetic DNA.

Jacob Schreiber @jmschreiber91.bsky.social · 29d

for some reason i thought being a professor would involve more mentoring and research and less filling out disclosures concerning whether plants and seeds were used in my computational study

Jacob Schreiber @jmschreiber91.bsky.social · Sep 3

stocking up the new apartment with essentials

Reposted by Jacob Schreiber

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

In the genomics community, we have focused pretty heavily on achieving state-of-the-art predictive performance.

While undoubtedly important, how we *use* these models after training is potentially even more important.

tangermeme v1.0.0 is out now. Hope you find it useful!

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

For some reason, hitting "comment" on GitHub is significantly more responsive than a month ago and it freaks me out. Surely there are some important calculations that need to be done before letting my thoughts into the wild?

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Thanks! Let me know if you want me to stop in virtually, we can try to figure out a time.

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Hope you find tangermeme helpful in your work! Please reach out if you have any comments + questions.

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Because everything is automatic, we can probe models.

What motifs are driving model predictions? Calculate attributions, call + annotate seqlets, and count the annotations!

BPNet is relying on MYC, whereas Beluga is relying on many more TFs. Easy comparison now.

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Frequently, people manually annotate seqlets and draw bars or boxes around these high-attribution characters themselves. This is not really a problem, but it's just slow and does not scale genome-wide.

In the above picture, everything is automatically done.

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

People *talk* about seqlets a lot, but tangermeme is the first package with complete functionality for working with them.

Here is a complete example of using tangermeme for attributions, seqlet calling + annotation, and plotting, to visualize what five models think of the same locus

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Expanding past these implementations, tangermeme has a large focus on automatic seqlet calling and usage. Seqlets are short contiguous spans of high-attribution characters that usually correspond to the binding of a TF.
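
To make the definition concrete, here is a minimal sketch of finding contiguous high-attribution spans with a fixed threshold. This is an illustration only and not the procedure tangermeme uses; the function name `call_seqlets`, the `threshold` and `min_length` parameters, and the example scores are all hypothetical.

```python
def call_seqlets(scores, threshold, min_length=3):
    """Return half-open (start, end) spans where every score is >= threshold.

    Simple thresholding for illustration; real seqlet callers use
    statistical procedures rather than a fixed cutoff.
    """
    seqlets = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            # Open a new span if we are not already inside one.
            if start is None:
                start = i
        elif start is not None:
            # The span just ended; keep it only if it is long enough.
            if i - start >= min_length:
                seqlets.append((start, i))
            start = None
    # Close a span that runs to the end of the sequence.
    if start is not None and len(scores) - start >= min_length:
        seqlets.append((start, len(scores)))
    return seqlets


# Hypothetical per-position attribution scores for an 11 bp sequence.
scores = [0.0, 0.1, 0.9, 0.8, 0.7, 0.0, 0.6, 0.1, 0.5, 0.9, 0.8]
print(call_seqlets(scores, threshold=0.5))  # [(2, 5), (8, 11)]
```

The single high-scoring position at index 6 is dropped because it is shorter than `min_length`, which is the "short contiguous span" part of the definition.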

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

By considering attributions you can see how variants disrupt or change usage of motifs. Maybe you'll even find that a variant causes alternative binding by inducing a new motif or slightly changing competition! That would be challenging to see from the predictions alone.

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Past simply re-implementing algorithms people use (in a convenient repo), tangermeme offers flexibility not usually offered in other implementations.

As an example, instead of calculating variant effect as predictions before/after a substitution, why not look at attributions?

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

This care extends to each of our operations. For example, one-hot encoding the entirety of chr1 takes <2s on a single thread. This is significantly faster than other one-hot encoding methods out there, and is fast enough to enable real-time batch generation from FASTAs.
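
For readers unfamiliar with the operation, below is a minimal NumPy sketch of one-hot encoding a DNA string using a byte lookup table. It is not tangermeme's implementation and makes no claim about matching its speed; the function name `one_hot_encode`, the `ACGT` row order, and the choice to encode unknown characters such as `N` as all-zero columns are assumptions made for illustration.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed row order of the encoding


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (4, len(sequence)) uint8 array.

    Characters outside ACGT (e.g. N) become all-zero columns.
    """
    # Map every possible byte value to a row index, or -1 if unknown.
    lookup = np.full(256, -1, dtype=np.int8)
    for i, char in enumerate(ALPHABET):
        lookup[ord(char)] = i
        lookup[ord(char.lower())] = i

    # Vectorized conversion from characters to row indices.
    codes = lookup[np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)]

    encoded = np.zeros((len(ALPHABET), len(sequence)), dtype=np.uint8)
    known = codes >= 0
    # Set a single 1 per known position; unknown positions stay zero.
    encoded[codes[known], np.nonzero(known)[0]] = 1
    return encoded


x = one_hot_encode("ACGTN")
print(x.shape)  # (4, 5)
```

Replacing a per-character Python loop with a single table lookup over the raw bytes is one common way to keep this kind of operation fast on long sequences.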

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

Here is a (twitter) thread on the issue:

x.com/jmschreiber9...

Using Captum to interpret your @PyTorch models using DeepLift/DeepLiftShap? If you specify your activations incorrectly, you will silently get incorrect attributions. In this genomics example, the TTTGCAT.ACAAT motif is the important thing and is entirely missed. https://t.co/MuVsAO5isz

Jacob Schreiber @jmschreiber91.bsky.social · Aug 27

By focusing in this manner, we can "delve" deeply into these downstream algorithms. For instance, we found a bug in many DeepLIFT/SHAP implementations that will cause them to silently fail when you don't register your operations. Didn't know you needed to do that? Same!
