Heng Li
@lh3lh3.bsky.social
Associate Professor DFCI & HMS
Back online. Not sure if it is a bug in my code or a hiccup at the hosting service.
Did you know that ~60% of human SVs fall in ~1% of GRCh38? See our new preprint: arxiv.org/abs/2509.23057 and the companion blog post on how we started this project and longdust: lh3.github.io/2025/09/29/o.... Joint work with Alvin Qin.
arXiv accepted our assembly review two years ago. That was written in MS Word, so PDF-only. As I remember, they didn't require TeX source at that time. Something may have changed internally.
And learn what fully AI-generated websites look like. Avoid them, as they are more likely to be scams.
Reposted by Heng Li
jbonfield.bsky.social
Heads up: ignore samtools dot org, similarly minimap2 dot com and likely others. It's owned by a known phishing site and while the binaries they offer look valid currently (but note they may be serving us different binaries to others), that could change.

I.e., it's not us (Samtools team)! Be warned.
New blog post – A quick look at Roche's SBX
lh3.github.io/2025/09/11/a...
Reposted by Heng Li
jimshaw.bsky.social
Preprint out for myloasm, our new nanopore / HiFi metagenome assembler!

Nanopore's getting accurate, but

1. Can this lead to better metagenome assemblies?
2. How, algorithmically, to leverage them?

with co-author Max Marin @mgmarin.bsky.social, supervised by Heng Li @lh3lh3.bsky.social

1 / N
biorxiv-bioinfo.bsky.social
High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1
Of course, thanks also to Andrea Guarracino and Andrew Carroll for their quick and careful review!
"Received: July 4, 2025. Revised: August 7, 2025. Accepted: August 15, 2025" and published on September 4. This is a simple and straightforward paper, but the speedy editorial process is still impressive. It could have been even faster if I had responded to the initial editorial request more promptly.
Now published in GigaScience with minor improvements: academic.oup.com/gigascience/...

* Download: zenodo.org/records/1490...
* More info: github.com/lh3/panmask
Preprint on "Finding easy regions for short-read variant calling from pangenome data": arxiv.org/abs/2507.03718
Reposted by Heng Li
tommytang.bsky.social
(Harvard STAT115): Introduction to Bioinformatics and Computational Biology by Shirley Liu.
liulab-dfci.github.io/bioinfo-com...
Timing and practical needs are key factors. BCF also has a proper spec and an okay library and it is based on bgzf. The library is more complex to use because VCF is more complex.
CRAM also ticks some of the points, but the storage cost alone wins users over.
BAM is more of a literal dump. BCF is not used often because 1) VCF data is a small fraction of SAM data, so performance is not as critical; 2) Tabix is good enough; 3) it is too complex to implement, as VCF was not designed with binary in mind; 4) it came too late: GATK only started to support it in 2019; 5) the binary version changes.
I think genomic file formats should be text-first and designed with binary representations in mind
Pangolin only supports SNPs and doesn't distinguish donor/acceptor, but mutating cTaAt has little effect there, too. I don't have AlphaGenome numbers. Overall, this comes back to @ewanbirney.bsky.social's question: Is BP essential to splicing? Do these models really see BP?
chr3:143021319:cTaAt matches the yTnAy BP consensus at the 1-based coordinate. The Broad's SpliceAI server doesn't think chr3:143021320:TaA>CaG would lead to an acceptor loss – it's non-essential to splicing. SpliceAI thinks donor might be more affected.
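To make the consensus match above concrete, here is a minimal sketch of checking a 5-mer against yTnAy. It assumes the usual IUPAC reading (y = C or T, n = any base) and treats the uppercase T and A as fixed positions; the function names are hypothetical and this is not how SpliceAI or Pangolin score anything.

```python
import re

# IUPAC codes needed for the yTnAy pattern (assumption: standard IUPAC meaning).
# y = pyrimidine (C or T), n = any base. Uppercase letters are fixed bases.
IUPAC = {"y": "[CT]", "n": "[ACGT]"}


def consensus_to_regex(consensus: str) -> re.Pattern:
    """Turn a consensus such as 'yTnAy' into a compiled regular expression."""
    parts = []
    for ch in consensus:
        if ch in IUPAC:
            parts.append(IUPAC[ch])   # degenerate position
        else:
            parts.append(ch.upper())  # fixed base, e.g. the T and A of the BP motif
    return re.compile("".join(parts))


def matches_bp_consensus(kmer: str, consensus: str = "yTnAy") -> bool:
    """Return True if the whole k-mer matches the consensus, ignoring case."""
    return consensus_to_regex(consensus).fullmatch(kmer.upper()) is not None


if __name__ == "__main__":
    print(matches_bp_consensus("cTaAt"))  # reference 5-mer from the post
    print(matches_bp_consensus("cCaGt"))  # after the TaA>CaG substitution
```

Under these assumptions the reference 5-mer matches and the TaA>CaG variant does not, which is why that substitution is a natural test of whether a model sees the branch point.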
Thanks for the explanation! This is a good example then. It will be interesting to see if Pangolin or SpliceAI can capture these with ISM. Another question is how often BPs are found across all acceptor sites.
IMHO, it is important to understand what a model learns; otherwise, the model might just capture nuances irrelevant to biology. 4/