Lightnews — Scholar-powered news

Heng Li @lh3lh3.bsky.social · 7d

Back online. Not sure if it is a bug in my code or a hiccup at the hosting service.

Heng Li @lh3lh3.bsky.social · 8d

Do you know ~60% of human SVs fall in ~1% of GRCh38? See our new preprint: arxiv.org/abs/2509.23057 and the companion blog post on how we started this project and longdust: lh3.github.io/2025/09/29/o.... Work with Alvin Qin


Heng Li @lh3lh3.bsky.social · 9d

arXiv accepted our assembly review two years ago. That was written in MS Word, so PDF-only. Nonetheless, at that time they didn't require TeX source as I remember. Something might have been changed internally.


Heng Li @lh3lh3.bsky.social · 23d

And learn what fully AI-generated websites look like. Avoid them, as they are more likely to be scams.


Reposted by Heng Li

James Bonfield @jbonfield.bsky.social · 23d

Heads up: ignore samtools dot org, similarly minimap2 dot com and likely others. It's owned by a known phishing site and while the binaries they offer look valid currently (but note they may be serving us different binaries to others), that could change.

I.e., it's not us (the Samtools team)! Be warned.


Heng Li @lh3lh3.bsky.social · 26d

New blog post – A quick look at Roche's SBX
lh3.github.io/2025/09/11/a...


Heng Li @lh3lh3.bsky.social · 28d

Now preprinted at arxiv.org/abs/2509.07357


Heng Li @lh3lh3.bsky.social · 29d

minimap2.com is potentially a phishing site. Please don't use anything from that website.
github.com/lh3/minimap2...

Phishing site : minimap2.com · Issue #1316 · lh3/minimap2

Not sure how to label this one, but I have come across a website minimap2.com which appears to be AI generated but is serving its own copy of the GitHub repository. If you search the address or em...



Reposted by Heng Li

Jim Shaw @jimshaw.bsky.social · Sep 7

Preprint out for myloasm, our new nanopore / HiFi metagenome assembler!

Nanopore's getting accurate, but

1. Can this lead to better metagenome assemblies?
2. How, algorithmically, to leverage them?

with co-author Max Marin @mgmarin.bsky.social, supervised by Heng Li @lh3lh3.bsky.social

1 / N

bioRxiv Bioinfo @biorxiv-bioinfo.bsky.social · Sep 7

High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1



Heng Li @lh3lh3.bsky.social · Sep 4

Of course, also thank Andrea Guarracino and Andrew Carroll for their quick and careful review!


Heng Li @lh3lh3.bsky.social · Sep 4

"Received: July 4, 2025. Revised: August 7, 2025. Accepted: August 15, 2025" and published on September 4. This is a simple and straightforward paper, but the speedy editorial process is still impressive. It could have been even faster if I had responded to the initial editorial request more promptly.


Heng Li @lh3lh3.bsky.social · Sep 4

Now published in GigaScience with minor improvements: academic.oup.com/gigascience/...

* Download: zenodo.org/records/1490...
* More info: github.com/lh3/panmask

Heng Li @lh3lh3.bsky.social · Jul 8

Preprint on "Finding easy regions for short-read variant calling from pangenome data": arxiv.org/abs/2507.03718


Reposted by Heng Li

Ming Tommy Tang @tommytang.bsky.social · Aug 27

(Harvard STAT115): Introduction to Bioinformatics and Computational Biology by Shirley Liu.
liulab-dfci.github.io/bioinfo-com...


Heng Li @lh3lh3.bsky.social · Aug 7

Timing and practical needs are key factors. BCF also has a proper spec and an okay library and it is based on bgzf. The library is more complex to use because VCF is more complex.


Heng Li @lh3lh3.bsky.social · Aug 7

CRAM also ticks some of the points, but the storage cost alone wins users over.

Heng Li @lh3lh3.bsky.social · Aug 7

BAM is more of a literal dump. BCF is not used often because 1) VCF is a small fraction of SAM. Performance is not as critical. 2) Tabix is good enough. 3) Too complex to implement as VCF is not designed with binary in mind. 4) Too late. GATK started to support it in 2019. 5) Binary version changes.


Heng Li @lh3lh3.bsky.social · Aug 6

I think genomic file formats should be text-first and designed with binary representations in mind


Heng Li @lh3lh3.bsky.social · Jul 31

In 2020, NCBI considered removing base quality from free-to-download SRA files. I responded and wrote two blog posts to argue against that. I don't know how much weight NCBI gave to everyone's responses, but they are keeping quality in most SRA files nowadays. grants.nih.gov/grants/guide...

NOT-OD-20-108: Request for Information: Use of Cloud Resources and New File Formats for Sequence Read Archive Data



Heng Li @lh3lh3.bsky.social · Jul 31

Longdust, a new tool to identify highly repetitive STRs, VNTRs, satellite DNA and other low-complexity regions (LCRs). Similar to SDUST but for long regions.
github.com/lh3/longdust




Heng Li @lh3lh3.bsky.social · Jun 29

Pangolin only supports SNPs and doesn't distinguish donor/acceptor, but mutating cTaAt has little effect there, too. I don't have AlphaGenome numbers. Overall, this comes back to @ewanbirney.bsky.social's question: Is BP essential to splicing? Do these models really see BP?


Heng Li @lh3lh3.bsky.social · Jun 29

chr3:143021319:cTaAt matches the yTnAy BP consensus at the 1-based coordinate. The Broad's SpliceAI server doesn't think chr3:143021320:TaA>CaG would lead to an acceptor loss – it's non-essential to splicing. SpliceAI thinks donor might be more affected.


Heng Li @lh3lh3.bsky.social · Jun 26

Thanks for the explanation! This is a good example then. It will be interesting to see if Pangolin or SpliceAI can capture these with ISM. Another question is how often BPs are found across all acceptor sites.


Heng Li @lh3lh3.bsky.social · Jun 26

IMHO, it is important to understand what a model learns; otherwise, the model might just capture nuances irrelevant to biology. 4/
