Lightnews — Scholar-powered news

Reposted by Pierre Peterlongo

Jim Shaw @jimshaw.bsky.social · Sep 7

Preprint out for myloasm, our new nanopore / HiFi metagenome assembler!

Nanopore's getting accurate, but

1. Can this lead to better metagenome assemblies?
2. How, algorithmically, to leverage them?

with co-author Max Marin @mgmarin.bsky.social, supervised by Heng Li @lh3lh3.bsky.social

1 / N

bioRxiv Bioinfo @biorxiv-bioinfo.bsky.social · Sep 7

High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1

Pierre Peterlongo @pierrepeterlongo.bsky.social · Sep 4

❗ I clearly consider this THE most important result of the last decade for exploiting and democratizing genomic data.
I think there will be a "before" and an "after" logan and logan-search
github.com/IndexThePlan...
logan-search.org
Have a look at this thread

Rayan Chikhi @rayanchikhi.bsky.social · Sep 3

🌎👩‍🔬 For 15+ years biology has accumulated petabytes (million gigabytes) of🧬DNA sequencing data🧬 from the far reaches of our planet.🦠🍄🌵

Logan now democratizes efficient access to the world’s most comprehensive genetics dataset. Free and open.

doi.org/10.1101/2024...

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

🤝 Amazing collaboration with @jermp.bsky.social, @yhhshb.bsky.social, @robp.bsky.social, Victor Levallois, and Bertrand Le Gal, and the help of @yoann.bsky.social. 8/8

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

🌊 On metagenomic data, other tools such as kmindex are good alternatives. At the same time, Kaminari consistently ranks as one of the fastest tools across all data types, generating the smallest indexes (or the lowest FPR). 7/8

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

💾 For a fixed false positive rate, it uses up to 37x less space than COBS while being an order of magnitude faster to build and query. 6/8

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

📊 Experimental results show Kaminari's superiority in index size and query performance across various genomic datasets. 5/8

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

🧬 Kaminari's design leverages properties of k-mer minimizers for compact space and fast query time, as inspired by the techniques proposed in Fulgor. 4/8
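
To make the minimizer idea concrete, here is a minimal sketch (not Kaminari's or Fulgor's actual code). It maps each k-mer to its minimizer, taken here as the lexicographically smallest m-mer it contains. That ordering is an assumption made for simplicity, since real tools usually order m-mers by a hash value. The function names `minimizer` and `minimizers_of` are hypothetical.

```python
def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer contained in kmer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def minimizers_of(seq: str, k: int, m: int) -> list[tuple[str, str]]:
    """Pair every k-mer of seq with its minimizer."""
    return [(seq[i:i + k], minimizer(seq[i:i + k], m))
            for i in range(len(seq) - k + 1)]


pairs = minimizers_of("ACGTACGGT", k=5, m=3)
# Consecutive k-mers overlap, so many of them share the same minimizer.
distinct = {mz for _, mz in pairs}
print(len(pairs), "k-mers,", len(distinct), "distinct minimizers")
```

In this toy input, five k-mers collapse onto two minimizers, which hints at why indexing by minimizer rather than by k-mer can save space.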

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

💻 We implemented Kaminari in C++17, available under the MIT license at github.com/yhhshb/kaminari. Additional results and reproducibility info at github.com/vicLeva/benchmarks_kaminari. 3/8

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

🔍 Key findings include:
- Use of minimizers and integer compression for indexing.
- Lower memory footprint and faster query times.
- Minimal impact of false positives on result ranking, using the Rank-Biased Overlap (RBO) metric.
2/8
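
For readers unfamiliar with Rank-Biased Overlap, the sketch below computes a truncated version of it for two ranked lists. It follows the usual definition, (1 - p) * sum over depths d of p^(d-1) * A_d, where A_d is the fraction of items shared by the two top-d prefixes. It is only an illustration and not the evaluation code used in the paper: the function name `rbo_truncated`, the default `p = 0.9`, and the choice to stop at the shorter list without extrapolation are all assumptions.

```python
def rbo_truncated(s: list, t: list, p: float = 0.9) -> float:
    """Truncated Rank-Biased Overlap of two rankings without duplicates.

    Sums (1 - p) * p^(d-1) * A_d for d = 1..D, where A_d is the share of
    common items among the top-d of each list and D = min(len(s), len(t)).
    """
    depth = min(len(s), len(t))
    seen_s, seen_t = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_s.add(s[d - 1])
        seen_t.add(t[d - 1])
        agreement = len(seen_s & seen_t) / d   # A_d
        score += p ** (d - 1) * agreement
    return (1 - p) * score


exact_ranking = ["doc3", "doc1", "doc7"]
approx_ranking = ["doc1", "doc3", "doc7"]   # top two swapped
print(rbo_truncated(exact_ranking, approx_ranking, p=0.5))
```

Because the sum is truncated at depth D, two identical lists score 1 - p^D rather than exactly 1. Lower values of p put more weight on the top of the ranking.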

Pierre Peterlongo @pierrepeterlongo.bsky.social · May 27

📜 Excited to share insights from our recent paper: "Kaminari: a resource-frugal index for approximate colored k-mer queries". The study aims to efficiently identify documents containing a query string, focusing on DNA strings. www.biorxiv.org/content/10.1... 🧬 🖥️ 1/8

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 25

Thanks, guys, for your valuable feedback. I modified the code accordingly.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 25

Hi @imartayan.bsky.social, I wanted to run distinct-kmers, but I ran into limitations because my input data contains characters other than ACGTacgt. So I created this: github.com/pierrepeterl...
(again, extremely simple)

GitHub - pierrepeterlongo/hyperloglog_kmer_counter

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 25

That's correct.
I just created this: github.com/pierrepeterl... It is yet another HyperLogLog (HLL) k-mer counter, but a very simple one. And I did not find a way to accumulate the k-mer counts for several input datasets.

GitHub - pierrepeterlongo/hyperloglog_kmer_counter
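
As an illustration of the HyperLogLog idea (this is not the code of the linked repository), here is a minimal sketch that estimates the number of distinct k-mers. The class name `HyperLogLog`, the helper `add_kmers`, the choice of 2^12 registers and the BLAKE2b hash are all assumptions. The `merge` method relies on a standard HyperLogLog property: two sketches built with the same parameters and hash can be combined by a register-wise maximum, which is one possible way to accumulate distinct counts over several inputs.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b: int = 12):
        self.b = b                    # number of bits used to pick a register
        self.m = 1 << b               # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")            # 64-bit hash
        idx = h >> (64 - self.b)                     # first b bits
        rest = h & ((1 << (64 - self.b)) - 1)        # remaining bits
        rank = (64 - self.b) - rest.bit_length() + 1  # leftmost 1-bit position
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        """Union of two sketches: register-wise maximum."""
        if other.b != self.b:
            raise ValueError("sketches must use the same number of registers")
        self.registers = [max(x, y)
                          for x, y in zip(self.registers, other.registers)]

    def estimate(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)        # valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:            # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw


def add_kmers(sketch: HyperLogLog, seq: str, k: int) -> None:
    """Feed every k-mer of seq (as plain text) into the sketch."""
    for i in range(len(seq) - k + 1):
        sketch.add(seq[i:i + k])
```

With 4096 registers the expected relative error is about 1.6%, and the memory used does not depend on the number of k-mers seen.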

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 24

@imartayan.bsky.social I needed a version of distinct-kmers that handles multiple FASTA/FASTQ files.
I created this fork github.com/pierrepeterl...
I'm almost ashamed that this code modification is public, but maybe it can be useful.

GitHub - pierrepeterlongo/distinct-kmers: How many distinct k-mers are there in a sequence?

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 20

I added the notion of insertion order (mentioning your name). However, I don't get the point of the mergeability issue.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 20

Note that the "conservative update" is also something we implemented (without describing it) in fimpera github.com/lrobidou/fim...

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 20

Thanks again for this pointer, @benlangmead.bsky.social. What I described is the same idea, adapted to the case where items are added on the fly, without knowing their final abundance.
The "conservative update" technique fits the case where each item is added together with its abundance.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 18

Oh! Amazing results. That's the difference between you and a Rust beginner.
I'll try to understand your code.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 18

Thanks Ben - I'll look at this.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 18

Results: slightly longer insertion time, but 2 to 3 times lower abundance overestimations.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 18

In short: when adding an element to a counting Bloom filter (cBF), increment only those of its counters that currently hold the minimal value.
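
A minimal sketch of this idea follows. It is an illustration written for this post, not the code from the blog post or from fimpera: the class name `CountingBloomFilter`, the method names, the filter size and the salted BLAKE2b hashes are all assumptions.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item: str) -> list[int]:
        """One counter position per hash function (salted BLAKE2b)."""
        return [
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8,
                                salt=bytes([i])).digest(), "big") % self.size
            for i in range(self.num_hashes)
        ]

    def add(self, item: str) -> None:
        """Classic insertion: increment every counter of the item."""
        for p in set(self._positions(item)):
            self.counters[p] += 1

    def add_minimal(self, item: str) -> None:
        """Increment only the counters that hold the current minimum."""
        positions = set(self._positions(item))
        low = min(self.counters[p] for p in positions)
        for p in positions:
            if self.counters[p] == low:
                self.counters[p] += 1

    def count(self, item: str) -> int:
        """Estimated abundance: the minimum over the item's counters."""
        return min(self.counters[p] for p in self._positions(item))
```

Each call to `add_minimal` still raises the item's minimum by exactly one, so the estimate never drops below the true abundance, while counters shared with other items grow more slowly than with `add`.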

Pierre Peterlongo @pierrepeterlongo.bsky.social · Mar 18

Possibly the simplest idea for reducing overestimation in a counting Bloom filter: a trivial observation + 10 lines of code.
I'm surprised it has not been described before. Please comment if it has.
Blog post here:
pierrepeterlongo.github.io/2025/03/17/m... 🧪🧬🖥️

Pierre Peterlongo @pierrepeterlongo.bsky.social · Jan 30

Yes, ntCard helps a lot, and its precision on reads is impressive. In this case, though, I wanted the exact number for a genome.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Jan 30

I wanted something that uses as little memory as possible. I don't need to count k-mers, only to know the number of unique k-mers. So Jellyfish, KMC, etc. are more than this simple task needs.

Pierre Peterlongo @pierrepeterlongo.bsky.social · Jan 30

Today I wanted to know the number of unique 27-mers in the hg38 human genome (spoiler: there are 2.49 billion). I found no tool for doing this, so I wrote this one: github.com/pierrepeterl...

It may help.
Please use it / improve it.

🧬💻 #bioinformatics

GitHub - pierrepeterlongo/unique_kmer_counter: Count number of unique kmers from fasta or fasta.gz files
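
For illustration only, here is what an exact count of distinct k-mers can look like in a few lines. It is not the code of the linked tool and makes several assumptions: k-mers are counted as they appear on the forward strand (no canonical form), any k-mer containing a character other than A, C, G, T is skipped, and the names `read_fasta` and `count_unique_kmers` are hypothetical. A Python set of this kind would need far too much memory for billions of k-mers, so treat it as a sketch of the logic rather than a low-memory solution.

```python
import gzip
from typing import Iterable, Iterator

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def read_fasta(path: str) -> Iterator[str]:
    """Yield the sequences of a FASTA or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    chunks: list[str] = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    yield "".join(chunks)
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        yield "".join(chunks)


def count_unique_kmers(seqs: Iterable[str], k: int) -> int:
    """Exact number of distinct k-mers, stored as 2-bit-encoded integers."""
    mask = (1 << (2 * k)) - 1
    seen: set[int] = set()
    for seq in seqs:
        code, valid = 0, 0          # rolling k-mer code, run of valid bases
        for base in seq.upper():
            val = ENCODE.get(base)
            if val is None:         # non-ACGT character: restart the window
                code, valid = 0, 0
                continue
            code = ((code << 2) | val) & mask
            valid += 1
            if valid >= k:
                seen.add(code)
    return len(seen)
```

A call such as `count_unique_kmers(read_fasta("genome.fa.gz"), 27)` would then return the exact count, where `genome.fa.gz` is a placeholder file name.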
