Pierre Marijon
@pierre-marijon.bsky.social
Research engineer, #bioinformatics, #genomic, #assembly, #variantcalling he|him #BiInSci #dyslexic #disabled

https://pierre.marijon.fr/link.html
And yes, definitely don't use the default hasher unless you really need HashDoS resistance.
March 25, 2025 at 12:33 PM
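To illustrate the point about the default hasher, here is a minimal sketch of plugging a cheaper hasher into a standard `HashMap`. The `Fnv1a` type, the `FastMap` alias and `count_kmers` are hypothetical names made up for this example, and FNV-1a is only one possible choice; it is fast on short keys but gives no protection against crafted collisions.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// FNV-1a 64-bit: a simple non-cryptographic hash, NOT HashDoS resistant.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A HashMap that uses FNV-1a instead of the default SipHash-based hasher.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Count every k-length window of `seq` (k must be non-zero, or `windows` panics).
fn count_kmers(seq: &[u8], k: usize) -> FastMap<&[u8], u32> {
    let mut counts = FastMap::default();
    for kmer in seq.windows(k) {
        *counts.entry(kmer).or_insert(0) += 1;
    }
    counts
}
```

Calling `count_kmers(b"ACGTACGT", 4)` gives a map where `ACGT` is counted twice. Only use this kind of hasher when the keys come from trusted input.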
I slowly move closer and whisper:
- niffler

crates.io/crates/niffler
March 25, 2025 at 12:32 PM
Maybe I could try to improve this part, with some limitations (no compression, no multiline FASTA), by using memmap.

Similar to github.com/natir/biommap
GitHub - natir/biommap: A vcf parser that use memory mapping to get high performance.
March 19, 2025 at 8:19 AM
If you accept huge memory usage, odd k only, and canonical k-mers, code similar to pcon could be nice.

github.com/natir/pcon
GitHub - natir/pcon: Prompt COuNter
March 19, 2025 at 8:13 AM
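The trade-off described above (large memory, odd k, canonical k-mers) can be sketched as a dense table with one slot per possible k-mer. This is not pcon's actual code; `encode` and `count_canonical` are hypothetical names, and the `k <= 13` limit is an assumption chosen to keep the table at 64 MiB. Odd k guarantees that no k-mer equals its own reverse complement, so the smaller of the two 2-bit values is always a well-defined canonical form.

```rust
/// 2-bit encode a nucleotide: A=0, C=1, G=2, T=3, so the complement is `3 - code`.
fn encode(base: u8) -> Option<u64> {
    match base {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'G' | b'g' => Some(2),
        b'T' | b't' => Some(3),
        _ => None,
    }
}

/// Count canonical k-mers of `seq` in a dense table of 4^k saturating u8 counters.
fn count_canonical(seq: &[u8], k: usize) -> Vec<u8> {
    assert!(k % 2 == 1 && k <= 13, "sketch: odd k, small enough for a dense table");
    let mask = (1u64 << (2 * k)) - 1;
    let mut counts = vec![0u8; 1usize << (2 * k)];
    // `fwd` is the forward k-mer, `rev` its reverse complement, both updated per base.
    let (mut fwd, mut rev, mut valid) = (0u64, 0u64, 0usize);
    for &b in seq {
        match encode(b) {
            Some(c) => {
                fwd = ((fwd << 2) | c) & mask;
                rev = (rev >> 2) | ((3 - c) << (2 * (k - 1)));
                valid += 1;
            }
            None => {
                // Any non-ACGT base (N, newline, ...) restarts the window.
                fwd = 0;
                rev = 0;
                valid = 0;
            }
        }
        if valid >= k {
            let canon = fwd.min(rev) as usize;
            counts[canon] = counts[canon].saturating_add(1);
        }
    }
    counts
}
```

With `count_canonical(b"ACGT", 3)`, both `ACG` and its reverse complement `CGT` land in the same slot (index 6), which therefore holds 2.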
Maybe I missed it, but your code doesn't seem to handle forward and reverse k-mers?

Does smidminimizer do this job?
March 19, 2025 at 8:08 AM
For me, yes, it seems to be an augmented tree.

Large queries mean longer intervals.

If you are interested: before I discovered COITrees, I rewrote cgranges and the interpolation-index version of cgranges in Rust: github.com/natir/clairi...
GitHub - natir/clairiere: A rust implementation of implicit interval tree with interpolation index.
January 10, 2025 at 4:10 PM
The set of intervals in which we search is effectively static.

Generally the query sizes are between 1 and 50: many small queries, and sorted.
But it would be useful to also be able to run much larger queries, 50M, even if they cost much more, since there are very few of them.
January 10, 2025 at 3:50 PM
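For a static interval set with many small queries, as described above, a minimal sketch of the idea is a sorted array augmented with a running maximum of interval ends. This is not how cgranges or COITrees are implemented; `StaticIntervals` and its methods are hypothetical names, and intervals are assumed to be half-open `[start, end)`.

```rust
/// Static interval index: intervals sorted by start, plus a running maximum of ends.
struct StaticIntervals {
    ivs: Vec<(u64, u64, usize)>, // (start, end, id), half-open [start, end)
    max_end: Vec<u64>,           // max_end[i] = largest end among ivs[0..=i]
}

impl StaticIntervals {
    fn new(mut ivs: Vec<(u64, u64, usize)>) -> Self {
        ivs.sort_unstable();
        let mut running = 0u64;
        let max_end = ivs
            .iter()
            .map(|&(_, end, _)| {
                running = running.max(end);
                running
            })
            .collect();
        StaticIntervals { ivs, max_end }
    }

    /// Ids of all intervals overlapping the half-open query [qs, qe).
    fn overlaps(&self, qs: u64, qe: u64) -> Vec<usize> {
        // Intervals at index >= hi start at or after qe, so they cannot overlap.
        let hi = self.ivs.partition_point(|&(start, _, _)| start < qe);
        let mut hits = Vec::new();
        for i in (0..hi).rev() {
            // Nothing at index <= i ends after qs: stop scanning.
            if self.max_end[i] <= qs {
                break;
            }
            if self.ivs[i].1 > qs {
                hits.push(self.ivs[i].2);
            }
        }
        hits
    }
}
```

The binary search makes small queries cheap when intervals have similar lengths, but one very long interval early in the array forces a long backward scan. That is the case a real augmented tree handles better.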
Ok great I wasn't crazy seeing similarities!

From my understanding it looks more like a binary search tree with a pair of positions as the node (the cgranges article is very good).

We're interested in the task of annotating a set of variants with a set of genome annotations (gene positions).
January 10, 2025 at 3:50 PM
Huge work! I'm drifting toward one of my current interests: could this work be applied to searching in an interval tree? For example, finding the genomic annotations that intersect a variant.

COITrees (github.com/dcjones/coit...) uses a cache-aware binary tree search, which I thought of when reading your post.
GitHub - dcjones/coitrees: A very fast interval tree data structure
January 10, 2025 at 2:23 PM
I say almost the same thing each time I teach the Burrows–Wheeler Transform.
December 16, 2024 at 1:54 PM