Bede Constantinides
@bedec.bsky.social
900 followers 1.4K following 110 posts
Interested in infectious disease informatics. Research fellow at the University of Birmingham. Also cycling, photography, active travel. https://bede.im
Pinned
bedec.bsky.social
New preprint! Deacon is a versatile tool for filtering FASTA/FASTQ files and streams at hundreds of megabases per second using minimizers, built with rapid metagenomic host depletion in mind, but equally useful for search.
github.com/bede/deacon
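For readers unfamiliar with minimizers, here is a minimal sketch of the general idea of minimizer-based read filtering. It is not Deacon's implementation: it assumes plain lexicographic ordering rather than hashing, ignores reverse complements, and the function names and the `k`/`w` values are hypothetical choices for illustration.

```python
def minimizers(seq, k, w):
    """Return the set of minimizer k-mers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest one (real tools usually order k-mers by a hash instead).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w]))
    return picked


def count_hits(read, index, k, w):
    """Count distinct minimizers of the read that are present in the index."""
    return len(minimizers(read, k, w) & index)


# Hypothetical parameters and toy sequences, for illustration only.
K, W = 5, 4
reference = "ACGTTGCATGCCGATAGCTAGGCTA"
index = minimizers(reference, K, W)

matching_read = reference[5:20]      # a read drawn from the reference
unrelated_read = "TTTTTTTTTTTTTTT"   # shares no k-mers with the reference

hits_matching = count_hits(matching_read, index, K, W)
hits_unrelated = count_hits(unrelated_read, index, K, W)
print(hits_matching, hits_unrelated)
```

A filter would then keep or discard a read depending on whether its hit count reaches some threshold. Because only one k-mer per window is stored, the index stays much smaller than a full k-mer set.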
Reposted by Bede Constantinides
benlangmead.bsky.social
I've added 7 videos to my Burrows-Wheeler indexing playlist (www.youtube.com/playlist?lis...), rounding out the r-index series and adding a 5-part series on the move structure. Now 27 videos in that playlist. I aim to add videos on prefix-free parsing, PBWT, Wheeler languages/automata in the future.
Burrows-Wheeler Indexing - YouTube
Videos on : (a) the Burrows-Wheeler Transform (BWT), (b) the FM Index, which uses the BWT to construct a full-text index, (c) Wheeler graphs, (d) r-index, an...
www.youtube.com
bedec.bsky.social
Yes I got the upload pings and almost posted the same here. A treasure trove of material.
bedec.bsky.social
No harm in sticking with an old version :-) Results should be identical to 0.10.0 except for seqs containing IUPAC ambiguous bases, changing the classification of exactly one read in my IUPAC benchmark dataset.
bedec.bsky.social
Deacon 0.11.0:
- Local server mode
- Ultra-careful handling of non-ACGT
- Faster indexing & index loading
- Denser index now stores k-mers not hashes
- xxHash & FxHash replaced with rapidhash::fast
- Bug fixes

Thanks @curiouscoding.nl (and others!) for contributions
github.com/bede/deacon/...
Release 0.11.0 · bede/deacon
Major release incorporating new features, fixes and performance optimisations. Includes many PRs from @RagnarGrootKoerkamp, taking advantage of new features in simd-minimizers, packed-seq and parase...
github.com
bedec.bsky.social
I had a quick look but was intimidated by the docs honestly. Nick from the Zstd team mentioned BAMs wrt a forthcoming post on use cases, so hopefully we'll soon have some diverse profiles to learn from. news.ycombinator.com/item?id=4549...
Do you happen to have a pointer to a good open source dataset to look at? Naivel... | Hacker News
news.ycombinator.com
bedec.bsky.social
Nick Terrell (Meta): "Looking at the BAM format, it looks like the tokenization portion will be easy. Which means I can focus on the compression side, which is more interesting."
bedec.bsky.social
It doesn't compress e.g. FASTA out of the box currently. The bundled profiles are geared towards sensible fixed-length formats like Parquet. I imagine we'll see new profiles emerge quickly openzl.org/getting-star...
Quick Start - OpenZL
openzl.org
bedec.bsky.social
"OpenZL is our answer to the tension between the performance of format-specific compressors and the maintenance simplicity of a single executable binary."
engineering.fb.com/2025/10/06/d...
bedec.bsky.social
Oh it's a pity if splitting into crates is punished with much worse compile times. Is this the case?
bedec.bsky.social
Fun question! Doubt it would noticeably change the plot though. Gzip's 32KB window is a tiny fraction of genome length for anything larger than a virus. Would compress low complexity regions well but miss e.g. gene duplications.
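The window effect can be seen with Python's standard `zlib` module, which implements the same DEFLATE algorithm as gzip. This is a rough sketch on random synthetic DNA, not a benchmark: the sequence lengths, seeds and helper names are arbitrary assumptions. It measures how many extra compressed bytes a second copy of a sequence costs when the first copy is inside versus outside the 32KB window.

```python
import random
import zlib


def random_dna(n, seed):
    """Return n random bases as bytes (synthetic data, for illustration)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=n)).encode()


def second_copy_cost(unit, spacer):
    """Extra compressed bytes needed to append a second copy of `unit`."""
    with_repeat = len(zlib.compress(unit + spacer + unit, 9))
    without_repeat = len(zlib.compress(unit + spacer, 9))
    return with_repeat - without_repeat


unit = random_dna(8_000, seed=1)
near_spacer = random_dna(4_000, seed=2)   # repeat distance 12 kB: inside the window
far_spacer = random_dna(64_000, seed=3)   # repeat distance 72 kB: outside the window

near_cost = second_copy_cost(unit, near_spacer)
far_cost = second_copy_cost(unit, far_spacer)
print(f"second copy costs {near_cost} bytes when near, {far_cost} bytes when far")
```

When the repeat is further back than the window, the compressor has to encode the second copy almost from scratch, which is the situation for duplications separated by more than 32KB in a genome.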
bedec.bsky.social
I can't bring myself to bet against HashSet given recent experience
Reposted by Bede Constantinides
curiouscoding.nl
Looking for people to test the latest version of simd-sketch.

It's now 2x as fast at sketching, and supports skipping over kmers containing N and other ambiguous bases (which is only ~35% slower).

'cargo install simd-sketch' is right there under your fingertips ;)

github.com/RagnarGrootK...
GitHub - RagnarGrootKoerkamp/simd-sketch: Compute bottom-s sketches and s-buckets sketches, using simd-minimizers crate.
github.com
Reposted by Bede Constantinides
curiouscoding.nl
FxHashSet::<u32>::contains throughput is wild!

- Up to 4x slowdown for negative queries due to probing.
- Positive queries are fast for small tables, but slow in RAM because they need 2 cache misses.

Lots of variance depending on the load factor, i.e. whether n is close to 87.5% of a power of 2.
Plot of (inverse) throughput of querying an FxHashSet<u32> of increasing size. 3 lines show the throughput when 1%, 50%, or 99% of queries are present in the set. The 1% and 50% lines show big spikes just before every power of 2, where they are up to 3x slower than in the best case.
Reposted by Bede Constantinides
dotnagy.bsky.social
Pleased to see this pre-printed, highlighting the completeness/accuracy of @nanoporetech.com long-read genome assembly for clinical Enterobacterales: www.biorxiv.org/content/10.1...

Thanks to colleagues @modmedmicro.bsky.social, @ukhsa.bsky.social, @genewiz.bsky.social and @oxfordbrc.bsky.social!
Reposted by Bede Constantinides
pathogenomenick.bsky.social
Terrific new feature presented by @theo.io on @pathoplexus.org called SeqSets, which generates DOIs for sequence subsets used in publications. These can then be tracked via CrossRef, allowing data generators to follow the impact of their data! #IMMEMXiV
bedec.bsky.social
Thanks! Had assumed there'd be a good reason…
bedec.bsky.social
Oh I didn't know 2bit supported Ns through extra data blocks. I wonder how easily it can be parsed in parallel compared to Binseq – perhaps @noamteyssier.bsky.social can comment?
bedec.bsky.social
Some applications do need that 5th symbol to represent ambiguity / N, which I assume is why BAM uses 4bit encoding (IIRC). 2bit + bit mask(s) is another way to do it.
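As a minimal sketch of the "2bit + bit mask" idea: each base is packed into two bits, and a separate one-bit-per-base mask records which positions are ambiguous. This is an illustrative layout only, not the 2bit, BAM or Binseq on-disk format. The function names are hypothetical, and it assumes every non-ACGT symbol can be collapsed to N, which loses the other IUPAC codes.

```python
BASES = "ACGT"
CODES = {base: code for code, base in enumerate(BASES)}


def encode(seq):
    """Pack seq into 2 bits per base plus a 1-bit-per-base ambiguity mask."""
    packed = bytearray((len(seq) + 3) // 4)
    mask = bytearray((len(seq) + 7) // 8)
    for i, base in enumerate(seq):
        code = CODES.get(base)
        if code is None:
            # Non-ACGT: flag the position in the mask, store a placeholder 0.
            mask[i // 8] |= 1 << (i % 8)
            code = 0
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), bytes(mask)


def decode(packed, mask, n):
    """Rebuild a sequence of length n; masked positions come back as N."""
    out = []
    for i in range(n):
        if (mask[i // 8] >> (i % 8)) & 1:
            out.append("N")
        else:
            out.append(BASES[(packed[i // 4] >> (2 * (i % 4))) & 3])
    return "".join(out)


seq = "ACGTNNACGT"
packed, mask = encode(seq)
roundtrip = decode(packed, mask, len(seq))
print(len(packed), len(mask), roundtrip)
```

This layout costs 3 bits per base in total. Storing runs of N as separate blocks, rather than a full mask, is cheaper when ambiguous bases are rare.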
bedec.bsky.social
Predictably the comments section is mostly horror at the state of FASTA format
bedec.bsky.social
Thanks for sharing – another related tip is to increase the window size for even higher CR bsky.app/profile/bede...
bedec.bsky.social
Blogged about how zstd --long fills the gap between fast and slow-but-high-ratio genome compression methods log.bede.im/2025/09/12/z...