Jouni Sirén
@jltsiren.bsky.social
120 followers 67 following 28 posts
Researcher at UCSC Genomics Institute. Space-efficient data structures and pangenome graphs.
Reposted by Jouni Sirén
xian-chang.bsky.social
🦒Long read giraffe is out!🦒
Mapping long reads to pangenome graphs is ~10x faster than with GraphAligner, with veeery slightly better mapping accuracy, short variant calling, and SV genotyping than GraphAligner or Minimap2
biorxiv-bioinfo.bsky.social
Rapid, accurate long- and short-read mapping to large pangenome graphs with vg Giraffe https://www.biorxiv.org/content/10.1101/2025.09.29.678807v1
Reposted by Jouni Sirén
jermp.bsky.social
We are glad to announce that the next workshop “Data Structures in Bioinformatics” (DSB 2026) will take place in Venice, Italy, on *February 18-19*, 2026. dsb-meeting.github.io/DSB2026/ Book the dates! #DSB26
DSB 2026 Venice - February 18-19
Workshop Data Structures in Bioinformatics
dsb-meeting.github.io
jltsiren.bsky.social
So maybe we need some kind of stable identifiers (hashes?) for pangenome graphs. And then we need a way of storing graph / parent identifiers in GFA and alignment files. 7/7
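A minimal sketch of what a content-based identifier could look like, assuming the graph is available as GFA text. The function name `graph_id` and the choice to hash only sorted segment and link records are assumptions made for illustration, not an existing convention. The sketch ignores paths and does not canonicalize a link against its reverse-complement form.

```python
import hashlib

def graph_id(gfa_text: str) -> str:
    """Hypothetical identifier: SHA-256 over sorted S and L records of a GFA."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # Keep record type, segment name, and sequence; drop optional tags.
            records.append("\t".join(fields[:3]))
        elif fields[0] == "L":
            # Keep record type, from, from-orientation, to, to-orientation.
            records.append("\t".join(fields[:5]))
    digest = hashlib.sha256()
    # Sorting makes the identifier independent of the order of lines in the file.
    for record in sorted(records):
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()
```

With this scheme, two files that list the same segments and links in a different order get the same identifier, while any change to a sequence changes it.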
jltsiren.bsky.social
We also need a way of specifying the correct reference for reconstructing the reads. That's not as easy with graphs as with linear sequences. For example, if you have aligned the reads to a subgraph (e.g. personalized graph), the supergraph (e.g. clipped graph) is also a valid reference. 6/n
jltsiren.bsky.social
While working on GAF-base, I realized that GAF is not the file format I want to use. GAF prioritizes numerical statistics, while the information needed for reconstructing the read and the alignment is optional. In archival and variant calling, it should be the opposite. 5/n
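To make the point about mandatory and optional fields concrete, here is a minimal sketch of a GAF line parser. It assumes the usual 12 mandatory tab-separated columns followed by optional `TAG:TYPE:VALUE` fields. The class and function names are hypothetical, and the `cs` difference string in the usage note is just one example of an optional tag.

```python
from dataclasses import dataclass, field

@dataclass
class GafRecord:
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    path: str
    path_length: int
    path_start: int
    path_end: int
    matches: int
    block_length: int
    mapq: int
    # Optional tags, stored as name -> value (the type letter is dropped).
    tags: dict = field(default_factory=dict)

def parse_gaf_line(line: str) -> GafRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated fields")
    tags = {}
    for item in fields[12:]:
        # Split only twice, as the value itself may contain ':'.
        name, _type, value = item.split(":", 2)
        tags[name] = value
    return GafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4], fields[5], int(fields[6]), int(fields[7]),
        int(fields[8]), int(fields[9]), int(fields[10]), int(fields[11]),
        tags,
    )
```

A record such as `parse_gaf_line("read1\t10\t0\t10\t+\t>1>2\t12\t1\t11\t10\t10\t60\tcs:Z::10")` parses fine with or without the trailing tag. Everything needed to reconstruct the read lives in `tags`, which may be empty.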
jltsiren.bsky.social
When used with GBZ-base, GAF-base allows extracting all reads overlapping with / contained in the subgraph. Queries with 10 kbp subgraphs are effectively instantaneous with short reads, while taking a second or two with long reads. 4/n
jltsiren.bsky.social
Recently, I started working on another database: GAF-base. It could be described as "hacky pangenome CRAM in SQLite". GAF-base works, at least with reads aligned with Giraffe, and file sizes are typically somewhere between BAM and CRAM. 3/n
jltsiren.bsky.social
It has been useful for investigating various vg issues, and there are also some external users. Sequence Tube Map would be a nice application, but we are not there yet. 2/n
GitHub - vgteam/sequenceTubeMap: displays multiple genomic sequences in the form of a tube map
github.com
jltsiren.bsky.social
GBZ-base has been a side project for me for a couple of years. It's basically a GBZ graph stored in SQLite instead of a custom file format. You can convert a GBZ graph to GBZ-base quickly and then extract subgraphs around nodes / reference positions on a laptop. 1/n
GitHub - jltsiren/gbz-base: Prototype for an immutable pangenome graph in SQLite
github.com
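The general idea of keeping a graph in SQLite and pulling out a neighborhood of a node can be sketched with a toy example. The schema below (`nodes` and `edges` tables) is a hypothetical illustration and not the actual gbz-base schema. It ignores node orientations and haplotype paths, and it measures the context in edges rather than base pairs.

```python
import sqlite3

def build_toy_db(nodes, edges):
    """Create an in-memory database from (id, sequence) and (src, dst) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, sequence TEXT NOT NULL)")
    conn.execute("CREATE TABLE edges (src INTEGER NOT NULL, dst INTEGER NOT NULL)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", nodes)
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    conn.commit()
    return conn

def extract_subgraph(conn, start, radius):
    """Return {node_id: sequence} for nodes within `radius` edges of `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(radius):
        next_frontier = set()
        for node in frontier:
            # Follow edges in both directions.
            rows = conn.execute(
                "SELECT dst FROM edges WHERE src = ? "
                "UNION SELECT src FROM edges WHERE dst = ?",
                (node, node),
            )
            for (neighbor,) in rows:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    result = {}
    for node in sorted(seen):
        row = conn.execute("SELECT sequence FROM nodes WHERE id = ?", (node,)).fetchone()
        if row is not None:
            result[node] = row[0]
    return result
```

Each query only touches the rows around the start node, so the whole graph never has to be loaded into memory.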
Reposted by Jouni Sirén
robp.bsky.social
Last talk of the day (before posters) "Lossless Pangenome Indexing Using Tag Arrays" presented by Parsa Eskandar! #WABI25
jltsiren.bsky.social
Reasoning about maximal paths is difficult, as they are global objects. Giovanni and Travis came up with an equivalent local property related to stable sorting. They called it Wheeler graphs, and that's when theoretical developments took off. 6/6
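A minimal sketch of checking the local property for a given node order, using the conditions as they are usually stated: nodes without incoming edges come first, and for any two edges (u, v, a) and (u', v', a'), a < a' implies v < v', while a = a' and u < u' implies v <= v'. The function name and the edge representation are assumptions made for illustration. The check is a brute-force comparison of all edge pairs, and it only verifies a candidate order rather than finding one.

```python
def is_wheeler_order(order, edges):
    """order: list of nodes; edges: list of (source, target, label) triples."""
    rank = {node: i for i, node in enumerate(order)}
    has_incoming = {v for _, v, _ in edges}

    # Nodes with in-degree 0 must precede all nodes with incoming edges.
    seen_incoming = False
    for node in order:
        if node in has_incoming:
            seen_incoming = True
        elif seen_incoming:
            return False

    for u, v, a in edges:
        for u2, v2, a2 in edges:
            # Smaller label implies strictly smaller target.
            if a < a2 and not rank[v] < rank[v2]:
                return False
            # Same label and smaller source implies a target that is not larger.
            if a == a2 and rank[u] < rank[u2] and not rank[v] <= rank[v2]:
                return False
    return True
```

Both conditions only compare pairs of edges, which is what makes the property local.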
jltsiren.bsky.social
BOSS was a parallel development for de Bruijn graphs. It inspired me to look into extending the GCSA from DAGs to more general graphs. We ended up with what is now known as Wheeler DFAs, characterized by non-overlapping lexicographic ranges of maximal path labels. 5/6
jltsiren.bsky.social
That graph represents recombinations at aligned positions with a unique context after the position. By using a prefix-doubling algorithm, we can instead get a graph that represents recombinations at any aligned position. And that graph was GCSA. 4/6
jltsiren.bsky.social
The analysis starts with a multiple sequence alignment and counts the number of runs of aligned suffixes in lexicographic order. If you collapse the runs into nodes, you get a graph that can be indexed with a slight extension of the XBWT. 3/6
jltsiren.bsky.social
I contributed to a chapter on graph indexing before/beyond Wheeler graphs in Manzini's Festschrift. I wrote a bit on how the analysis of RLBWT under duplication and edits became GCSA, and how GCSA is related to Wheeler graphs. 2/6
Graph Indexing Beyond Wheeler Graphs
drops.dagstuhl.de
jltsiren.bsky.social
There was a workshop on 25 years of the FM-index and the CSA after SEA. I would have liked to attend, but I had other commitments. The invited speakers were Giovanni Manzini and Roberto Grossi, as the other purpose of the workshop was to present them with Festschrifts for their 60th birthdays. 1/6
SEA 2025
regindex.github.io
jltsiren.bsky.social
Maybe you can write useful specs after an extensible file format has become popular and evolved over time. But then it will be difficult to convince people to switch to the new format.
jltsiren.bsky.social
Well-specified formats would be nice. But it's too likely that the specification is obsolete. Or it lacks necessary features. Or it just doesn't exist, because key people can't agree on the details. (Or all three, as with most pangenome file formats.)
jltsiren.bsky.social
As long as we are talking about academic code, relying on someone else's libraries is risky. They are often abandoned as labs lose funding, people move on or leave academia, and so on. A serious binary format should have independent implementations maintained by different labs.
jltsiren.bsky.social
I agree in principle. But text files with tab-separated fields are popular for a reason. The format is simple, you can extend it when the requirements change, you can write your own parser if necessary, and you can also use standard tools for many tasks.
jltsiren.bsky.social
Surprisingly, we found that the BWT of 464 HPRC release 2 haplotypes has fewer runs than the BWT of 90 release 1 haplotypes. The quality of the assemblies has clearly improved.
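For readers unfamiliar with the measure, here is a toy sketch of counting runs in the BWT of a sequence collection. It builds the BWT naively by sorting all rotations of the concatenated text, which is only practical for tiny inputs and is not how it is done at HPRC scale. The function names and the use of `$` as a shared terminator are assumptions made for illustration.

```python
from itertools import groupby

def bwt(text: str) -> str:
    """Naive BWT: sort all rotations and take the last column."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)

def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters."""
    return sum(1 for _ in groupby(s))

def collection_runs(sequences) -> int:
    # Assumption: sequences are over ACGT, so '$' sorts before every base.
    return count_runs(bwt("".join(seq + "$" for seq in sequences)))
```

Adding exact copies of a sequence does not add runs in this toy setup, and near-identical copies add only a few. That is why the number of runs reflects how much novel or erroneous sequence a collection contains rather than its raw size.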
jltsiren.bsky.social
This is still a work in progress, but we show that it's feasible to build the data structures for hundreds of human haplotypes, and that reporting the hits is fast for long enough matches.
jltsiren.bsky.social
With tag arrays, we make the FM-index report graph positions instead of sequence positions. Unlike the suffix array, this can be run-length encoded effectively. And then we can quickly report all distinct hits in a BWT interval.
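The idea can be illustrated with a toy sketch that stores explicit sorted suffixes instead of an FM-index. Each suffix is tagged with the graph position (node, offset) where it starts, the tags are run-length encoded, and a query reports the distinct tags in the matching interval. All names here are hypothetical, and the construction is a brute-force stand-in for illustration, not the method from the paper.

```python
from bisect import bisect_left
from itertools import groupby

def build_tag_array(graph, haplotypes):
    """graph: {node_id: sequence}; haplotypes: list of walks (lists of node ids).
    Returns (suffixes, tags), both in lexicographic order of the suffixes."""
    entries = []
    for walk in haplotypes:
        text = "".join(graph[node] for node in walk) + "$"
        positions = [(node, off) for node in walk for off in range(len(graph[node]))]
        positions.append(None)  # the terminator has no graph position
        for i in range(len(text)):
            entries.append((text[i:], positions[i]))
    entries.sort(key=lambda entry: entry[0])
    return [s for s, _ in entries], [t for _, t in entries]

def run_length_encode(tags):
    """Collapse consecutive equal tags into (tag, run length) pairs."""
    return [(tag, sum(1 for _ in group)) for tag, group in groupby(tags)]

def distinct_hits(suffixes, tags, pattern):
    """Distinct graph positions of suffixes that start with `pattern`."""
    lo = bisect_left(suffixes, pattern)
    hi = lo
    while hi < len(suffixes) and suffixes[hi].startswith(pattern):
        hi += 1
    return sorted({tag for tag in tags[lo:hi] if tag is not None})
```

For example, with `graph = {1: "GA", 2: "T", 3: "C", 4: "TACA"}` and walks `[1, 2, 4]`, `[1, 3, 4]`, `[1, 2, 4]`, the pattern `TA` occurs once in each of the three haplotypes, but all occurrences map to the same graph position. Suffixes that share a long context tend to share a tag, which is what makes the run-length encoding effective.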
jltsiren.bsky.social
An FM-index on its own reports the same hits in each haplotype. K-mer indexes make a fixed trade-off between sensitivity and specificity. GCSA and Wheeler graphs are often difficult to build for the graph we have.