Lightnews — Scholar-powered news

Reposted by Jouni Sirén

Xian Chang @xian-chang.bsky.social · 7d

🦒Long read giraffe is out!🦒
Mapping long reads to pangenome graphs is ~10x faster than with GraphAligner, with veeery slightly better mapping accuracy, short variant calling, and SV genotyping than GraphAligner or Minimap2

bioRxiv Bioinfo @biorxiv-bioinfo.bsky.social · 7d

Rapid, accurate long- and short-read mapping to large pangenome graphs with vg Giraffe https://www.biorxiv.org/content/10.1101/2025.09.29.678807v1

Reposted by Jouni Sirén

Giulio Ermanno Pibiri @jermp.bsky.social · Sep 1

We are glad to announce that the next workshop “Data Structures in Bioinformatics” (DSB 2026) will take place in Venice, Italy, on *February 18-19*, 2026. dsb-meeting.github.io/DSB2026/ Book the dates! #DSB26

DSB 2026 Venice - February 18-19

Workshop Data Structures in Bioinformatics

dsb-meeting.github.io

Jouni Sirén @jltsiren.bsky.social · Aug 28

So maybe we need some kind of stable identifiers (hashes?) for pangenome graphs. And then we need a way of storing graph / parent identifiers in GFA and alignment files. 7/7
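
One way to picture a hash-based identifier is sketched below. This is an illustration only, not a proposal from the thread: the function name `graph_digest` and the choice of SHA-256 over sorted GFA segment and link records are assumptions. The sketch also ignores real complications, such as node renaming and the fact that a link and its reverse complement describe the same edge.

```python
import hashlib


def graph_digest(gfa_text: str) -> str:
    """Hypothetical graph identifier: SHA-256 over sorted S and L records.

    Only segment names, sequences, and link endpoints are hashed, so
    headers, paths, optional tags, and line order do not affect the result.
    """
    records = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # Segment: name and sequence.
            records.append(("S", fields[1], fields[2]))
        elif fields[0] == "L" and len(fields) >= 5:
            # Link: from, orientation, to, orientation (overlap ignored).
            records.append(("L", fields[1], fields[2], fields[3], fields[4]))
    digest = hashlib.sha256()
    for record in sorted(records):
        digest.update("\t".join(record).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


if __name__ == "__main__":
    gfa = "H\tVN:Z:1.0\nS\t1\tGAT\nS\t2\tT\nL\t1\t+\t2\t+\t0M\n"
    print(graph_digest(gfa))
```

A digest like this could then be written into a header tag of a GFA or alignment file, which is the storage question the post raises and which this sketch does not address.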

Jouni Sirén @jltsiren.bsky.social · Aug 28

We also need a way of specifying the correct reference for reconstructing the reads. That's not as easy with graphs as with linear sequences. For example, if you have aligned the reads to a subgraph (e.g. personalized graph), the supergraph (e.g. clipped graph) is also a valid reference. 6/n

Jouni Sirén @jltsiren.bsky.social · Aug 28

While working on GAF-base, I realized that GAF is not the file format I want to use. GAF prioritizes numerical statistics, while the information needed for reconstructing the read and the alignment is optional. In archival and variant calling, it should be the opposite. 5/n
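
The split between mandatory statistics and optional alignment details can be seen in a small parser. This is a minimal sketch, assuming the usual 12 mandatory tab-separated GAF columns followed by optional `name:type:value` tags. The names `GafRecord`, `parse_gaf_line`, and `can_reconstruct` are hypothetical, and treating the `cs` difference string as the tag needed for rebuilding the read is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GafRecord:
    query_name: str
    query_length: int
    query_start: int
    query_end: int
    strand: str
    path: str
    path_length: int
    path_start: int
    path_end: int
    matches: int
    block_length: int
    mapq: int
    tags: dict = field(default_factory=dict)


def parse_gaf_line(line: str) -> GafRecord:
    """Parse one GAF line into mandatory fields and optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated fields")
    tags = {}
    for item in fields[12:]:
        # Optional fields look like "cs:Z::50*ag:49"; split only twice
        # because the value itself may contain colons.
        name, value_type, value = item.split(":", 2)
        tags[name] = (value_type, value)
    return GafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4], fields[5], int(fields[6]), int(fields[7]),
        int(fields[8]), int(fields[9]), int(fields[10]), int(fields[11]),
        tags,
    )


def can_reconstruct(record: GafRecord) -> bool:
    """Assumption: the read can be rebuilt only if a difference string exists."""
    return "cs" in record.tags


if __name__ == "__main__":
    line = "read1\t100\t0\t100\t+\t>1>2>3\t150\t10\t110\t99\t100\t60\tcs:Z::50*ag:49"
    record = parse_gaf_line(line)
    print(record.path, record.mapq, can_reconstruct(record))
```

A record with only the 12 mandatory columns parses fine but fails `can_reconstruct`, which is the situation the post describes.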

Jouni Sirén @jltsiren.bsky.social · Aug 28

When used with GBZ-base, GAF-base allows extracting all reads overlapping with / contained in a given subgraph. Queries with 10 kbp subgraphs are effectively instantaneous with short reads, while taking a second or two with long reads. 4/n

Jouni Sirén @jltsiren.bsky.social · Aug 28

Recently, I started working on another database: GAF-base. It could be described as "hacky pangenome CRAM in SQLite". GAF-base works, at least with reads aligned with Giraffe, and file sizes are typically somewhere between BAM and CRAM. 3/n

Jouni Sirén @jltsiren.bsky.social · Aug 28

It has been useful for investigating various vg issues, and there are also some external users. Sequence Tube Map would be a nice application, but we are not there yet. 2/n

GitHub - vgteam/sequenceTubeMap: displays multiple genomic sequences in the form of a tube map

github.com

Jouni Sirén @jltsiren.bsky.social · Aug 28

GBZ-base has been a side project for me for a couple of years. It's basically a GBZ graph stored in SQLite instead of a custom file format. You can convert a GBZ graph to GBZ-base quickly and then extract subgraphs around nodes / reference positions on a laptop. 1/n

GitHub - jltsiren/gbz-base: Prototype for an immutable pangenome graph in SQLite

github.com
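
The idea of a graph in SQLite with subgraph extraction around a node can be sketched with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: the two-table schema, the function names `build_toy_db` and `extract_subgraph`, and the hop-count context are not taken from GBZ-base, which stores a GBZ graph with haplotype paths and node orientations.

```python
import sqlite3
from collections import deque


def build_toy_db(nodes, edges):
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, sequence TEXT NOT NULL)")
    conn.execute("CREATE TABLE edges (source INTEGER NOT NULL, target INTEGER NOT NULL)")
    conn.execute("CREATE INDEX edges_by_source ON edges (source)")
    conn.execute("CREATE INDEX edges_by_target ON edges (target)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", list(nodes.items()))
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    conn.commit()
    return conn


def extract_subgraph(conn, start, max_hops):
    """Breadth-first search from `start`, following edges in both directions.

    Returns a dict from node id to sequence for all nodes within `max_hops`.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if distance == max_hops:
            continue
        rows = conn.execute(
            "SELECT target FROM edges WHERE source = ? "
            "UNION SELECT source FROM edges WHERE target = ?",
            (node, node),
        )
        for (neighbor,) in rows:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    placeholders = ",".join("?" * len(seen))
    query = f"SELECT id, sequence FROM nodes WHERE id IN ({placeholders})"
    return dict(conn.execute(query, sorted(seen)))


if __name__ == "__main__":
    nodes = {1: "GAT", 2: "T", 3: "C", 4: "ACA", 5: "G"}
    edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
    conn = build_toy_db(nodes, edges)
    print(extract_subgraph(conn, 4, 1))
```

Because each step is an indexed lookup, the query only touches the neighborhood of the start node, which is what makes this kind of extraction plausible on a laptop.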

Reposted by Jouni Sirén

Rob Patro @robp.bsky.social · Aug 20

Last talk of the day (before posters) "Lossless Pangenome Indexing Using Tag Arrays" presented by Parsa Eskandar! #WABI25

Jouni Sirén @jltsiren.bsky.social · Aug 8

Reasoning about maximal paths is difficult, as they are global objects. Giovanni and Travis came up with an equivalent local property related to stable sorting. They called it Wheeler graphs, and that's when theoretical developments took off. 6/6

Jouni Sirén @jltsiren.bsky.social · Aug 8

BOSS was a parallel development for de Bruijn graphs. It inspired me to look into extending the GCSA from DAGs to more general graphs. We ended up with what is now known as Wheeler DFAs, characterized by non-overlapping lexicographic ranges of maximal path labels. 5/6

Jouni Sirén @jltsiren.bsky.social · Aug 8

That graph represents recombinations at aligned positions with a unique context after the position. By using a prefix-doubling algorithm, we can instead get a graph that represents recombinations at any aligned position. And that graph was GCSA. 4/6

Jouni Sirén @jltsiren.bsky.social · Aug 8

The analysis starts with a multiple sequence alignment and counts the number of runs of aligned suffixes in lexicographic order. If you collapse the runs into nodes, you get a graph that can be indexed with a slight extension of the XBWT. 3/6

Jouni Sirén @jltsiren.bsky.social · Aug 8

I contributed to a chapter on graph indexing before/beyond Wheeler graphs in Manzini's Festschrift. I wrote a bit on how the analysis of RLBWT under duplication and edits became GCSA, and how GCSA is related to Wheeler graphs. 2/6

Graph Indexing Beyond Wheeler Graphs

drops.dagstuhl.de

Jouni Sirén @jltsiren.bsky.social · Aug 8

There was a workshop on 25 years of the FM-index and the CSA after SEA. I would have liked to attend, but I had other commitments. The invited speakers were Giovanni Manzini and Roberto Grossi, as the other purpose of the workshop was to present them with Festschrifts for their 60th birthdays. 1/6

SEA 2025

regindex.github.io

Jouni Sirén @jltsiren.bsky.social · Aug 5

Maybe you can write useful specs after an extensible file format has become popular and evolved over time. But then it will be difficult to convince people to switch to the new format.

Jouni Sirén @jltsiren.bsky.social · Aug 5

Well-specified formats would be nice. But it's too likely that the specification is obsolete. Or it lacks necessary features. Or it just doesn't exist, because key people can't agree on the details. (Or all three, as with most pangenome file formats.)

Jouni Sirén @jltsiren.bsky.social · Aug 5

As long as we are talking about academic code, relying on someone else's libraries is risky. They are often abandoned as labs lose funding, people move on or leave academia, and so on. A serious binary format should have independent implementations maintained by different labs.

Jouni Sirén @jltsiren.bsky.social · Aug 5

I agree in principle. But text files with tab-separated fields are popular for a reason. The format is simple, you can extend it when the requirements change, you can write your own parser if necessary, and you can also use standard tools for many tasks.

Jouni Sirén @jltsiren.bsky.social · May 15

Surprisingly, we found that the BWT of 464 HPRC release 2 haplotypes has fewer runs than the BWT of 90 release 1 haplotypes. The quality of the assemblies has clearly improved.
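
For readers unfamiliar with the measure, the number of runs is the number of maximal blocks of equal characters in the BWT. The sketch below computes it on toy strings. It uses naive suffix sorting, which is an assumption chosen for brevity and is not how a BWT of hundreds of human haplotypes would be built; the function names `bwt` and `count_runs` are hypothetical.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via naive suffix sorting (toy inputs only)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the terminator")
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each suffix; s[-1] is '$'.
    return "".join(s[i - 1] for i in suffix_array)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


if __name__ == "__main__":
    single = "GATTACA"
    repeated = single * 8
    print(len(single), count_runs(bwt(single)))
    print(len(repeated), count_runs(bwt(repeated)))
```

Repeating a sequence makes the text eight times longer while the run count grows far more slowly, which is why similar haplotypes compress well and why assembly errors, which break the similarity, show up as extra runs.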

Jouni Sirén @jltsiren.bsky.social · May 15

This is still a work in progress, but we show that it's feasible to build the data structures for hundreds of human haplotypes, and that reporting the hits is fast for long enough matches.

Jouni Sirén @jltsiren.bsky.social · May 15

With tag arrays, we make the FM-index report graph positions instead of sequence positions. Unlike the suffix array, this can be run-length encoded effectively. And then we can quickly report all distinct hits in a BWT interval.
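
A toy version of this idea is sketched below, under several simplifying assumptions: a tag is a `(node id, offset)` pair, suffixes are sorted naively, and the BWT interval is found by a linear scan standing in for FM-index backward search. The function names are hypothetical and the encoding is not the one used in the preprint.

```python
def build_tag_array(nodes, paths):
    """Concatenate haplotype paths and tag each suffix with its graph position."""
    text, tags = [], []
    for path in paths:
        for node_id in path:
            for offset, base in enumerate(nodes[node_id]):
                text.append(base)
                tags.append((node_id, offset))
        text.append("$")   # haplotype terminator
        tags.append(None)  # terminators have no graph position
    text = "".join(text)
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, suffix_array, [tags[i] for i in suffix_array]


def run_length_encode(values):
    """Collapse consecutive equal values into [value, length] runs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def find_interval(text, suffix_array, pattern):
    """Half-open range of suffixes starting with `pattern` (naive scan)."""
    ranks = [r for r, i in enumerate(suffix_array) if text.startswith(pattern, i)]
    return (ranks[0], ranks[-1] + 1) if ranks else (0, 0)


def distinct_hits(runs, start, end):
    """Distinct tags among the runs overlapping the interval [start, end)."""
    hits, position = set(), 0
    for tag, length in runs:
        if position >= end:
            break
        if position + length > start and tag is not None:
            hits.add(tag)
        position += length
    return hits


if __name__ == "__main__":
    nodes = {1: "GAT", 2: "T", 3: "C", 4: "ACA"}
    paths = [[1, 2, 4], [1, 3, 4]]  # GATTACA and GATCACA
    text, sa, tag_array = build_tag_array(nodes, paths)
    runs = run_length_encode(tag_array)
    start, end = find_interval(text, sa, "CA")
    print(len(tag_array), len(runs), sorted(distinct_hits(runs, start, end)))
```

Both haplotypes contain `GAT` at the same graph position, so the two matching suffixes share one tag and the hit is reported once. Suffixes that are adjacent in lexicographic order and start at the same graph position form a single run, which is where the compression comes from in this toy setting.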

Jouni Sirén @jltsiren.bsky.social · May 15

An FM-index on its own reports the same hits in each haplotype. K-mer indexes make a fixed trade-off between sensitivity and specificity. GCSA and Wheeler graphs are often difficult to build for the graph we have.

Jouni Sirén @jltsiren.bsky.social · May 15

A new preprint on indexing pangenome graphs using an FM-index of the haplotypes and a tag array. Joint work with Parsa Eskandar and @benedictpaten.bsky.social.

Lossless Pangenome Indexing Using Tag Arrays

Pangenome graphs represent the genomic variation by encoding multiple haplotypes within a unified graph structure. However, efficient and lossless indexing of such structures remains challenging due t...

www.biorxiv.org
