Lightnews — Scholar-powered news

Reposted by Sina Majidian

JHU Computer Science @jhucompsci.bsky.social · 46m

10 new CS professors! 🥳

@anandbhattad.bsky.social @uthsav.bsky.social @gligoric.bsky.social @murat-kocaoglu.bsky.social @tiziano.bsky.social

hopkinsdsai.bsky.social @hopkinsdsai.bsky.social · Aug 25

#HopkinsDSAI welcomes 22 new faculty members, who join more than 150 DSAI faculty members across @jhu.edu in advancing the study of data science, machine learning, and #AI and translation to a range of critical and emerging fields.

ai.jhu.edu/news/data-sc...

2 1

Reposted by Sina Majidian

Rob Patro @robp.bsky.social · 2h

Have you recently completed (or are you about to finish) a PhD in CS or a related discipline? Do you want to do research advancing the theory & practice of algorithmic genomics & build tools that people love to use? I'll be looking to hire a postdoc! Official ad coming soon:
docs.google.com/document/d/1...

Postdoc Description.docx

Title: Postdoctoral Associate Summary statement: The postdoctoral research associate is responsible for developing novel computational methodology for high-throughput sequence genomics tasks, as well ...

docs.google.com

7 4

Sina Majidian @sinamajidian.bsky.social · 21h

Genomics in Context Awards: collaborative research at the intersection of genomics, humanities, social sciences and bioethics
wellcome.org/research-fun...
Teams must include >1 researcher from life sciences
& >1 researcher from humanities, social sciences and bioethics

Genomics in Context Awards - Research Funding | Wellcome

These awards will support transdisciplinary teams to catalyse research discoveries at the intersection of genomics, humanities, social sciences and bioethics.

wellcome.org

1

Reposted by Sina Majidian

Ben Langmead @benlangmead.bsky.social · 1d

I've added 7 videos to my Burrows-Wheeler indexing playlist (www.youtube.com/playlist?lis...), rounding out the r-index series and adding a 5-part series on the move structure. Now 27 videos in that playlist. I aim to add videos on prefix-free parsing, PBWT, Wheeler languages/automata in the future.

Burrows-Wheeler Indexing - YouTube

Videos on : (a) the Burrows-Wheeler Transform (BWT), (b) the FM Index, which uses the BWT to construct a full-text index, (c) Wheeler graphs, (d) r-index, an...

www.youtube.com

2 16 55

Reposted by Sina Majidian

Rosa Fernández @rosafernandez.bsky.social · 1d

How did animals repeatedly conquer land? 🌊➡️⛰️ We analysed ~1,000 gene repertoires (24M genes!) from all animal phyla to uncover how this happened. Work led by @gemmaeling.bsky.social & Klara Eleftheriadi, both first coauthors of this titanic effort!
www.biorxiv.org/content/10.1...

Independent genomic trajectories shape adaptation to life on land across animal lineages

How animals repeatedly adapted to life on land is a central question in evolutionary biology. While terrestrialisation occurred independently across animal phyla, it remains unclear whether shared gen...

www.biorxiv.org

1 13 31

Reposted by Sina Majidian

AnniZLab: [🦠, 🧬 , ✨] @annizlab.bsky.social · Aug 30

Our new tool "X-Mapper: fast and accurate sequence alignment via gapped x-mers" is now published in Genome Biology! Please try it if you work on DNA sequences :) github.com/mathjeff/Map...
genomebiology.biomedcentral.com/articles/10....

X-Mapper: fast and accurate sequence alignment via gapped x-mers - Genome Biology

Sequence alignment is foundational to many bioinformatic analyses. Many aligners start by splitting sequences into contiguous, fixed-length seeds, called k-mers. Alignment is faster with longer, uniqu...

genomebiology.biomedcentral.com

1 4 5

Reposted by Sina Majidian

Paul Carini @uncultured.carinilab.com · 15d

What are folks using for calling genes these days in isolate genomes: PGAP, Bakta, or Prokka? This is for a 70% GC genome of a very novel lineage.

7 7 10

Sina Majidian @sinamajidian.bsky.social · 4d

Advances in haplotype phasing and genotype imputation
Quan Sun & Yun Li
Nature Reviews Genetics 2025
www.nature.com/articles/s41...

a, A conceptual illustration of phasing. After read alignment with reference genome, we can infer or call genotypes of target individuals, but phase information (that is, information about which alleles are inherited together on the same parental chromosome) is unknown. Phasing is the process to make such inference starting from unphased genotype data.

b, A conceptual illustration of imputation from array genotype data. Imputation is the process to infer genotypes at untyped markers with the aid of reference panels. Heuristically, it identifies haplotype segments in reference panels that match genotypes at typed markers for imputation of target individuals and then imputes by simply copying over the shared segments. In the right panel (after imputation), imputed genotypes at untyped markers for the target sample are denoted with lower-case letters, with the colour representing the corresponding reference haplotype from which the alleles are copied.

c, A timeline of recent major developments in phasing and imputation, which begins from the introduction of positional Burrows–Wheeler transform (PBWT), a highly efficient method for haplotype representation that paved the road for more recent phasing and imputation methods focusing on computational improvements. A timeline of earlier evolvement (before 2018) is detailed in ref. 77. lcWGS, low-coverage whole genome sequencing; LRS, long-read sequencing.
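The copy-based heuristic described in panel b of the caption above can be illustrated with a toy sketch. This is not the method of any tool covered in the review: it assumes a single haploid target, copies from one whole best-matching reference haplotype rather than from shared segments, and uses a hypothetical function name (`impute_by_copying`) and made-up toy data. Imputed alleles are written in lower case, following the caption's convention.

```python
from typing import Dict, List


def impute_by_copying(reference_panel: List[str], typed: Dict[int, str]) -> str:
    """Impute one haploid target from a reference panel (toy illustration).

    reference_panel: haplotypes as equal-length strings, one character per marker.
    typed: marker index -> allele observed in the target (typed markers only).
    """
    if not reference_panel:
        raise ValueError("reference panel is empty")

    def mismatches(haplotype: str) -> int:
        # Count typed markers where this reference haplotype disagrees with the target.
        return sum(haplotype[pos] != allele for pos, allele in typed.items())

    # Assumption: pick one whole haplotype; real tools copy from matching segments.
    best = min(reference_panel, key=mismatches)

    # Keep observed alleles as-is; copy untyped markers from the best match in lower case.
    return "".join(
        typed[i] if i in typed else best[i].lower()
        for i in range(len(best))
    )


panel = ["ACGTA", "ACCTG", "TCGAA"]   # made-up reference haplotypes
observed = {0: "A", 2: "C", 4: "G"}   # made-up typed markers of the target
print(impute_by_copying(panel, observed))  # AcCtG
```

Here the second panel haplotype matches all three typed markers, so the two untyped positions are copied from it.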

4

Reposted by Sina Majidian

Ana Conesa @anaconesa.bsky.social · 6d

Looking for scientists working with long-read transcriptomics technologies to join a COST action proposal. Contact us!!! @nanoporetech.com @pacbio.bsky.social

6 6

Sina Majidian @sinamajidian.bsky.social · 6d

CADD: predicting the deleteriousness of variants throughout the human genome, 2019, NAR
doi.org/10.1093/nar/...

CADD v1.7, 2024, NAR
doi.org/10.1093/nar/...

Figure 1. The CADD framework.

(A) Training a CADD model requires the identification of variants that are fixed or nearly fixed in human populations, but are absent in the inferred genome sequence of the human-ape ancestor (proxy-neutral variants). The sequence composition of this variant set is used to draw a matching set of proxy-deleterious variants. Using more than 60 diverse annotations, a machine learning model is trained to classify variants as proxy-neutral versus proxy-deleterious. All potential SNVs of the human reference genome are annotated using the same features, and raw CADD scores are calculated. A PHRED conversion table is derived from the relative ranking of these model scores.

(B) Users provide variant sets in VCF, and CADD uses the chromosome, position, reference allele and alternative allele columns from these files. Scores are either retrieved from pre-scored files, or else variants are fully annotated and the CADD score is calculated. The PHRED-scaled score is then looked up in the conversion table, and both scores returned to the user. Users may request output files containing variant annotations.
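As a rough illustration of the rank-based PHRED conversion mentioned in the caption above, the sketch below builds a toy conversion table. The caption only says the table is derived from the relative ranking of raw scores; the formula used here, -10 * log10(rank / total), is the usual PHRED form and is an assumption of this sketch. The function names (`phred_scale`, `build_conversion_table`) are hypothetical and are not part of the CADD software.

```python
import math
from typing import Dict, List


def phred_scale(rank: int, total: int) -> float:
    """PHRED-like score for a variant ranked `rank` (1 = highest raw score) out of `total`."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


def build_conversion_table(raw_scores: List[float]) -> Dict[float, float]:
    """Map each raw score to a PHRED-like score based on its rank.

    Assumption: a higher raw score means more likely deleterious, and tied
    scores share the best (smallest) rank among them.
    """
    total = len(raw_scores)
    table: Dict[float, float] = {}
    for rank, score in enumerate(sorted(raw_scores, reverse=True), start=1):
        table.setdefault(score, phred_scale(rank, total))
    return table


# Toy data: 1000 made-up raw scores, 0..999.
table = build_conversion_table(list(range(1000)))
print(round(table[999], 6))  # top 0.1% of scores -> 30.0
print(round(table[900], 6))  # top 10% of scores  -> 10.0
```

Under this scaling, a score of 10 corresponds to the top 10% of ranked variants, 20 to the top 1%, and 30 to the top 0.1%.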

2

Reposted by Sina Majidian

Xian Chang @xian-chang.bsky.social · 6d

🦒Long read giraffe is out!🦒
Mapping long reads to pangenome graphs is ~10x faster than with GraphAligner, with veeery slightly better mapping accuracy, short variant calling, and SV genotyping than GraphAligner or Minimap2

bioRxiv Bioinfo @biorxiv-bioinfo.bsky.social · 6d

Rapid, accurate long- and short-read mapping to large pangenome graphs with vg Giraffe https://www.biorxiv.org/content/10.1101/2025.09.29.678807v1

1 22 41

Reposted by Sina Majidian

Tami Lieberman @contaminatedsci.bsky.social · 8d

Precisely calling mutations across hundreds of bacterial isolates has been hard, requiring manual filtering and expertise.

Until now, using AccuSNV.

Herui Liao trained an ML model based on our previous meticulously called SNVs.
www.biorxiv.org/content/10.1...

High-accuracy SNV calling for bacterial isolates using deep learning with AccuSNV

Accurate detection of mutations within bacterial species is critical for fundamental studies of microbial evolution, reconstructing transmission events, and identifying antimicrobial resistance mutati...

www.biorxiv.org

2 33 70

Reposted by Sina Majidian

Marnix Medema @marnixmedema.bsky.social · 9d

Very important initiative! This could really help facilitate increasing data sharing as well as appropriate attribution of data creation.

Alex Probst @alexjprobst.bsky.social · 11d

New article on equitable reuse of public sequencing data, published in @natmicrobiol.nature.com!
Led by the Data reuse core team @lhug.bsky.social @environmicrobio.bsky.social Cristina Moraru, @geomicrosoares.bsky.social, @folker.bsky.social and with Anke Heyer and The Data Reuse Consortium!

1

Reposted by Sina Majidian

👻Sewall Fright🎃 @stairwaytokevin.bsky.social · 10d

Whole-genome alignments revealed pennycress has nearly dichotomous genome compartmentalization: huge gene-poor pericentromeric regions (~300Mb; <1% genic) with frequent rearrangements and highly syntenic gene-rich chromosome arms (~150Mb; ~20% genic). What we call a "two-speed" genome structure. 3/

Figure 3 | Macrosynteny and genome structure across the Brassicaceae. Horizontal blue/black/orange bands represent the chromosomes of Arabidopsis thaliana, A. lyrata, MN106, and Brassica rapa (top to bottom). Chromosomes are ordered by their number from left to right. Colors represent genomic content binned hierarchically in sliding windows (400kb-overlapping 500kb) as follows: (1) within a gene annotation (including intron and UTR, orange), (2) within EDTA-annotated repeats categorized as Ty3, (3) Ty1 (copia), (4) within another repeat category, or (5) un-annotated. Grey bands are sequence-based syntenic blocks between each pair of genomes. Pennycress and B. rapa are phylogenetically proximate (both in Brassicodae supertribe), but have reduced synteny in part because of genome reshuffling in B. rapa following a whole-genome triplication event. The seven pennycress genome assemblies (horizontal bars) are binned into TRASH-defined centromeres (orange), pericentromeres (dark blue), chromosome arms (light blue) and telomeres (dark red). The colors along the chromosome segments scale physically with the size of the bin, except that centromeres and telomeres have a 1pt buffer to make it easier to see these typically small regions. Each genome is connected to its neighbor by grey polygons that represent sequence-based syntenic blocks. Plots, genomic bins, and syntenic blocks were built with DEEPSPACE (github.com/jtlovell/DEEPSPACE).

1 6 11

Reposted by Sina Majidian

Dorottya Nagy @dotnagy.bsky.social · 13d

Pleased to see this pre-printed, highlighting the completeness/accuracy of @nanoporetech.com long-read genome assembly for clinical Enterobacterales: www.biorxiv.org/content/10.1...

Thanks to colleagues @modmedmicro.bsky.social, @ukhsa.bsky.social, @genewiz.bsky.social and @oxfordbrc.bsky.social!

1 12 10

Reposted by Sina Majidian

RECOMB Conference Series @recombconf.bsky.social · 12d

#RECOMB2026 will be in Thessaloniki, Greece on May 26-29, 2026. Satellites on May 24-25. Save the date!

The #RECOMB2026 conference will take place in Thessaloniki on May 26-29, 2026. The satellite events will be held on May 24-25, 2026. Save the date!

13 20

Sina Majidian @sinamajidian.bsky.social · 13d

NCBI Orthologs
link.springer.com/article/10.1...
Journal of Molecular Evolution
Special Issue: Quest for Orthologs

NCBI Orthologs: Public Resource and Scalable Method for Computing High-Precision Orthologs Across Eukaryotic Genomes - Journal of Molecular Evolution

Orthologs are fundamental for enabling comparative genomics analyses that further our understanding of eukaryotic biology. The unprecedented increase in the availability of high-quality eukaryotic genomes necessitates scalable and accurate methods for orthology inference. The National Center for Biotechnology Information (NCBI) developed “NCBI Orthologs”, a resource and a computational pipeline designed to meet this challenge within the NCBI RefSeq framework. This system integrates protein similarity, nucleotide alignment, and microsynteny to achieve high-precision ortholog assignments across diverse eukaryotes. The pipeline leverages high-quality RefSeq annotations and processes genomes individually, ensuring scalability. Resulting ortholog data, organized into gene-level anchored sets, enables propagation of functional annotation information and facilitates comparative genomics. Critically, these data are integrated into the NCBI Gene resource, providing users with access from various entry points. The NCBI Datasets resource provides an intuitive interface to explore orthologous relationships on the web and allows bulk data download via the web, command-line tools, and an API. We detail the methodology, including anchor species selection and the decision tree used to arrive at high-confidence one-to-one orthology relationships. NCBI Orthologs is a valuable resource for facilitating functional annotation efforts and enhancing our understanding of eukaryotic gene evolution.

link.springer.com

4

Reposted by Sina Majidian

Michael Love @mikelove.bsky.social · 13d

Review article from Quan Sun and Yun Li at UNC Genetics and Biostatistics

Nature Reviews Genetics @natrevgenet.nature.com · 14d

New online! Advances in haplotype phasing and genotype imputation

Advances in haplotype phasing and genotype imputation

Nature Reviews Genetics, Published online: 24 September 2025; doi:10.1038/s41576-025-00895-2. Haplotype phasing and genotype imputation improve genomic analyses by determining which variants occur together on a chromosome and inferring unobserved variants, respectively. In this Review, Sun and Li describe how tools for haplotype phasing and genotype imputation have evolved to accommodate increasingly larger genomic datasets and new sequencing technologies.

www.nature.com

2 10

Reposted by Sina Majidian

Arnau Sebé-Pedrós @arnausebe.bsky.social · 14d

Happy to share the Biodiversity Cell Atlas white paper, out today in @nature.com. We look at the possibilities, challenges, and potential impacts of molecularly mapping cells across the tree of life.
www.nature.com/articles/s41...

2 110 220

Sina Majidian @sinamajidian.bsky.social · 14d

Also confirmed by another study:
"This suggests that even random DNA sequences can provide enough structure for models to learn generalizable biological signals, consistent with the performance of randomly pre-trained models reported by Zhang et al."
www.biorxiv.org/content/10.1...

Interpreting Attention Mechanisms in Genomic Transformer Models: A Framework for Biological Insights

Transformer models have shown strong performance on biological sequence prediction tasks, but the interpretability of their internal mechanisms remains underexplored. Given their application in biomed...

www.biorxiv.org

Sina Majidian @sinamajidian.bsky.social · 14d

Very surprising to me: "...using the k-mer embeddings pre-trained on random data can yield similar performance in downstream tasks, when compared with those using the k-mer embeddings pre-trained on real biological sequences."
academic.oup.com/bioinformati...

Investigation of the BERT model on nucleotide sequences with non-standard pre-training and evaluation of different k-mer embeddings

Abstract. Motivation: In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notabl

academic.oup.com

2 7

Reposted by Sina Majidian

Adam Phillippy @aphillippy.bsky.social · 16d

Delighted to finally announce a preprint describing the Q100 project! “A complete diploid human genome benchmark for personalized genomics”, for which we finished HG002 to near-perfect accuracy: www.biorxiv.org/content/10.1... 🧵[1/14]

A complete diploid human genome benchmark for personalized genomics

Human genome resequencing typically involves mapping reads to a reference genome to call variants; however, this approach suffers from both technical and reference biases, leaving many duplicated and ...

www.biorxiv.org

4 57 96

Reposted by Sina Majidian

Marnix Medema @marnixmedema.bsky.social · 16d

New preprint out by #RobertKoetsier, the first of his PhD project, on assessing the use of cross-species coexpression analysis to identify primary and secondary metabolic interactions in microbiomes: www.biorxiv.org/content/10.1...

Using cross-species co-expression to predict metabolic interactions in microbiomes

In microbial ecosystems, metabolic interactions are key determinants of species’ relative abundance and activity. Given the immense number of possible interactions in microbial communities, their expe...

www.biorxiv.org

1 13 24

Sina Majidian @sinamajidian.bsky.social · 16d

oh sorry, that's right, thanks for your interest!

Sina Majidian @sinamajidian.bsky.social · 17d

EvANI benchmarking workflow for evolutionary distance estimation academic.oup.com/bib/article/...

Great teamwork by @mohsenzakeri.bsky.social, @stephenhwang.bsky.social and me, with the excellent mentorship of @benlangmead.bsky.social

EvANI benchmarking workflow for evolutionary distance estimation

Abstract. Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These make it possible to compare gen

academic.oup.com

1 3 11