Arun Das
@arun-das.bsky.social
Postdoc at @genomescience.bsky.social. Scientist working in computer science and genomics. arundas.org PhD from Schatz Lab @ JHU. Previously: CS @ Brown '18. He/His/Him. #YNWA 🍉
Pinned
arun-das.bsky.social
Our pre-print on investigating variation in South Asian genomes is now out!

Thank you to @mikeschatz.bsky.social, @rajivmccoy.bsky.social and @aabiddanda.bsky.social for all their work on this.

🧵 A thread on the key results and takeaways from our work:
biorxiv-genomic.bsky.social
Assembling unmapped reads reveals hidden variation in South Asian genomes https://www.biorxiv.org/content/10.1101/2025.05.14.653340v1
arun-das.bsky.social
I just cannot see academic institutions, medical facilities and so many other employers stumping up $100K.

It’s disheartening, both to see this happen and to see how few people who rely on and employ individuals on H-1B visas are willing to speak up and address this. Your silence is deafening.
arun-das.bsky.social
Make no mistake, this remains catastrophic.

It makes it near impossible for people to stay and work in the US, no matter how qualified they are, unless they work for a handful of extremely wealthy companies.

For so many of us, it’s time to make other plans.
arun-das.bsky.social
Small victories, but this doesn’t seem to apply to those currently on an H-1B visa.

Wish it was made clear in the initial “proclamation”, before we spent the entire day panicking while trying to figure out a way to get a friend back to the US before midnight.
arun-das.bsky.social
This is catastrophic.

So, so many people I know and love are going to find it impossible to stay and work in the US, and it makes it almost impossible for people like me to stay and work here in the longer term, no matter how qualified we are.
Reposted by Arun Das
New blog post – A quick look at Roche's SBX
lh3.github.io/2025/09/11/a...
arun-das.bsky.social
The concessions made by Brown endanger so many members of our community on campus, limit the access to higher education for individuals from underrepresented backgrounds, and undermine the serious conversations about major issues on campus.

Just so disappointing to see them go down without a fight.
arun-das.bsky.social
Brown have agreed to do all this without any sort of legal challenge, and have agreed to these terms without consulting their staff, students or alumni.

Here's a link to the article that should bypass the paywall, if you wanted to read it for yourself. www.nytimes.com/2025/07/30/u...
Brown University Makes a Deal With the White House to Restore Funding
www.nytimes.com
arun-das.bsky.social
In short, all funding is restored and active cases are dismissed in exchange for a $50M commitment to state work force development, new compliance with the administration's discriminatory policies on transgender individuals, and a slew of "anti-DEI" admissions policies. No admission of any wrongdoing.
arun-das.bsky.social
Extremely disappointed in my alma mater, who have chosen to fold without a fight and endanger the most vulnerable members of our community instead of standing up for them.

Spineless and shameful.
nytimes.com
Breaking News: Brown University was said to have reached a deal with the Trump administration to restore federal funding. nyti.ms/459nQzY
Brown University's campus. A headline reads: "Brown University Makes A Deal With the White House to Restore Funding." Photo credit: Ian MacLellan for The New York Times.
Reposted by Arun Das
vikramshivakumar.bsky.social
Excited to share a new update to Mumemto, scaling MUM and conserved element finding to any size pangenome! Preprint out now w/ @benlangmead.bsky.social.
Mumemto scales to the new HPRC v2 release and beyond, and can merge in future assemblies without any recomputation! 1/n
Partitioned Multi-MUM finding for scalable pangenomics
Pangenome collections are growing to hundreds of high-quality genomes. This necessitates scalable methods for constructing pangenome alignments that can incorporate newly-sequenced assemblies. We prev...
www.biorxiv.org
arun-das.bsky.social
Easily the most important thing happening next week.

Come and watch my friend Sara defend her PhD!
saracarioscia.bsky.social
I'm defending my PhD next Friday, May 23!(!!!!). I'll be highlighting our work looking at aneuploidy in early human development. If you're interested I'd love to have you join via Zoom (DM me for info) or on the Homewood campus!
arun-das.bsky.social
Thank you! It was awesome to talk to you too, and to learn about all the cool data and insights from your project!
Reposted by Arun Das
aabiddanda.github.io
Really cool work from @arun-das.bsky.social on recovering sequence from unmapped reads (even with T2T reference or HPRC pangenomes!). Can recover a decent amount of sequence per individual using these approaches. Check it out!
Reposted by Arun Das
rajivmccoy.bsky.social
@arun-das.bsky.social's thesis research demonstrates that short-read mapping-based approaches, even using complete linear (T2T-CHM13) and pangenome (HPRC) references, miss a lot of variation that can be recovered from unmapped reads.
arun-das.bsky.social
Finally, we compare our placed contigs to loci associated with biomarker traits in the UK Biobank and East London Genes & Health Dataset, and find a number of positions where a placed contig is close to a significant locus.
Comparison of the significant loci associated with HDL cholesterol in the South Asian set in the UK Biobank to that of the East London Genes and Health cohort, alongside our placed contigs against GRCh38.
arun-das.bsky.social
We are also able to align existing RNA-Seq data from 140 SAS individuals from MAGE directly to these contigs, allowing us to identify 200 contigs with a high density of RNA-Seq alignments.

BLAST shows that these contigs are highly similar to non-reference human and primate sequences.
Plot showing the distribution of the most aligned-to contigs in our RNA-seq contigs, against the length of the contigs and colored by their population. The contigs vary widely in terms of their RNA-seq alignment density and their lengths.
arun-das.bsky.social
We show that the majority of the placements we make are missed by traditional insertion calling tools, but are in line with those from tools built specifically to detect large non-reference sequences.

For the unplaced contigs, BLAST shows that the majority have high similarity to non-reference human and primate sequences.
Comparison of our insertions to Manta. We find an order of magnitude more large insertions than this tool, and comparable amounts to existing large insertion callers.
arun-das.bsky.social
We are able to place ~20K contigs against CHM13 through a combination of alignment, mate pair read information and LD.

We find >8,000 instances of a placed contig intersecting one of 106 protein coding genes, and >6,000 placements within 1 Kb of a known GWAS site.
Visualization of placements throughout CHM13. We place contigs all over the genome. Plot of the number of unique gene intersections per chromosome. We see the number of intersections are largely correlated with chromosome size. Key genes we find intersected with contigs.
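For readers curious what a "within 1 Kb of a known GWAS site" check can look like, here is a minimal sketch. It is not the code used in the pre-print: the function name `near_any_site`, the single-chromosome simplification, and the example coordinates are all assumptions made for illustration.

```python
from bisect import bisect_left

def near_any_site(position, sorted_sites, window=1000):
    """Return True if any site lies within `window` bp of `position`.

    Hypothetical helper: `sorted_sites` is an ascending list of site
    coordinates on ONE chromosome; `window` defaults to 1 Kb.
    """
    # Index of the first site at or after the left edge of the window.
    i = bisect_left(sorted_sites, position - window)
    # That site is a hit only if it also falls before the right edge.
    return i < len(sorted_sites) and sorted_sites[i] <= position + window

# Made-up coordinates, for illustration only.
sites = [5000, 20000]
print(near_any_site(5900, sites))
print(near_any_site(7000, sites))
```

In practice the same check would be run per chromosome, with one sorted list of sites for each.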
arun-das.bsky.social
We validate >80% of these contigs in a subset of 21 SAS individuals using auxiliary long read data.

We repeat the linear pipeline with the HPRC v1 draft pangenomes, and see further improvements in alignment but only small reductions in the amount of assembled sequence.
Plot of the fraction of assembled contigs from each of 21 SAS individuals that are validated by their long read assembly; ~85% of the contigs per individual are validated. Plot of the amount of assembled sequence per individual across two linear and two pangenome references. Massive reductions in sequence are seen as we go from GRCh38 to CHM13, but the HPRC pangenomes offer only a slight improvement after that.
arun-das.bsky.social
Despite improvements in alignment compared to GRCh38, we assemble ~600 Kb of sequence in >1 Kb contigs per individual from unmapped reads against T2T-CHM13.

Across the whole set, we assemble 410 Mb of sequence in 199K contigs (which collapses down to 50 Mb when accounting for shared sequence).
Comparison of alignment rate against GRCh38 and CHM13. We see a 0.5-1% improvement against CHM13. Histogram of amount of assembled sequence across 640 SAS individuals. We assemble on average 550-600 Kb per individual from unmapped reads.
arun-das.bsky.social
To do this, we align existing short read data from 640 South Asian (SAS) individuals from 1KGP and SGDP against linear & pangenome references, and assemble the unmapped reads into large contigs.

We then attempt to analyze the functional impact of these sequences.
Our analysis pipeline, which consists of 1) Aligning existing reads against reference genomes, 2) assembling unaligned or poorly aligned reads, 3) placing the large assembled contigs back into the reference, 4) calling variants and novel sequence, and 5) evaluating the functional impact of this variation. Source of our data. 601 individuals come from 5 1KGP populations, 39 come from 19 SGDP populations.
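As a rough illustration of step 2 of the pipeline (collecting unaligned or poorly aligned reads), here is a small sketch that picks such reads out of SAM-format text. The thread does not say which tools or thresholds were used, so the function name `unmapped_read_names` and the optional `min_mapq` cutoff are assumptions; the only fixed facts are that SAM flag bit 0x4 marks an unmapped read and that MAPQ is the fifth column.

```python
UNMAPPED = 0x4  # SAM flag bit: this read is unmapped

def unmapped_read_names(sam_lines, min_mapq=None):
    """Return names of reads that are unmapped or, optionally, poorly aligned.

    Hypothetical helper: `min_mapq` is an illustrative cutoff, not a value
    taken from the pre-print. Leave it as None to keep only unmapped reads.
    """
    names = []
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        poorly_aligned = min_mapq is not None and mapq < min_mapq
        if flag & UNMAPPED or poorly_aligned:
            names.append(name)
    return names

# Made-up records, for illustration only.
sam = [
    "@HD\tVN:1.6",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t200\t3\t4M\t*\t0\t0\tACGT\tIIII",
]
print(unmapped_read_names(sam))
print(unmapped_read_names(sam, min_mapq=10))
```

The selected reads would then be handed to an assembler; that step is not shown here.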
arun-das.bsky.social
South Asians are severely underrepresented in genomics, and this lack of representation makes it difficult to catalog and understand the variation present in these communities.

Our goal was to investigate the variation present in these populations that is missing in widely used reference genomes.
Comparison of the fraction of individuals in the GWAS catalog of different ancestries to the breakdown of the global population in terms of those ancestries. South Asians accounted for 2% of the GWAS catalog in 2019, but for >25% of the global population.