John F Wu
@jwuphysics.bsky.social
3.6K followers 620 following 960 posts
Tenure-track astronomer at STScI/JHU working on galaxies, machine learning, and AI for scientific discovery. Opinions my own. He/him. Website: https://jwuphysics.github.io/
jwuphysics.bsky.social
Catch us in Montreal!
jhucompsci.bsky.social
In “Rank1: Test-Time Compute for Reranking in Information Retrieval,” @orionweller.bsky.social, Kathryn Ricci, Eugene Yang, Andrew Yates, Dawn Lawrie, and Benjamin Van Durme introduce the first reranking model trained to take advantage of test-time 🕰️ compute: (2/3)
Rank1: Test-Time Compute for Reranking in Information Retrieval
We introduce Rank1, the first reranking model trained to take advantage of test-time compute. Rank1 demonstrates the applicability within retrieval of using a reasoning language model (i.e. OpenAI's o...
arxiv.org
Reposted by John F Wu
jessyjli.bsky.social
Here is a genuine one :) CosmicAI’s AstroVisBench, to appear at #NeurIPS

bsky.app/profile/nsfs...
nsfsimonscosmicai.bsky.social
Exciting news! Introducing AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy!

A new benchmark developed by researchers at the NSF-Simons AI Institute for Cosmic Origins is testing how well LLMs implement scientific workflows in astronomy and visualize results.
jwuphysics.bsky.social
Okay I guess I should be more fair. This isn't the worst offender, but I'm still not a fan: it misses loads of relevant citations, doesn't release the benchmark, its example questions are meh (see MIRI question in Fig 1), and multiple choice is known to be bad (see e.g. arxiv.org/abs/2507.02856)
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from ...
arxiv.org
jwuphysics.bsky.social
Fantastic work on the size–mass relation for low-mass galaxies, led by Yasmeen (@yasmeenasali.bsky.social)!

arxiv.org/abs/2509.25335

🔭🌌🧪
A sneak preview of Figure 6 in the paper, which shows size (r-band radius) vs stellar mass for SAGA satellites, SAGA background galaxies, and SDSS isolated galaxies. They all obey the same trends but have small offsets, which appear unlikely to be driven by SFR but *do* seem to be driven by environment!
jwuphysics.bsky.social
Any mutuals going to be in Montréal next week? Give me a shout if so!

I'll be attending COLM and visiting UdeM, Ciela, and Mila, and presenting on various topics spanning ML applications in galaxy evolution to interpretable AI for scientific discovery.
jwuphysics.bsky.social
Anyone else going to COLM? Give me a shout!

Also, check out our poster on evaluating LLMs for astronomy research. This work came out of our 2024 JSALT research and was jointly led by undergrads Alina Hyk and Kiera McCormick!
Screenshot of our abstract from the COLM schedule page, printed below

Thursday, October 9th

Title: From Queries to Criteria: Understanding How Astronomers Evaluate LLMs

11:00 AM – 1:00 PM
710

Authors: Alina Hyk, Kiera McCormick, Mian Zhong, Ioana Ciucă, Sanjib Sharma, John F Wu, J. E. G. Peek, Kartheik G. Iyer, Ziang Xiao, Anjalie Field

Abstract
There is growing interest in leveraging LLMs to aid in astronomy and other scientific research, but benchmarks for LLM evaluation in general have not kept pace with the increasingly diverse ways that real people evaluate and use these models. In this study, we seek to improve evaluation procedures by building an understanding of how users evaluate LLMs. We focus on a particular use case: an LLM-powered retrieval-augmented generation bot for engaging with astronomical literature, which we deployed via Slack. Our inductive coding of 368 queries to the bot over four weeks and our follow-up interviews with 11 astronomers reveal how humans evaluated this system, including the types of questions asked and the criteria for judging responses. We synthesize our findings into concrete recommendations for building better benchmarks, which we then employ in constructing a sample benchmark for evaluating LLMs for astronomy. Overall, our work offers ways to improve LLM evaluation and ultimately usability, particularly for use in scientific research.
jwuphysics.bsky.social
I assume the submission is 9 pp and then the camera ready is 10 pp. Strange that they wrote "submission version" every time...
jwuphysics.bsky.social
Thanks Baltimore DOT for penalizing 40+ mph speeders more harshly, but by that point shouldn't you be revoking their driving license?
Baltimore's new tiered speeding fine structure: $40 (<15 mph over), $70 (16-19 mph), $120 (20-29 mph), $230 (30-39 mph), and $425 (40+ mph)
Reposted by John F Wu
heywritergrace.bsky.social
*Girl who's excited for @baltimorebeat.bsky.social's FIRST food issue, coming tomorrow* 🍔🌭🌮🍕
jwuphysics.bsky.social
They really don't pay you guys enough to be subjected to the disappointment that is eating at Chipotle in Baltimore

Also 27 points for the one in Mt Vernon?! They haven't once fulfilled my order correctly or had all menu items in stock.
jwuphysics.bsky.social
Tbf Lamar usually doesn't throw it away, and he somehow makes magic out of it. Not this time...
jwuphysics.bsky.social
Didn't seem like the o line had any idea who their blocking assignments were.
jwuphysics.bsky.social
While democracy dies in darkness, let me just say that one of my most prized possessions is the re-launch print of the @baltimorebeat.bsky.social from 2020
jwuphysics.bsky.social
Two things about this paper.

1. This is legitimately useful information
2. The supplementary material shows the experimental setup... and they perform all experiments in a Bialetti Moka pot box, because of course they did
Supplementary Fig 1 of the paper, showing the schematic of the heating equipment (left) and image acquisition setup (right).
Supplementary Fig 1 of the paper, showing the actual heating equipment (left) and image acquisition setup (right). The right side shows a camera mounted over an illuminated box. The box is the outside packaging of the popular Bialetti moka pot.
Reposted by John F Wu
ekatrukha.bsky.social
A cell finding its way through the matrix, imaged with @joycemeiri.bsky.social on LLS.
jwuphysics.bsky.social
That's how I learned it!
jwuphysics.bsky.social
TIL that @colmweb.org is pronounced like "Collum"!
Reposted by John F Wu
melina-iras07572.bsky.social
The jellyfish #galaxy MACSJ0451-JFG1 in a galaxy cluster with #JWST NIRCam. 🔭

The galaxy is experiencing ram-pressure stripping. It moves through the intracluster medium and is stripped of gas, leaving tails that form stars.

My image processing from today: commons.wikimedia.org/wiki/File:MA...
A galaxy with long filament structures tailing behind the galaxy, like the tentacles of a jellyfish. The shape of the galaxy is also slightly warped.
jwuphysics.bsky.social
And I didn't even have to pay a billionaire! Wow!
verifiedusers.bsky.social
🛡️ @jwuphysics.bsky.social has been verified by @bsky.team. Track verified accounts and trusted verifiers at bverified.vercel.app!
jwuphysics.bsky.social
Impressive sleuthing!

Careful observations 🤝 careful statistical modeling
mattkenworthy.bsky.social
Some sad #exoplanets news in a paper led by me: “YSES 2b is a background star”. A distant M dwarf star some 2 kiloparsecs behind the star just so happens to have a non-zero proper motion in EXACTLY the wrong direction: this required multiple GRAVITY observations to solve… #astrosci #astrodon
A coiled spiral representing proper motion plus parallax shows a fit to several astrometric points, showing that the object next to the star YSES 2 is probably a very distant M dwarf star far away in our Galaxy. Drat, damn and blast!