Mike Smith
@mjjsmith.com
AI researcher who likes self-supervised, unsupervised, and generative models

Was at Herts and aspiaspace.com, now at astroai.cfa.harvard.edu and @universetbd.org
Pinned
Check out 100TB of machine learning ready multimodal astronomy data 🔥🦾

Excited to see what the FOSS machine learning world does with this 🔭
Excited to announce the Multimodal Universe (MMU): a huge 100TB dataset bringing together the largest ML-focused collection of astronomical observations ever assembled to accelerate open AI and astronomy research

github.com/MultimodalUn...

Think ImageNet, but for space 🔭 #astrocode 🧵
New AstroPT models are out 🔭🎉 This time trained with an improved DESI galaxy image dataset. Link here: huggingface.co/Smith42/astr...

Check out these new scaling curves!

We are still seeing improvement at 800M parameters where before we stalled at 100M. Maybe high quality data is all you need 🤔
June 2, 2025 at 6:56 PM
Reposted by Mike Smith
Anyways, here's the paper - it's one of the first big uses of foundation models in astronomy that I'm aware of, and it seems to have worked really well! #extragalactic #astrocode 🧪
Euclid Quick Data Release (Q1) Exploring galaxy properties with a multi-modal foundation model
Modern astronomical surveys, such as the Euclid mission, produce high-dimensional, multi-modal data sets that include imaging and spectroscopic information for millions of galaxies. These data serve a...
arxiv.org
March 20, 2025 at 11:10 AM
Ooh we scalin'
May 31, 2025 at 10:19 PM
Reposted by Mike Smith
Strong Platonic Representation Hypothesis

All embedding models, given large enough scale, can be translated between them without paired data

Security implication: Embeddings aren’t encryption, they’re basically plain text

arxiv.org/abs/2505.12540
May 22, 2025 at 10:40 AM
Reposted by Mike Smith
If you add "also Cthulhu-y" to the prompt, the results are pretty great.
May 9, 2025 at 4:58 AM
A great dataset to round out UTBD's 2^2 week 😎
📢 New dataset out!

We introduce HypoGen💥, a dataset of ~5.5K structured problem–hypothesis pairs (Bit–Flip–Spark + Chain‑of‑Reasoning) to advance LLM-driven scientific ideation💡.

Fine‑tuned LLaMA 3.1 8B & R1‑distilled models show significant gains. Humans are still the best🥇.
April 19, 2025 at 10:32 AM
Was great fun cooking this up with Sharaf and team! Check out all the code at github.com/UniverseTBD/... and paper at arxiv.org/abs/2504.08583
April 16, 2025 at 12:19 PM
Reposted by Mike Smith
🎉 HAPPY BIRTHDAY, UniverseTBD! 🚀
As we turn 2, we’re going 2^2.
Launching a new project per day for the next four days.
We hope that you all enjoy these works as much as we have enjoyed working on them. Stay tuned for the big reveals!
April 13, 2025 at 7:18 PM
tariffs getting so bad you can't even import numpy 🥲
April 12, 2025 at 1:19 PM
me: i didn't know you were cool like that
val loss:
February 6, 2025 at 10:32 AM
me: go left! ←←
my computer: best i can do is ^[[D
February 6, 2025 at 8:55 AM
Going to be a great talk 😎
Join us tomorrow (Feb 5th) at 10:00 UTC for our monthly talk 🚀 🔭! Maya Jablonska (ANU) will be presenting: "UniverseTBD: Democratising Science for Everyone"

We will be streaming on Discord (discord.gg/tSKx4h8p?eve...) and on YouTube (youtube.com/@UniverseTBD...)
February 4, 2025 at 6:04 PM
arxiv.org/abs/2501.12499 super cool paper! Extracting useful information from astro time series via RNNs
Multiband Embeddings of Light Curves
In this work, we propose a novel ensemble of recurrent neural networks (RNNs) that considers the multiband and non-uniform cadence without having to compute complex features. Our proposed model consis...
arxiv.org
January 23, 2025 at 9:58 AM
With r1 and o1, Yann LeCun's cake is now baked and ready
January 23, 2025 at 8:07 AM
The final frontier for AI will be anything that can't be captured via a quantitative benchmark
January 23, 2025 at 7:37 AM
broke: reading 500 page AI safety papers

woke: learning AI alignment best practices from "wallace & gromit: vengeance most fowl"
January 12, 2025 at 7:46 AM
first time spotting AI art in the wild

just look at that floating ship!
January 9, 2025 at 3:00 PM
the nvidia digits case is so british-housing-core
January 8, 2025 at 2:14 PM
Reposted by Mike Smith
This makes so much sense – more cooperation when more willing to wait for rewards ~= less risk averse.

www.sciencedirect.com/science/arti...
Patience is a virtue: Cooperative people have lower discount rates
Reciprocal altruism involves foregoing an immediate benefit for the sake of a greater long-term reward. It follows that individuals who exhibit a stro…
www.sciencedirect.com
January 6, 2025 at 10:22 AM
Reposted by Mike Smith
Think twice before gifting someone an M dwarf this holiday season
December 22, 2024 at 4:22 AM
Reposted by Mike Smith
I move to refer to 'Gold OA' as 'Pay to Publish'
December 22, 2024 at 11:12 AM
o3 got me thinking about the future of selling our labour as code... how many more iterations until that's transformed? 😅
December 21, 2024 at 1:02 AM