Author | Lightnews

Mike Smith

@mjjsmith.com

AI researcher who likes self-supervised, unsupervised, and generative models

Was at Herts and aspiaspace.com, now at astroai.cfa.harvard.edu and @universetbd.org

Pinned

Mike Smith @mjjsmith.com · Dec 2

Check out 100TB of machine learning ready multimodal astronomy data 🔥🦾

Excited to see what the FOSS machine learning world does with this 🔭

UniverseTBD @universetbd.org · Dec 2

Excited to announce the Multimodal Universe (MMU): a huge 100TB dataset bringing together the largest ML-focused collection of astronomical observations ever assembled to accelerate open AI and astronomy research

github.com/MultimodalUn...

Think ImageNet, but for space 🔭 #astrocode 🧵

Mike Smith

@mjjsmith.com

New AstroPT models are out 🔭🎉 This time trained with an improved DESI galaxy image dataset. Link here: huggingface.co/Smith42/astr...

Check out these new scaling curves!

We are still seeing improvement at 800M parameters where before we stalled at 100M. Maybe high quality data is all you need 🤔

June 2, 2025 at 6:56 PM

Reposted by Mike Smith

Emily Hunt

@emily.space

Anyways, here's the paper - it's one of the first big uses of foundation models in astronomy that I'm aware of, and it seems to have worked really well! #extragalactic #astrocode 🧪

Euclid Quick Data Release (Q1) Exploring galaxy properties with a multi-modal foundation model

Modern astronomical surveys, such as the Euclid mission, produce high-dimensional, multi-modal data sets that include imaging and spectroscopic information for millions of galaxies. These data serve a...

arxiv.org

March 20, 2025 at 11:10 AM

Mike Smith

@mjjsmith.com

Ooh we scalin'

May 31, 2025 at 10:19 PM

Reposted by Mike Smith

Tim Kellogg

@timkellogg.me

Strong Platonic Representation Hypothesis

All embedding models, given large enough scale, can be translated into one another without paired data

Security implication: Embeddings aren’t encryption, they’re basically plain text

arxiv.org/abs/2505.12540

This side-by-side plot compares two visualizations of the same dataset’s embeddings: on the left, the original embeddings; on the right, latent representations generated via a transformation method (here labeled vec2vec).

Left: Original Embeddings
• Two clearly separated clusters of red and green points.
• The clusters represent two distinct groups (e.g., classes, domains, or modalities).
• Gray lines show strong alignment or correspondences between red and green points—suggesting some shared structure or matched pairs.
• However, the clusters are far apart, meaning the original embedding space encodes strong domain-specific separation (e.g., red and green are treated as different).

Right: Latent Representations (vec2vec)
• The same points are now more uniformly mixed in latent space.
• The tight clustering by color is gone; red and green points are distributed throughout.
• This suggests the vec2vec method has projected both groups into a shared latent space, removing domain bias and aligning semantically similar items regardless of origin.
• It’s indicative of embedding alignment, domain adaptation, or representation unification, where cross-domain items are mapped closer together based on semantic similarity.

Implication:

vec2vec successfully transforms the original domain-specific embeddings into a common space where structural similarity dominates over origin (color), enabling better transfer, comparison, or fusion between domains.

May 22, 2025 at 10:40 AM

Mike Smith

@mjjsmith.com

May 12, 2025 at 12:48 PM

Reposted by Mike Smith

Ethan Mollick

@emollick.bsky.social

If you add "also Cthulhu-y" to the prompt, the results are pretty great.

May 9, 2025 at 4:58 AM

Mike Smith

@mjjsmith.com

A great dataset to round out UTBD's 2^2 week 😎

UniverseTBD @universetbd.org · Apr 18

📢 New dataset out!

We introduce HypoGen💥, a dataset of ~5.5K structured problem–hypothesis pairs (Bit–Flip–Spark + Chain‑of‑Reasoning) to advance LLM-driven scientific ideation💡.

Fine‑tuned LLaMA 3.1 8B & R1‑distilled models show significant gains. Humans are still the best🥇.

April 19, 2025 at 10:32 AM

Reposted by Mike Smith

UniverseTBD

@universetbd.org

April 18, 2025 at 5:57 PM

Mike Smith

@mjjsmith.com

Was great fun cooking this up with Sharaf and team! Check out all the code at github.com/UniverseTBD/... and paper at arxiv.org/abs/2504.08583

April 16, 2025 at 12:19 PM

Reposted by Mike Smith

UniverseTBD

@universetbd.org

🎉 HAPPY BIRTHDAY, UniverseTBD! 🚀
As we turn 2, we’re going 2^2.
Launching a new project per day for the next four days.
We hope that you all enjoy these works as much as we have enjoyed working on them. Stay tuned for the big reveals!

April 13, 2025 at 7:18 PM

Mike Smith

@mjjsmith.com

tariffs getting so bad you can't even import numpy 🥲

April 12, 2025 at 1:19 PM

Mike Smith

@mjjsmith.com

me: i didn't know you were cool like that
val loss:

February 6, 2025 at 10:32 AM

Mike Smith

@mjjsmith.com

me: go left! ←←
my computer: best i can do is ^[[D

February 6, 2025 at 8:55 AM

Mike Smith

@mjjsmith.com

Going to be a great talk 😎

UniverseTBD @universetbd.org · Feb 4

Join us tomorrow (Feb 5th) at 10:00 UTC for our monthly talk 🚀 🔭! Maya Jablonska (ANU) will be presenting: "UniverseTBD: Democratising Science for Everyone"

We will be streaming on Discord (discord.gg/tSKx4h8p?eve...) and on Youtube (youtube.com/@UniverseTBD...)

February 4, 2025 at 6:04 PM