Paul Chang
@mummitrollet.bsky.social
110 followers 210 following 70 posts
ML + stuff @Datacrunch
Pinned
mummitrollet.bsky.social
Speaking with @trappmartin.bsky.social, I realised there is no good space online to discuss ML-related stuff. It feels like a lot of people are on Bsky, but there isn't much discussion, and I am guilty of that myself. So I am going to try to be more active here.
Reposted by Paul Chang
datacrunch.io
❗️ We just expanded our capacity of B200 SXM6 180GB servers – available in the DataCrunch Cloud Platform.

The best thing is…

You can deploy the Blackwell platform without approvals.

Just sign in, select the instance type, and start your deployment:

cloud.datacrunch.io?utm_source=b...
mummitrollet.bsky.social
Also pretty cool to see open source community building on top of each other!
mummitrollet.bsky.social
The paper also suggests Grouped-Tied Attention (GTA), which works in the opposite direction: it draws inspiration from MLA and incorporates those techniques into GQA.
mummitrollet.bsky.social
The proposed Grouped Latent Attention (GLA) can now be split across devices by group, giving higher throughput without a drop in performance, since arithmetic intensity stays high and the work parallelises better.
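Here is a toy PyTorch sketch of how I read the idea (toy shapes, not the paper's exact parameterisation): each group of query heads gets its own small latent, so under tensor parallelism each GPU only caches and up-projects its own group's latent.

```python
# Toy sketch of the Grouped Latent Attention idea (my reading of it; toy shapes,
# causal mask omitted, not the paper's exact parameterisation).
import torch

seq_len, d_model = 128, 1024
n_heads, head_dim = 16, 64
n_groups = 2                                  # e.g. one group per tensor-parallel rank
heads_per_group = n_heads // n_groups
latent_dim = 256                              # per-group latent size

x = torch.randn(seq_len, d_model)
W_q = torch.randn(d_model, n_heads * head_dim) / d_model**0.5
W_dkv = [torch.randn(d_model, latent_dim) / d_model**0.5 for _ in range(n_groups)]
W_uk = [torch.randn(latent_dim, heads_per_group * head_dim) / latent_dim**0.5 for _ in range(n_groups)]
W_uv = [torch.randn(latent_dim, heads_per_group * head_dim) / latent_dim**0.5 for _ in range(n_groups)]

q = (x @ W_q).view(seq_len, n_groups, heads_per_group, head_dim)

outs = []
for g in range(n_groups):                     # under TP, each iteration lives on its own GPU
    c = x @ W_dkv[g]                          # (seq, latent_dim): the only thing cached for this group
    k = (c @ W_uk[g]).view(seq_len, heads_per_group, head_dim)
    v = (c @ W_uv[g]).view(seq_len, heads_per_group, head_dim)
    scores = torch.einsum("qhd,khd->hqk", q[:, g], k) / head_dim**0.5
    attn = torch.softmax(scores, dim=-1)
    outs.append(torch.einsum("hqk,khd->qhd", attn, v))

out = torch.cat(outs, dim=1).reshape(seq_len, n_heads * head_dim)
print(out.shape)                              # torch.Size([128, 1024])
```

Nothing is duplicated across ranks, and each rank still gets a dense chunk of matmul work per token.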
mummitrollet.bsky.social
Well, the paper suggests a hybrid method. What about using MLA and adding groups?
mummitrollet.bsky.social
Instead, one must make a copy of the latent component across GPUs, which feels wasteful.
mummitrollet.bsky.social
This is where MLA is somewhat awkward, and GQA scores some points back. MLA uses a single large latent head that must be replicated across all tensor-parallel GPUs, which means the attention computation cannot be sharded across them.
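A toy comparison of per-GPU cache bytes under tensor parallelism (illustrative numbers, one layer only, assuming at least as many KV heads as GPUs) shows the problem: GQA's cache shrinks as you add GPUs, while MLA's latent is duplicated on every one of them.

```python
# Sketch: KV-cache bytes held on *each* GPU under tensor parallelism (TP),
# illustrating why MLA's single latent head is awkward to shard.
# Illustrative numbers only (one layer, bf16).

tp = 8                   # tensor-parallel degree
seq_len = 32768
bytes_per_elem = 2       # bf16

# GQA: KV heads are split across GPUs, so the cache shards cleanly.
n_kv_heads, head_dim = 8, 128
gqa_per_gpu = (n_kv_heads // tp) * head_dim * 2 * seq_len * bytes_per_elem

# MLA: there is only one latent "head", so every GPU keeps a full copy of it.
latent_dim = 576         # compressed latent plus the decoupled RoPE part
mla_per_gpu = latent_dim * seq_len * bytes_per_elem

print(f"GQA cache per GPU: {gqa_per_gpu / 1e6:.1f} MB")   # shrinks as TP grows
print(f"MLA cache per GPU: {mla_per_gpu / 1e6:.1f} MB")   # constant, duplicated on all TP ranks
```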
mummitrollet.bsky.social
First of all, a confession! In the blog post 'Multi-Head Latent Attention: Benefits in Memory and Computation', we didn't tell the whole story: the benchmarking was done on a single GPU. In reality, for DeepSeek V3-style models, parallelization across GPUs is needed.
mummitrollet.bsky.social
The paper focuses on designing more effective decoding attention for inference, in light of Multi-head Latent Attention (MLA) and Grouped-Query Attention (GQA).
Reposted by Paul Chang
datacrunch.io
🆕 Inference APIs for FLUX.1 Kontext [max] & [pro] are now available on DataCrunch!

We are an infrastructure partner of Black Forest Labs for Kontext, a suite of generative flow matching models for text-to-image and image-to-image editing.

Learn more: datacrunch.io/managed-endp...
Reposted by Paul Chang
datacrunch.io
🚨 Summer Inference by Symposium AI is happening next Wednesday, June 4, at 16:00-22:00.

🇫🇮 This event will bring together 250 AI engineers, researchers, and founders under one roof in Helsinki.

🔗 You can still grab one of the last remaining seats: lu.ma/x5hhj79x
Symposium AI - Summer Inference · Luma
Join 250 leading AI builders for an epic night in Helsinki! Symposium AI events bring together top AI talent, researchers, and engineers who are actively…
lu.ma
mummitrollet.bsky.social
However, more is at play; revisiting Kipply's well-known Transformer Inference Arithmetic article shows that the MLA mechanism used during inference is now compute-bound 🖥️ rather than memory-bound 💾.
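A minimal sketch of that arithmetic-intensity argument, under a deliberately simplified cost model (only the score and value matmuls are counted, cache layout idealised): because all query heads share one latent cache, the bytes read from HBM are amortised across heads, so FLOPs per byte shoot up.

```python
# Sketch: decode-time arithmetic intensity (FLOPs per byte read from HBM) for one
# new token, comparing standard MHA with MLA's shared-latent formulation.
# Simplified cost model: only the score (QK^T) and value (PV) matmuls are counted.

def mha_intensity(n_heads=128, head_dim=128, seq_len=8192, bytes_per_elem=2):
    # Each head streams its own K and V from the cache.
    flops = 2 * 2 * n_heads * seq_len * head_dim              # two matmuls, 2 FLOPs per MAC
    bytes_read = 2 * n_heads * seq_len * head_dim * bytes_per_elem
    return flops / bytes_read

def mla_intensity(n_heads=128, latent_dim=512, seq_len=8192, bytes_per_elem=2):
    # All heads share one latent cache, so it is read from HBM once.
    flops = 2 * 2 * n_heads * seq_len * latent_dim
    bytes_read = seq_len * latent_dim * bytes_per_elem
    return flops / bytes_read

print(f"MHA intensity: ~{mha_intensity():.0f} FLOPs/byte")    # ~1: firmly memory-bound
print(f"MLA intensity: ~{mla_intensity():.0f} FLOPs/byte")    # hundreds: near the GPU's ridge point
```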
mummitrollet.bsky.social
Looking at the projections DeepSeek's attention (MLA) applies to the KV cache, one automatically thinks it means less memory needed in HBM, preventing dreaded out-of-memory errors 👿.
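To make that concrete, here is a rough back-of-the-envelope sketch using approximate DeepSeek-V3-style numbers (layer/head counts and latent size are assumptions for illustration, not the exact config):

```python
# Sketch: per-token KV-cache footprint, standard multi-head attention vs MLA.
# Approximate DeepSeek-V3-style settings, for illustration only.

n_layers = 61          # transformer layers
n_heads = 128          # attention heads
head_dim = 128         # per-head dimension
kv_lora_rank = 512     # MLA compressed latent dimension
rope_dim = 64          # decoupled RoPE key cached alongside the latent
bytes_per_elem = 2     # bf16

# Standard MHA: cache full K and V for every head in every layer.
mha_per_token = n_layers * 2 * n_heads * head_dim * bytes_per_elem

# MLA: cache only the shared latent vector plus the small RoPE key per layer.
mla_per_token = n_layers * (kv_lora_rank + rope_dim) * bytes_per_elem

print(f"MHA: {mha_per_token / 1e6:.2f} MB per token")
print(f"MLA: {mla_per_token / 1e6:.3f} MB per token")
print(f"reduction: ~{mha_per_token / mla_per_token:.0f}x")
```

With these assumed numbers the cache drops from roughly 4 MB to under 0.1 MB per token, which at long contexts is the difference between fitting in HBM and not.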
mummitrollet.bsky.social
Algorithm-hardware co-design was a big reason the whale 🐋 (DeepSeek) made such a splash 💦 with its V3 and R1 releases.
mummitrollet.bsky.social
This is very true! Go and speak to people in more old-school businesses and you quickly realize that with current models you could already do so much.
Reposted by Paul Chang
emollick.bsky.social
I don’t mean to be a broken record but AI development could stop at the o3/Gemini 2.5 level and we would have a decade of major changes across entire professions & industries (medicine, law, education, coding…) as we figure out how to actually use it & adapt our systems.

AI disruption is baked in.
Reposted by Paul Chang
lacerbi.bsky.social
1/ If you are at ICLR / AABI / AISTATS, check out work from our lab and collaborators on *inference everywhere anytime all at once*!

Go talk to my incredible PhD students @huangdaolang.bsky.social & @chengkunli.bsky.social + amazing collaborator Severi Rissanen.

@univhelsinkics.bsky.social FCAI
Reposted by Paul Chang
nsaphra.bsky.social
I wrote something up for AI people who want to get into bluesky and either couldn't assemble an exciting feed or gave up doomscrolling when their Following feed switched to talking politics 24/7.
The AI Researcher's Guide to a Non-Boring Bluesky Feed | Naomi Saphra
How to migrate to bsky without a boring feed.
nsaphra.net
Reposted by Paul Chang
lacerbi.bsky.social
1/10🔥 New paper alert in #AABI2025 Proceedings!

Normalizing Flow Regression (NFR) — an offline Bayesian inference method.

What if you could get a full posterior using *only* the evaluations you *already* have, maybe from optimization runs?