Marco Cuturi
@marcocuturi.bsky.social
740 followers 58 following 21 posts
machine learning researcher @ Apple machine learning research
Reposted by Marco Cuturi
mkirchhof.bsky.social
LLMs are currently this one big parameter block that stores all sorts of facts. In our new preprint, we add context-specific memory parameters to the model, and pretrain the model along with a big bank of memories.

📑 arxiv.org/abs/2510.02375

[1/10]🧵
Reposted by Marco Cuturi
davidpicard.bsky.social
Wow! Finally OT done on the entire training set to train a diffusion model!
marcocuturi.bsky.social
Our two phenomenal interns, Alireza Mousavi-Hosseini and Stephen Zhang @syz.bsky.social have been cooking some really cool work with Michal Klein and me over the summer.

Relying on optimal transport couplings (to pick noise and data pairs) should, in principle, be helpful to guide flow matching

🧵
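To make the idea concrete, here is a minimal sketch of flow matching with OT-paired minibatches, in plain JAX with scipy's exact assignment solver; velocity_net and its params are placeholders, not the paper's model.

import jax
import jax.numpy as jnp
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_pair(noise, data):
    # Squared-Euclidean cost between all noise/data points in the batch,
    # solved exactly with the Hungarian algorithm (fine at small batch sizes).
    cost = np.linalg.norm(noise[:, None] - data[None, :], axis=-1) ** 2
    rows, cols = linear_sum_assignment(cost)
    return noise[rows], data[cols]

def fm_loss(params, velocity_net, noise, data, key):
    z, x = ot_pair(np.asarray(noise), np.asarray(data))
    z, x = jnp.asarray(z), jnp.asarray(x)
    t = jax.random.uniform(key, (x.shape[0], 1))
    x_t = (1.0 - t) * z + t * x   # straight-line interpolation
    target = x - z                # conditional velocity to regress
    pred = velocity_net(params, x_t, t)
    return jnp.mean((pred - target) ** 2)

The only change from vanilla FM is ot_pair: instead of pairing noise and data at random, pairs come from an OT coupling.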
marcocuturi.bsky.social
Then there's always 𝜀 regularization. When 𝜀=∞, we recover vanilla FM. At this point we're not completely sure whether 𝜀=0 is better than 𝜀>0; they both work! 𝜀=0 has a minor edge at larger scales (sparse gradients, faster assignment, slightly better metrics), but 𝜀>0 is also useful (faster SGD)
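For intuition, a small self-contained sketch (my own illustration, not the paper's code) of how 𝜀 interpolates between the two regimes: sample pairs from the entropic coupling of a minibatch. As 𝜀 grows, the coupling's rows flatten toward uniform (vanilla FM pairing); as 𝜀→0 they concentrate on the OT assignment.

import jax
import jax.numpy as jnp

def sinkhorn_log(cost, eps, iters=200):
    # Log-domain Sinkhorn with uniform marginals; returns the log-coupling.
    n, m = cost.shape
    log_a, log_b = -jnp.log(n), -jnp.log(m)
    f, g = jnp.zeros(n), jnp.zeros(m)
    for _ in range(iters):
        f = eps * (log_a - jax.nn.logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_b - jax.nn.logsumexp((f[:, None] - cost) / eps, axis=0))
    return (f[:, None] + g[None, :] - cost) / eps

def sample_pairs(key, cost, eps):
    # One data index per noise point, drawn from the coupling's rows.
    log_p = sinkhorn_log(cost, eps)
    return jax.random.categorical(key, log_p, axis=1)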
marcocuturi.bsky.social
Thanks for the nice comments! my interpretation is that we're using OT to produce pairs (x_i,y_i) to guide FM. With that, it's up to you to provide an inductive bias (a model) that gets f(x)~=y while generalizing. The hard OT assignment could be that model, but it would fail to generalize.
marcocuturi.bsky.social
for people who like OT, IMHO the very encouraging insight is that we have evidence that the "better" you solve your OT problem, the more flow matching metrics improve; this is Figure 3
marcocuturi.bsky.social
Thanks @rflamary.bsky.social! yes, exactly. We try to summarize this tradeoff in Table 1, in which we show that for a one-off preprocessing cost, we now get all (noise,data) pairings you might need during flow matching training for "free" (up to the MIPS lookup for each noise).
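For a squared-Euclidean cost, that MIPS lookup is easy to see: the semidiscrete assignment of a noise point z given learned duals g is argmin_i ½||z − x_i||² − g_i, which rewrites as argmax_i ⟨z, x_i⟩ + (g_i − ½||x_i||²), i.e. a maximum-inner-product search over the dataset. A brute-force sketch (a real pipeline would use an approximate MIPS index):

import jax.numpy as jnp

def assign(noise, data, duals):
    # Score every data point for each noise vector; the argmax is the OT assignment.
    offsets = duals - 0.5 * jnp.sum(data ** 2, axis=1)
    scores = noise @ data.T + offsets[None, :]
    return jnp.argmax(scores, axis=1)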
marcocuturi.bsky.social
the paper is out: arxiv.org/abs/2509.25519

Michal also did a fantastic push to open source the semidiscrete solver prepared by Stephen and Alireza in the OTT-JAX library. We plan to open source the flow pipeline in JAX soon. Please reach out if interested!
Flow Matching with Semidiscrete Couplings
Flow models parameterized as time-dependent velocity fields can generate data from noise by integrating an ODE. These models are often trained using flow matching, i.e. by sampling random pairs of noi...
arxiv.org
marcocuturi.bsky.social
This is much faster than using Sinkhorn, and generates with higher quality.

As a bonus, you can forget about entropy regularization (set ε=0), apply things like correctors to guidance, and use it on consistency-type models, or even with conditional generation.
marcocuturi.bsky.social
the great thing with SD-OT is that it only needs to be computed once: you store just one real number per data sample, and these numbers can be precomputed once and for all using stochastic convex optimization.

When training a flow model, you assign noise to data using these numbers.
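A hedged sketch of that precomputation (assuming uniform weights and a squared-Euclidean cost; batch size and learning rate are illustrative): stochastic subgradient ascent on the semidiscrete dual, nudging each g_i until data points win noise assignments at their target frequency.

import jax
import jax.numpy as jnp

def dual_step(g, data, key, lr=0.1):
    n = data.shape[0]
    z = jax.random.normal(key, (4096, data.shape[1]))       # fresh noise batch
    cost = 0.5 * jnp.sum((z[:, None] - data[None, :]) ** 2, axis=-1)
    winners = jnp.argmin(cost - g[None, :], axis=1)         # current assignments
    hits = jnp.zeros(n).at[winners].add(1.0) / z.shape[0]   # empirical win rates
    return g + lr * (1.0 / n - hits)                        # ascend the dual

Iterate dual_step over fresh noise batches until the win rates match the uniform marginal, then store g (one float per data point) next to the dataset.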
marcocuturi.bsky.social
In practice, however, this idea only begins to work when using massive batch sizes (see arxiv.org/abs/2506.05526). The problem is that the costs of running Sinkhorn on millions of points can quickly balloon...

Our solution? Rely on semidiscrete OT at scales that were never considered before.
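The back-of-envelope is brutal: each Sinkhorn iteration touches a full n×n cost matrix, so at the batch sizes where minibatch OT starts to pay off, the matrix alone no longer fits anywhere.

n = 1_000_000                      # points in one "massive" batch
bytes_f32 = 4
print(n * n * bytes_f32 / 1e12)    # ~4.0 TB for a single cost matrix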
Reposted by Marco Cuturi
peteryugray.bsky.social
New Apple #ML Research Highlight: The "Super Weight:" How Even a Single Parameter can Determine an #LLM's Behavior machinelearning.apple.com/research/the...
marcocuturi.bsky.social
you're right that the PCs' message uses space as a justification to accept fewer papers, but it does not explicitly mention that the acceptance rate should be lower than the historical standard of 25%. In my SAC batch, the average acceptance before their email was closer to 30%, but that's just me..
marcocuturi.bsky.social
I see it a bit differently. The new system pushed reviewers aggressively to react to rebuttals. I think this is a great change, but this has clearly skewed results, creating many spurious grade upgrades. Now the system must be rebalanced in the other direction by SAC/AC for results to be fair..
marcocuturi.bsky.social
scaling up the computation of optimal transport couplings to hundreds of thousands of 3k-dimensional vectors, made easy by sharding and OTT-JAX! Check this notebook: it only takes a few lines of code thanks to JAX's native sharding abilities ott-jax.readthedocs.io/en/latest/tu...
Sharded Sinkhorn — ott 0.5.1.dev34+g3462f28 documentation
ott-jax.readthedocs.io
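A rough sketch of what the setup looks like (shapes, epsilon and batch_size are illustrative; the linked notebook is the authoritative version):

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P
from ott.geometry import pointcloud
from ott.problems.linear import linear_problem
from ott.solvers.linear import sinkhorn

# Shard both point clouds across all available devices along a "data" axis.
mesh = Mesh(jax.devices(), axis_names=("data",))
shard = NamedSharding(mesh, P("data", None))

kx, ky = jax.random.split(jax.random.PRNGKey(0))
x = jax.device_put(jax.random.normal(kx, (200_000, 3072)), shard)
y = jax.device_put(jax.random.normal(ky, (200_000, 3072)), shard)

# Online cost evaluation (batch_size) avoids materializing the full kernel.
geom = pointcloud.PointCloud(x, y, epsilon=1e-2, batch_size=1024)
out = jax.jit(sinkhorn.Sinkhorn())(linear_problem.LinearProblem(geom))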
Reposted by Marco Cuturi
paulineluc.bsky.social
So pleased and proud to share with you what our team has been up to, on an ambitious journey to build a video foundation model for scientific domains! ✨ 🚀 🎞️ 🧪 #ICCV2025 #AI4Science
hassony2.bsky.social
Thrilled to share our latest work on SciVid, to appear at #ICCV2025! 🎉
SciVid offers cross-domain evaluation of video models in scientific applications, including medical CV, animal behavior, & weather forecasting 🧪🌍📽️🪰🐭🫀🌦️
📝 Check out our paper: arxiv.org/abs/2507.03578
[1/4]🧵
Reposted by Marco Cuturi
mkirchhof.bsky.social
Can LLMs access and describe their own internal distributions? With my colleagues at Apple, I invite you to take a leap forward and make LLM uncertainty quantification what it can be.
📄 arxiv.org/abs/2505.20295
💻 github.com/apple/ml-sel...
🧵1/9
Reposted by Marco Cuturi
silingao.bsky.social
NEW PAPER ALERT: Recent studies have shown that LLMs often lack robustness to distribution shifts in their reasoning. Our paper proposes a new method, AbstRaL, to augment LLMs’ reasoning robustness, by promoting their abstract thinking with granular reinforcement learning.
Reposted by Marco Cuturi
maureendeseyssel.bsky.social
Now that @interspeech.bsky.social registration is open, time for some shameless promo!

Sign-up and join our Interspeech tutorial: Speech Technology Meets Early Language Acquisition: How Interdisciplinary Efforts Benefit Both Fields. 🗣️👶

www.interspeech2025.org/tutorials

⬇️ (1/2)
Reposted by Marco Cuturi
cemkoch.bsky.social
Today we have released the code and a demo iOS application for FastVLM - our extremely efficient and fast vision language model which runs on your device using MLX! You can check out the code and the app here: github.com/apple/ml-fas...
Reposted by Marco Cuturi
davidgrangier.bsky.social
#ICLR #TrainBetterLM I am at ICLR, come to our posters for improved language model training!

Recycle gradients for faster neural net training with AdEMAmix iclr.cc/virtual/2025... (Fri Apr 25, 10 am).

1/3
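If you haven't seen AdEMAmix: as I understand the paper, it keeps a second, much slower EMA of past gradients next to Adam's fast one and mixes the two in the update, which is what lets old gradients be "recycled". A condensed sketch (hyperparameters illustrative; the paper has the exact schedules and corrections):

import jax.numpy as jnp

def ademamix_update(p, g, m1, m2, v, t, lr=1e-4,
                    b1=0.9, b2=0.999, b3=0.9999, alpha=5.0, eps=1e-8):
    m1 = b1 * m1 + (1 - b1) * g      # fast EMA, as in Adam
    m2 = b3 * m2 + (1 - b3) * g      # slow EMA: the "recycled" gradients
    v = b2 * v + (1 - b2) * g ** 2
    m1_hat = m1 / (1 - b1 ** t)      # bias-correct the Adam-style terms
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m1_hat + alpha * m2) / (jnp.sqrt(v_hat) + eps)
    return p, m1, m2, v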