Lightnews — Scholar-powered news

Reposted by Benjamin Lefaudeux 🇺🇦

Tim Kellogg

@timkellogg.me

Limits of vector search

a new GDM paper shows that embeddings can’t represent combinations of concepts well

e.g. Dave likes blue trucks AND Ford trucks

even k=2 sub-predicates make SOTA embedding models fall apart

www.alphaxiv.org/pdf/2508.21038

On the Theoretical Limitations of Embedding-Based Retrieval | alphaXiv

View recent discussion. Abstract: Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-followi...

www.alphaxiv.org

August 31, 2025 at 11:07 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Tim Kellogg

@timkellogg.me

Longcat-Flash-Chat (560B)

uh, holy shit this one is intriguing. bare minimum they compare themselves to all the (actual) top models and do okay

but inside.. damn this one has some cool ideas

huggingface.co/meituan-long...

The image is a multi-panel bar chart comparing performance of different large language models across several benchmarks. It is divided into four categories: General Domains, Agentic Tool Use, Code, and Instruction Following. Each panel has bars representing model results, with scores on the y-axis.

Top row – General Domains:
• ArenaHard-V2: LongGPT-Flash leads with 86.5, followed by Kimi K2 (88.2), DeepSeek V3.1 (84.1), Claude Sonnet (61.5), GPT-4.1 (62.1), Qwen3.5 MoE-2507 (85.7), and Gemini 2.5 Flash (77.0).
• MMLU-Pro: Best scores are Kimi K2 (84.5) and DeepSeek V3.1 (84.5), with LongGPT-Flash (82.7), Qwen3.5 MoE-2507 (82.1), GPT-4.1 (81.7), Claude Sonnet (83.7), Gemini 2.5 Flash (82.0).

Top row – Agentic Tool Use:
• t2-Bench (average): LongGPT-Flash leads (67.7), Kimi K2 (64.2), Claude Sonnet (62.1), GPT-4.1 (55.1), DeepSeek V3.1 (49.8), Qwen3.5 MoE-2507 (43.0), Gemini 2.5 Flash (40.9).
• VitaBench: LongGPT-Flash 24.3, Claude Sonnet 23.0, DeepSeek V3.1 20.3, Kimi K2 18.2, GPT-4.1 19.0, Qwen3.5 MoE-2507 8.5, Gemini 2.5 Flash 8.0.

Bottom row – Code:
• SWE-Bench-Verified: Claude Sonnet leads with 68.0, Kimi K2 64.6, DeepSeek V3.1 66.0, LongGPT-Flash 60.4, GPT-4.1 48.6, Qwen3.5 MoE-2507 42.0, Gemini 2.5 Flash 40.6.
• TerminalBench: Claude Sonnet 40.7, LongGPT-Flash 39.5, DeepSeek V3.1 31.3, GPT-4.1 28.4, Kimi K2 25.9, Qwen3.5 MoE-2507 17.3, Gemini 2.5 Flash 12.4.

Bottom row – Instruction Following:
• COLLIE: LongGPT-Flash 57.1, Kimi K2 56.3, Claude Sonnet 51.2, GPT-4.1 50.0, DeepSeek V3.1 49.7, Gemini 2.5 Flash 48.6, Qwen3.5 MoE-2507 43.8.
• Meeseeks (ZH): LongGPT-Flash 43.0, Kimi K2 42.8, Claude Sonnet 41.5, GPT-4.1 35.1, DeepSeek V3.1 35.3, Qwen3.5 MoE-2507 33.8, Gemini 2.5 Flash 34.8.

August 31, 2025 at 11:20 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Ted Underwood

@tedunderwood.com

In 2012 when I had to clean data it seemed natural to look for rules I could use to clean it.

Now it seems natural to model the noise, find new clean data it can destroy, and then train a model to reverse the process.

Machine learning makes you a sicko.

July 27, 2025 at 11:16 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Ethan Mollick

@emollick.bsky.social

Three things to note about this:

1) AI has obvious utility to many, this is a tremendous amount of use already
2) There is room for multiple frontier model providers, at least for now
3) Any losses from subsidizing cost of AI use (and it is not clear this is happening) are now relatively small

July 26, 2025 at 7:33 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

"The Serial Scaling Hypothesis" (arxiv.org/abs/2507.125..., Liu et al) is interesting I think, not as new as it completely looks (autoregressive models are used serially, models have depth,..) but feels like a good formalization and intuition as of where current GPT based LLMs will typically fail

July 26, 2025 at 9:58 PM

Reposted by Benjamin Lefaudeux 🇺🇦

Andrei Bursuc

@abursuc.bsky.social

1/ Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, DINOv2 on various benchmarks setting a new standard for open-source research.

July 21, 2025 at 2:47 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

In the coming age of agents, I think vibe coding will die out, same lasting power as prompt engineering. For things LLMs excell at, you might as well stick to higher level directives and let it own the work, Claude Code is a good example. 1/2

July 18, 2025 at 9:52 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Tim Kellogg

@timkellogg.me

this is probably why Meta was able to poach OpenAI ppl

aside from the absolute piles of cash, Sama is very SV-minded and can’t imagine building apart from a product

a lot of accelerationists see things differently, more broadly, and ids dissatisfying to be forced into a product box

Tim Kellogg @timkellogg.me · Jul 13

explaining why they open sourced — to ensure that it’s broadly useful

OpenAI self-admits that they optimize their models for ChatGPT, o3 was made for DeepResearch

Moonshot was dissatisfied with that

For a closed-source ChatBot service, users have no idea what workflow or how many models are behind it. I've heard rumors that some major companies have dozens of models, hundreds of scenario classifications, and countless workflows behind their interfaces, claiming this is an "MoE model." Under "application-first" or "user experience-first" values, this approach is very natural and far more cost-effective than a single model.
But this clearly isn't what AGI should look like. For a startup like Kimi, this approach not only makes you increasingly mediocre and greatly hinders technical progress but also makes it impossible to compete with major companies that have PMs polishing every button.

July 13, 2025 at 10:27 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Still not a lot of ML talk on bsky (at least in my feed), hence paper Sunday: my two most interesting recent reads
- H Nets arxiv.org/abs/2507.07955
- Energy Based Transformers arxiv.org/abs/2507.02092

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Despite incredible progress in language models (LMs) in recent years, largely resulting from moving away from specialized models designed for specific tasks to general models based on powerful archite...

arxiv.org

July 13, 2025 at 6:14 AM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Little bit of personal news, shared in other circles already: I'm moving to Mistral in August, after three years at Photoroom. I'm really proud of what we built in the ML team with relatively limited means, lasting SOTA on the existing foundations (saliency segmentation) while growing a lot on genAI

July 12, 2025 at 8:01 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Vision and Graphics Trends

@si-cv-graphics.bsky.social

𝗗𝗲𝗽𝘁𝗵 𝗔𝗻𝘆𝘁𝗵𝗶𝗻𝗴 𝗮𝘁 𝗔𝗻𝘆 𝗖𝗼𝗻𝗱𝗶𝘁𝗶𝗼𝗻
Boyuan Sun, Modi Jin, Bowen Yin, Qibin Hou
arxiv.org/abs/2507.01634
Trending on www.scholar-inbox.com

July 7, 2025 at 6:00 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Tim Kellogg

@timkellogg.me

kyutai open sources its TTS model as well as Unmute, a framework for building audio AI apps

notable:
- high accuracy
- actually streaming (can use streaming text input)
- serves 32 simultaneous users on a single GPU
- voice cloning
- supports all 24 official EU languages

kyutai.org/next/tts

A text-to-speech optimized for real-time usage.

kyutai.org

July 7, 2025 at 11:07 AM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Alex Nichol is one the rare many-hits researchers of the field, with on top of that a track record of practical models which affect the public/ship. That Meta wouldn't target him is pretty rich

Alex Nichol @unixpickle.bsky.social · Jun 29

Kinda offended that meta didn't try to recruit me 😂

June 29, 2025 at 8:12 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Automatically generate a fused megakernel in triton.. diving in, but if it works half as well as it reads it would already be quite something. Aligns with torch.compile of course

github.com/mirage-proje...

GitHub - mirage-project/mirage: Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA - mirage-project/mirage

github.com

June 24, 2025 at 7:49 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Sharing that Photoroom open sourced _Dataroom_, as promised some time ago.

Accompanying blog post and mini thread
github.com/photoroom/da...
www.photoroom.com/inside-photo...

1/N

Photoroom Visual Ads Automation & GenerateBanners Acquisition

Photoroom launches Visual Ads Automation, a GenAI API turning product catalogs into branded ad creatives; GenerateBanners acquisition adds text automation.

www.photoroom.com

June 23, 2025 at 10:00 AM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Still haven't tried Cursor, but I recently moved from Github Copilot to Continue with Codestral (free API), and it's absurd how much better Continue with Codestral is (vs. Copilot with expensive and slow models).

Made me realize that there is zero moat in this field, at least for Copilot.

June 19, 2025 at 8:14 AM

Reposted by Benjamin Lefaudeux 🇺🇦

Drew Breunig

@dbreunig.bsky.social

In the last 2 weeks:

- Slack locked down its messages data.
- X locked down its post data.
- Anthropic cut off OpenAI's Windsurf.
- Google will stop using Scale.

The dream of unfettered MCP interconnects is a mirage.

www.dbreunig.com/2025/06/16/d...

The Drawbridges Go Up

The AI era is speedrunning the Web 2.0 story. Open and accessible MCPs are not our future. Integrations will be tightly governed.

www.dbreunig.com

June 16, 2025 at 5:38 PM

Reposted by Benjamin Lefaudeux 🇺🇦

Hadley Wickham

@hadley.nz

While framed as a critique of Apple’s recent paper, I found this article mostly interesting because it made me think about reasoning in general: mikecaulfield.substack.com/p/the-apple-...

The Apple "Reasoning Collapse" Paper Is Even Dumber Than You Think

We're this far into reasoners and neither hypesters nor skeptics really understand their significance. Also: Read Toulmin.

mikecaulfield.substack.com

June 14, 2025 at 7:01 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Self adapting language models, still early but fascinating prospects. There's a dimensionality curse of course: since the dimensions the LLM can touch per generated token are very small as such, needs a massive lever / dimension reduction to be able to self improve.

arxiv.org/pdf/2506.10943

June 14, 2025 at 8:30 AM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Great write up of AMD new offerings, catching the nvidia train on the software side it seems. 3x speedup on MI300X since release, was required but still great to grab

morethanmoore.substack.com/p/amds-ai-fu...

AMD's AI Future is Rack Scale 'Helios'

Key Announcements from AMD Advancing AI 2025

morethanmoore.substack.com

June 12, 2025 at 7:53 PM

Reposted by Benjamin Lefaudeux 🇺🇦

Alex Nichol

@unixpickle.bsky.social

Got nerdsniped into printing this a little while ago.

June 6, 2025 at 4:40 AM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

datago now available with webdataset compatibility (streaming tarballs, so you get the data as it arrives). Just pip install datago and give it a whirl if you'd like ? Speed without the dataloader processes, and typical ViT/DiT pre-processing baked in.
example code here github.com/Photoroom/da...

datago/python/benchmark_webdataset.py at main · Photoroom/datago

A Rust-based data loader which can be used from Python. Processing data per sample at GB/s speeds, covering various use cases eventually. - Photoroom/datago

github.com

June 4, 2025 at 8:37 PM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

Great link with bsky.app/profile/dbre...

Eugene Vinitsky 🍒 @eugenevinitsky.bsky.social · May 30

Some interesting work suggesting that recent shocking RL+LLM results are due to incorrect baselines and most of the gains are from better format following: safe-lip-9a8.notion.site/Incorrect-Ba...

Incorrect Baseline Evaluations Call into Question Recent LLM-RL Claims | Notion

Authors*: Nikhil Chandak, Shashwat Goel, Ameya Prabhu

safe-lip-9a8.notion.site

May 31, 2025 at 5:31 AM

Benjamin Lefaudeux 🇺🇦

@bentheegg.bsky.social

SageAttention3 paper reads great, and looks like B200s just got a good value boost. QAT or PTQ-free use of FP4, I expected this to be much more complicated or come later to be honest. Only at the attention level and LLMs are most often MLP bottlenecked but stil
arxiv.org/abs/2505.11594

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blac...

arxiv.org

May 29, 2025 at 9:58 PM

Reposted by Benjamin Lefaudeux 🇺🇦

Anton Obukhov

@obukhov.ai

Big Marigold update!
Last year, we showed how to turn Stable Diffusion 2 into a SOTA depth estimator with a few synthetic samples and 2–3 days on just 1 GPU.
Today's release features:
🏎️ 1-step inference
🔢 New modalities
🫣 High resolution
🧨 Diffusers support
🕹️ New demos
🧶👇

May 15, 2025 at 4:23 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news