binshi.bsky.social
@binshi.bsky.social
Pushing single-GPU inference throughput to the edge without libraries
⭐️ Fast LLM Inference From Scratch
andrewkchan.dev
October 15, 2025 at 8:29 AM
Ditch complex setups & embrace direct interaction with powerful models like GPT-5-Codex for efficient code generation. It's about intuition, not charades! #AgenticEngineering #AI #GPT5Codex
Just Talk To It - the no-bs Way of Agentic Engineering | Peter Steinberger
A practical guide to working with AI coding agents without the hype.
steipete.me
October 15, 2025 at 8:28 AM
Key advice:
🎯 Focus on goal-driven research over idea-driven research.
📈 Aim for 10X (not 10%) improvements by tackling important problems.
📝 Maintain a research notebook and do regular reviews for continual progress.
An Opinionated Guide to ML Research
joschu.net
October 3, 2025 at 9:25 AM
Keep systems simple, minimize stateful parts, rely on clear schemas/indexes, and prefer queues/events for slow or async tasks. Focus on “hot paths,” log issues, and always design to fail gracefully. Simple, robust design beats clever complexity every time.
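As a minimal sketch of the "prefer queues for slow or async tasks" point (standard-library Python only; the signup/email names are made up for illustration): the request handler writes its state and returns, while the slow work runs off the hot path.

```python
import queue
import threading
import time

def send_welcome_email(user_id: int) -> None:
    # Hypothetical slow task standing in for slow I/O (email, image resize, ...).
    time.sleep(0.5)
    print(f"sent welcome email to user {user_id}")

task_queue: "queue.Queue[int]" = queue.Queue()

def worker() -> None:
    # Runs off the hot path; failures here can be logged and retried
    # without blocking the request that enqueued the work.
    while True:
        user_id = task_queue.get()
        try:
            send_welcome_email(user_id)
        finally:
            task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_signup(user_id: int) -> None:
    # The hot path: persist the user, enqueue the slow part, return fast.
    task_queue.put(user_id)

handle_signup(42)
task_queue.join()
```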
Everything I know about good system design
I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are…
www.seangoedecke.com
October 1, 2025 at 8:18 AM
Dive deep into the world of #RLHF! 🤖 The 'Reinforcement Learning from Human Feedback' book by Nathan Lambert offers a gentle introduction to core methods like Reward Modeling, DPO, PPO, and Instruction Tuning for language models.
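For a taste of one of those methods, a minimal sketch of the DPO objective (not the book's code; the log-probabilities and beta value below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) under the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected completions apart.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probs for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```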
RLHF Book by Nathan Lambert
The Reinforcement Learning from Human Feedback Book
rlhfbook.com
September 29, 2025 at 5:32 AM
Learn how to implement a Byte Pair Encoding (BPE) Tokenizer from scratch. This is the core tokenization algorithm behind LLMs like #GPT2, #GPT4, and #Llama3. The post covers:
✅ BPE Algorithm Outline
✅ Step-by-Step Implementation
✅ Training & Loading GPT-2 Vocabs
#LLMs #DeepLearning #NLP #FromScratch
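A rough sketch of the core BPE training loop (not the notebook's code; simplified to characters rather than bytes, merging the most frequent adjacent pair each round):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)          # real BPE starts from bytes, not characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("low lower lowest", num_merges=5)
print(merges)   # learned merge rules, most frequent first
print(tokens)   # the text re-segmented with the learned merges
```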
Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch
This is a standalone notebook implementing the popular byte pair encoding (BPE) tokenization algorithm, which is used in models like GPT-2 to GPT-4, Llama 3,...
sebastianraschka.com
September 26, 2025 at 3:40 PM
The post traces a FlashAttention call from the PyTorch entry point, through the launcher's setup (grid and block sizes), to the highly optimized Triton JIT kernel code.

#FlashAttention #Triton #LLMs #GPUKernel #DeepLearning
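The FlashAttention kernel itself is long, but the launch path the post describes (Python wrapper, grid computation, @triton.jit kernel) looks the same in miniature. A standard Triton vector-add as a stand-in, not the post's kernel (requires a CUDA-capable GPU):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The "launcher": compute the grid from the problem size, then launch.
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
print(torch.allclose(add(x, y), x + y))
```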
Nathan's Blog
nathanchen.me
September 25, 2025 at 8:53 AM
Engineers now need to master the art of articulating product requirements so that agents can interpret them and build them out.
The New Code — Sean Grove, OpenAI
YouTube video by AI Engineer
www.youtube.com
September 24, 2025 at 12:29 PM
By providing a reliable framework for AI agents to handle complex tasks, maintain context, and coordinate multiple actions, the Agents API enables enterprises to use AI in more practical and impactful ways.
Build AI agents with the Mistral Agents API | Mistral AI
mistral.ai
September 24, 2025 at 12:28 PM
For anyone interested in the nuts & bolts of language modelling.
Stanford CS336 Language Modeling from Scratch I 2025 - YouTube
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpo...
www.youtube.com
September 23, 2025 at 9:49 AM
Reproducibility in LLMs is harder than you think. This post breaks down how floating-point order + batch size variability introduce nondeterminism even at temp=0, and shows how batch-invariant kernels restore consistency.
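A tiny illustration of the floating-point half of the argument (NumPy, not the post's code): the same numbers reduced in a different order give a slightly different sum, which is exactly what changes when batch size alters a kernel's reduction order.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Strictly sequential accumulation vs. a tree-like chunked reduction,
# mimicking how different tiling/batching reorders floating-point adds.
sequential = np.float32(0)
for v in x:
    sequential += v
chunked = np.sum([c.sum() for c in np.split(x, 100)])

print(sequential, chunked, "difference:", abs(float(sequential) - float(chunked)))
```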
Defeating Nondeterminism in LLM Inference
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. For example, you might observe that asking ChatGPT the...
thinkingmachines.ai
September 23, 2025 at 8:16 AM
When reinforcement learning adds value for language models compared to supervised fine-tuning. Concise and clear notes: gist.github.com/yoavg/6bff0f...
rl-for-llms.md
GitHub Gist: instantly share code, notes, and snippets.
gist.github.com
September 23, 2025 at 6:44 AM
The "Every Programmer Should Know" repo collects materials on falsehoods, distributed systems, memory, timezones, security, and more. Great for interviews, architecture work, or just rounding out your technical arsenal.
github.com
September 23, 2025 at 5:07 AM
Covering fine-tuning vs post-training quantization, adapter methods, parameter efficiency, and pitfalls.
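As one concrete instance of the adapter/parameter-efficiency idea, a bare-bones LoRA-style linear layer (illustrative only, not the guide's code; the rank and scaling values are assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (W + BA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # ~65K of ~16.8M
```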
Post-training 101 | Tokens for Thoughts
A hitchhiker's guide into LLM post-training, by Han Fang and Karthik A Sankararaman
tokens-for-thoughts.notion.site
September 22, 2025 at 11:20 AM
Dive into the mechanics of paged attention in #vLLM — how chunking, caching & sparse memory access combine to scale transformers.
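A toy sketch of the central bookkeeping (hypothetical names, not vLLM's code): each sequence's logical token positions map through a block table to non-contiguous physical KV blocks, so memory is allocated one small block at a time instead of one large contiguous slab per request.

```python
BLOCK_SIZE = 16            # tokens per KV block

class PagedKVCache:
    """Toy block-table bookkeeping: logical positions -> physical blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> block ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):
            table.append(self.free_blocks.pop())        # allocate on demand
        # Physical address (block id, offset) of this token's K/V slot.
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

cache = PagedKVCache(num_physical_blocks=8)
for pos in range(40):                                    # decode 40 tokens
    block, offset = cache.append_token(seq_id=0, position=pos)
print(cache.block_tables[0])   # e.g. [7, 6, 5]: non-contiguous physical blocks
```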
Paged Attention from First Principles: A View Inside vLLM
Large language models (LLMs) are trained in highly parallel workloads, but serving them is very different: inference is autoregressive and sequential. Optimising inference is critical b...
hamzaelshafie.bearblog.dev
September 22, 2025 at 8:56 AM
Building a fast, efficient LLM service isn’t trivial. In Inside vLLM, get an overview of the architecture and advanced features that push performance to the limits.
Inside vLLM: Anatomy of a High-Throughput LLM Inference System - Aleksa Gordić
From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale.
www.aleksagordic.com
September 22, 2025 at 8:51 AM
Another excellent resource on GPU programming
How To Scale Your Model
Training LLMs often feels like alchemy, but understanding and optimizing the performance of your models doesn't have to. This book aims to demystify the science of scaling language models: how TPUs (a...
jax-ml.github.io
September 19, 2025 at 10:49 AM
One of the best resources out there to understand what happens inside the GPU
The Ultra-Scale Playbook - a Hugging Face Space by nanotron
This application displays detailed training data for large language models (LLMs) on GPU clusters, showing performance metrics and configurations. Users can view the data through an interactive plot.
huggingface.co
September 19, 2025 at 10:49 AM
Explains the transformer architecture, which draws on the strengths of RNNs and residual networks and uses a mechanism called self-attention to improve language processing and training efficiency. It also explains positional encoding, which helps the model understand word order.
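A minimal NumPy sketch of the two pieces mentioned above, single-head scaled dot-product self-attention plus sinusoidal positional encoding (illustrative, not the video's code; shapes and scales are assumptions):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
x = rng.standard_normal((seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
wq, wk, wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)   # (5, 16)
```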
Transformers and Self-Attention (DL 19)
YouTube video by Professor Bryce
www.youtube.com
September 18, 2025 at 6:44 AM
vLLM: Fast LLM Serving with PagedAttention, which partitions the Key-Value (KV) cache to solve memory inefficiencies and boost throughput.
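A back-of-the-envelope sketch of why that partitioning matters (the model numbers are assumptions for a 13B-class model, not figures from the talk):

```python
# Rough KV-cache arithmetic for one sequence (assumed 13B-ish config).
num_layers   = 40
num_kv_heads = 40
head_dim     = 128
bytes_fp16   = 2

# K and V per token, across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~800 KiB

# Naive serving reserves the maximum context up front, even if the request
# only ever generates a few hundred tokens, so most of the slab is wasted.
max_context, actual_tokens = 2048, 300
reserved = max_context * kv_bytes_per_token
used     = actual_tokens * kv_bytes_per_token
print(f"contiguous preallocation wastes {100 * (1 - used / reserved):.0f}%")

# Paged allocation hands out fixed-size blocks on demand, so waste is
# bounded by at most one partially filled block per sequence.
block_tokens = 16
paged = -(-actual_tokens // block_tokens) * block_tokens * kv_bytes_per_token
print(f"paged allocation wastes {100 * (paged - used) / paged:.1f}%")
```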
Fast LLM Serving with vLLM and PagedAttention
YouTube video by Anyscale
www.youtube.com
September 18, 2025 at 6:23 AM