binshi.bsky.social
@binshi.bsky.social
Pushing single-GPU inference throughput to the edge without libraries
⭐️ Fast LLM Inference From Scratch
andrewkchan.dev
October 15, 2025 at 8:29 AM
Ditch complex setups & embrace direct interaction with powerful models like GPT-5-Codex for efficient code generation. It's about intuition, not charades! #AgenticEngineering #AI #GPT5Codex
Just Talk To It - the no-bs Way of Agentic Engineering | Peter Steinberger
A practical guide to working with AI coding agents without the hype.
steipete.me
October 15, 2025 at 8:28 AM
Key advice:
🎯 Focus on goal-driven research over idea-driven research.
📈 Aim for 10X (not 10%) improvements by tackling important problems.
📝 Maintain a research notebook and do regular reviews for continual progress.
An Opinionated Guide to ML Research
joschu.net
October 3, 2025 at 9:25 AM
Keep systems simple, minimize stateful parts, rely on clear schemas/indexes, and prefer queues/events for slow or async tasks. Focus on “hot paths,” log issues, and always design to fail gracefully. Simple, robust design beats clever complexity every time.
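As a minimal sketch of the "prefer queues for slow or async tasks" point (standard-library Python only; the signup/email names are made up for illustration): the request handler writes its state and returns, while the slow work runs off the hot path.

```python
import queue
import threading
import time

def send_welcome_email(user_id: int) -> None:
    # Hypothetical slow task standing in for slow I/O (email, image resize, ...).
    time.sleep(0.5)
    print(f"sent welcome email to user {user_id}")

task_queue: "queue.Queue[int]" = queue.Queue()

def worker() -> None:
    # Runs off the hot path; failures here can be logged and retried
    # without blocking the request that enqueued the work.
    while True:
        user_id = task_queue.get()
        try:
            send_welcome_email(user_id)
        finally:
            task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def handle_signup(user_id: int) -> None:
    # The hot path: persist the user, enqueue the slow part, return fast.
    task_queue.put(user_id)

handle_signup(42)
task_queue.join()
```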
Everything I know about good system design
I see a lot of bad system design advice. One classic is the LinkedIn-optimized “bet you never heard of queues” style of post, presumably aimed at people who are…
www.seangoedecke.com
October 1, 2025 at 8:18 AM
Dive deep into the world of #RLHF! 🤖 The 'Reinforcement Learning from Human Feedback' book by Nathan Lambert offers a gentle introduction to core methods like Reward Modeling, DPO, PPO, and Instruction Tuning for language models.
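For a taste of one of those methods, a minimal sketch of the DPO objective (not the book's code; the log-probabilities and beta value below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) under the trainable policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected completions apart.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probs for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```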
RLHF Book by Nathan Lambert
The Reinforcement Learning from Human Feedback Book
rlhfbook.com
September 29, 2025 at 5:32 AM
Learn how to implement a Byte Pair Encoding (BPE) Tokenizer from scratch. This is the core tokenization algorithm behind LLMs like #GPT2, #GPT4, and #Llama3. The post covers:
✅ BPE Algorithm Outline
✅ Step-by-Step Implementation
✅ Training & Loading GPT-2 Vocabs
#LLMs #DeepLearning #NLP #FromScratch
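A rough sketch of the core BPE training loop (not the notebook's code; simplified to characters rather than bytes, merging the most frequent adjacent pair each round):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)          # real BPE starts from bytes, not characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = train_bpe("low lower lowest", num_merges=5)
print(merges)   # learned merge rules, most frequent first
print(tokens)   # the text re-segmented with the learned merges
```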
Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch
This is a standalone notebook implementing the popular byte pair encoding (BPE) tokenization algorithm, which is used in models like GPT-2 to GPT-4, Llama 3,...
sebastianraschka.com
September 26, 2025 at 3:40 PM
The post traces a FlashAttention call from the PyTorch entry point, through the launcher's setup (grid and block sizes), to the highly optimized Triton JIT kernel code.

#FlashAttention #Triton #LLMs #GPUKernel #DeepLearning
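The FlashAttention kernel itself is long, but the launch path the post describes (Python wrapper, grid computation, @triton.jit kernel) looks the same in miniature. A standard Triton vector-add as a stand-in, not the post's kernel (requires a CUDA-capable GPU):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The "launcher": compute the grid from the problem size, then launch.
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10_000, device="cuda")
y = torch.randn(10_000, device="cuda")
print(torch.allclose(add(x, y), x + y))
```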
Nathan's Blog
nathanchen.me
September 25, 2025 at 8:53 AM
Engineers now need to master the art of articulating product requirements so that agents can interpret them and build them out.
The New Code — Sean Grove, OpenAI
YouTube video by AI Engineer
www.youtube.com
September 24, 2025 at 12:29 PM
By providing a reliable framework for AI agents to handle complex tasks, maintain context, and coordinate multiple actions, the Agents API enables enterprises to use AI in more practical and impactful ways.
Build AI agents with the Mistral Agents API | Mistral AI
mistral.ai
September 24, 2025 at 12:28 PM
For anyone interested in the nuts & bolts of language modelling.
Stanford CS336 Language Modeling from Scratch I 2025 - YouTube
Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpo...
www.youtube.com
September 23, 2025 at 9:49 AM
Reproducibility in LLMs is harder than you think. This post breaks down how floating-point order + batch size variability introduce nondeterminism even at temp=0, and shows how batch-invariant kernels restore consistency.
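A tiny illustration of the floating-point half of the argument (NumPy, not the post's code): the same numbers reduced in a different order give a slightly different sum, which is exactly what changes when batch size alters a kernel's reduction order.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)

# Strictly sequential accumulation vs. a tree-like chunked reduction,
# mimicking how different tiling/batching reorders floating-point adds.
sequential = np.float32(0)
for v in x:
    sequential += v
chunked = np.sum([c.sum() for c in np.split(x, 100)])

print(sequential, chunked, "difference:", abs(float(sequential) - float(chunked)))
```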
Defeating Nondeterminism in LLM Inference
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models. For example, you might observe that asking ChatGPT the...
thinkingmachines.ai
September 23, 2025 at 8:16 AM
When reinforcement learning adds value for language models compared to supervised fine-tuning. Concise and clear notes: gist.github.com/yoavg/6bff0f...
rl-for-llms.md
GitHub Gist: instantly share code, notes, and snippets.
gist.github.com
September 23, 2025 at 6:44 AM
The "Every Programmer Should Know" repo collects materials on falsehoods, distributed systems, memory, timezones, security, and more. Great for interviews, architecture work, or just rounding out your technical arsenal.
github.com
September 23, 2025 at 5:07 AM
Covering fine-tuning vs post-training quantization, adapter methods, parameter efficiency, and pitfalls.
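As one concrete instance of the adapter/parameter-efficiency idea, a bare-bones LoRA-style linear layer (illustrative only, not the guide's code; the rank and scaling values are assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (W + BA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")  # ~65K of ~16.8M
```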
Post-training 101 | Tokens for Thoughts
A hitchhiker's guide into LLM post-training, by Han Fang and Karthik A Sankararaman
tokens-for-thoughts.notion.site
September 22, 2025 at 11:20 AM
Dive into the mechanics of paged attention in #vLLM — how chunking, caching & sparse memory access combine to scale transformers.
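A toy sketch of the central bookkeeping (hypothetical names, not vLLM's code): each sequence's logical token positions map through a block table to non-contiguous physical KV blocks, so memory is allocated one small block at a time instead of one large contiguous slab per request.

```python
BLOCK_SIZE = 16            # tokens per KV block

class PagedKVCache:
    """Toy block-table bookkeeping: logical positions -> physical blocks."""

    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}   # seq_id -> block ids

    def append_token(self, seq_id: int, position: int) -> tuple[int, int]:
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):
            table.append(self.free_blocks.pop())        # allocate on demand
        # Physical address (block id, offset) of this token's K/V slot.
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

cache = PagedKVCache(num_physical_blocks=8)
for pos in range(40):                                    # decode 40 tokens
    block, offset = cache.append_token(seq_id=0, position=pos)
print(cache.block_tables[0])   # e.g. [7, 6, 5]: non-contiguous physical blocks
```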
Paged Attention from First Principles: A View Inside vLLM
Large language models (LLMs) are trained in highly parallel workloads, but serving them is very different: inference is autoregressive and sequential. Optimising inference is critical b...
hamzaelshafie.bearblog.dev
September 22, 2025 at 8:56 AM
Building a fast, efficient LLM service isn’t trivial. In Inside vLLM, get an overview of the architecture and advanced features that push performance to the limits.
Inside vLLM: Anatomy of a High-Throughput LLM Inference System - Aleksa Gordić
From paged attention, continuous batching, prefix caching, specdec, etc. to multi-GPU, multi-node dynamic serving at scale.
www.aleksagordic.com
September 22, 2025 at 8:51 AM
Another excellent resource on GPU programming
How To Scale Your Model
Training LLMs often feels like alchemy, but understanding and optimizing the performance of your models doesn't have to. This book aims to demystify the science of scaling language models: how TPUs (a...
jax-ml.github.io
September 19, 2025 at 10:49 AM
One of the best resources out there to understand what happens inside the GPU
The Ultra-Scale Playbook - a Hugging Face Space by nanotron
This application displays detailed training data for large language models (LLMs) on GPU clusters, showing performance metrics and configurations. Users can view the data through an interactive plot.
huggingface.co
September 19, 2025 at 10:49 AM
Explains the transformer architecture, which draws on the strengths of RNNs and residual networks and uses a mechanism called self-attention to improve language processing and training efficiency. It also explains positional encoding, which helps the model understand word order.
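A minimal NumPy sketch of the two pieces mentioned above, single-head scaled dot-product self-attention plus sinusoidal positional encoding (illustrative, not the video's code; shapes and scales are assumptions):

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
x = rng.standard_normal((seq_len, d_model)) + sinusoidal_positions(seq_len, d_model)
wq, wk, wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)   # (5, 16)
```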
Transformers and Self-Attention (DL 19)
YouTube video by Professor Bryce
www.youtube.com
September 18, 2025 at 6:44 AM
vLLM: Fast LLM Serving with PagedAttention, which partitions the Key-Value (KV) cache to solve memory inefficiencies and boost throughput.
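A back-of-the-envelope sketch of why that partitioning matters (the model numbers are assumptions for a 13B-class model, not figures from the talk):

```python
# Rough KV-cache arithmetic for one sequence (assumed 13B-ish config).
num_layers   = 40
num_kv_heads = 40
head_dim     = 128
bytes_fp16   = 2

# K and V per token, across all layers.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~800 KiB

# Naive serving reserves the maximum context up front, even if the request
# only ever generates a few hundred tokens, so most of the slab is wasted.
max_context, actual_tokens = 2048, 300
reserved = max_context * kv_bytes_per_token
used     = actual_tokens * kv_bytes_per_token
print(f"contiguous preallocation wastes {100 * (1 - used / reserved):.0f}%")

# Paged allocation hands out fixed-size blocks on demand, so waste is
# bounded by at most one partially filled block per sequence.
block_tokens = 16
paged = -(-actual_tokens // block_tokens) * block_tokens * kv_bytes_per_token
print(f"paged allocation wastes {100 * (paged - used) / paged:.1f}%")
```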
Fast LLM Serving with vLLM and PagedAttention
YouTube video by Anyscale
www.youtube.com
September 18, 2025 at 6:23 AM