justinhjohnson.com
Justin
@justinhjohnson.com
Executive Director @ AstraZeneca | Nexus of Data, Science, Tech | Global Business Leader | Top Data Science Voice | #datascience #AI #buildinpublic #indiehacker

Blog rundatarun.io
BuildInPublic jandsgroupllc.com
A smart cascade for LLM+human decision-making: calibrate confidence, defer to bigger models when needed, abstain to experts when unsure, and learn thresholds online. Big ΔIBC gains on ARC; lower regret in 4/5 online tests. Paper: bit.ly/4qO7eXU
#LLM #AISafety #MLOps
bit.ly
January 8, 2026 at 3:49 PM
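Not the paper's method, just a minimal sketch of the cascade shape described above: answer with the small model when calibrated confidence clears a threshold, defer to the larger model otherwise, and abstain to a human when even that is unsure. Stub models and thresholds are hypothetical; the online threshold learning is omitted.

```python
import random

def small_lm(q):   # stand-in: returns (answer, calibrated confidence)
    return "small-model answer", random.random()

def large_lm(q):
    return "large-model answer", random.random()

def cascade(query, tau_defer=0.7, tau_abstain=0.5):
    ans, conf = small_lm(query)
    if conf >= tau_defer:          # confident enough: answer cheaply
        return ans, "small"
    ans, conf = large_lm(query)    # otherwise defer to the bigger model
    if conf >= tau_abstain:
        return ans, "large"
    return None, "expert"          # abstain: hand off to a human

print(cascade("Is this claim supported?"))
```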
ToolOrchestra trains an 8B RL orchestrator to route across tools & stronger LLMs: 37.1% on HLE vs GPT-5’s 35.1%, at big cost savings; open code, model & data. arxiv.org/abs/2511.21689 #AI #LLM #ToolUse
ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally…
arxiv.org
January 7, 2026 at 1:11 AM
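Schematically (not the paper's actual policy), cost-aware routing can be read as picking the action whose learned value best trades success odds against invocation cost. The tool names, prices, and value stub below are made up.

```python
# Hypothetical cost table: tools and models the orchestrator can call.
COST = {"web_search": 0.01, "python": 0.02, "small_lm": 0.05, "frontier_lm": 1.0}

def q_value(state, action):        # stand-in for a learned value model
    return {"web_search": 0.4, "python": 0.5,
            "small_lm": 0.6, "frontier_lm": 0.9}[action]

def route(state, lam=0.3):
    # score = expected success minus a cost penalty; pick the argmax
    scores = {a: q_value(state, a) - lam * c for a, c in COST.items()}
    return max(scores, key=scores.get)

print(route("hard HLE question"))  # -> "frontier_lm" under these numbers
```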
RLMs treat the prompt as data inside a REPL and let the LM recurse on snippets—handling 10M+ tokens and beating long-context baselines on tough tasks. Simple idea, big wins. Paper: arxiv.org/abs/2512.24601 #RLM #LLM #AIResearch
Recursive Language Models
We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference…
arxiv.org
January 5, 2026 at 1:12 AM
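A toy of the recursive shape only (the paper's agent actually manipulates the prompt inside a Python REPL): split an oversized prompt, query recursively, combine. `lm` is a hypothetical short-context model call.

```python
def lm(prompt):                      # stand-in for a short-context model
    return f"[answer to {len(prompt)}-char prompt]"

def rlm(context, question, budget=4000):
    if len(context) <= budget:       # fits: answer directly
        return lm(context + "\n\nQ: " + question)
    mid = len(context) // 2          # too big: recurse on halves
    parts = [rlm(context[:mid], question, budget),
             rlm(context[mid:], question, budget)]
    return lm("Combine these partial answers:\n" + "\n".join(parts))

print(rlm("x" * 10_000_000, "What happened?"))   # 10M+ chars, no overflow
```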
DLCM reframes LMs: learn semantic boundaries, reason in a compressed concept space, and decode back to tokens. +2.69% avg on 12 zero-shot tasks at matched FLOPs; new compression-aware scaling law + decoupled μP. Paper: huggingface.co/papers/2512.... #NLP #ScalingLaws #LLMs
Paper page - Dynamic Large Concept Models: Latent Reasoning in an Adaptive Semantic Space
Join the discussion on this paper page
huggingface.co
January 4, 2026 at 12:20 AM
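The compression step, in spirit: pool token embeddings into variable-length concept vectors at predicted semantic boundaries, then reason in that smaller space. A loose numpy illustration, not the paper's architecture.

```python
import numpy as np

def to_concepts(token_embs, boundaries):
    """token_embs: (T, d) array; boundaries: indices where segments end."""
    segments = np.split(token_embs, boundaries[:-1])
    return np.stack([seg.mean(axis=0) for seg in segments])  # (K, d), K << T

embs = np.random.randn(12, 8)             # 12 tokens, dim 8
concepts = to_concepts(embs, [4, 9, 12])  # three semantic segments
print(concepts.shape)                     # (3, 8): reason here, decode later
```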
Can LLMs be world models? This arXiv study reframes next-token → next-state, shows strong long-horizon transfer, scaling laws, and real agent gains (verification, synthetic data, RL warm-starts). Read: arxiv.org/abs/2512.18832 #AI #WorldModels #ReinforcementLearning
From Word to World: Can Large Language Models be Implicit Text-based World Models?
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a…
arxiv.org
January 1, 2026 at 1:53 AM
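The reframing in code form, with a hypothetical `lm` stub: treat the LM as a transition function from (state, action) text to next-state text, then roll it out.

```python
def lm(prompt):                                  # hypothetical model call
    return "[predicted next state]"

def step(state, action):
    return lm(f"State:\n{state}\nAction:\n{action}\nNext state:")

def rollout(state, policy, horizon=5):
    traj = [state]
    for _ in range(horizon):
        traj.append(step(traj[-1], policy(traj[-1])))
    return traj                                  # imagined trajectory

print(rollout("door closed", lambda s: "open door", horizon=2))
```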
CTM re-centers time & synchrony in neural nets: per-neuron temporal models + synchronization as the latent rep → adaptive compute, strong maze planning/generalization, calibrated ImageNet, interpretable parity strategies. Read: arxiv.org/abs/2505.05522 #NeurIPS #DeepLearning #AI
Continuous Thought Machines
Biological brains demonstrate complex neural activity, where neural dynamics are critical to how brains process information. Most artificial neural networks ignore the complexity of individual…
arxiv.org
December 28, 2025 at 9:01 AM
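One way to picture "synchronization as the latent representation," loosely and not the paper's exact construction: record each neuron's activation over internal ticks and use pairwise correlations as the feature vector.

```python
import numpy as np

def sync_representation(history):
    """history: (T, N) activations of N neurons over T internal ticks."""
    corr = np.corrcoef(history.T)            # (N, N) synchronization matrix
    iu = np.triu_indices_from(corr, k=1)
    return corr[iu]                          # N*(N-1)/2 pairwise features

hist = np.random.randn(50, 16)               # 50 ticks, 16 neurons
print(sync_representation(hist).shape)       # (120,)
```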
Agent-R1 frames RL for agentic LLMs (extended MDP) and ships a modular end-to-end training stack. On multi-hop QA, RL beats RAG/base tool calling with notable gains. Code (MIT) inside. Paper: arxiv.org/abs/2511.14460 #LLMAgents #ReinforcementLearning #NLP
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning
Large Language Models (LLMs) are increasingly being explored for building Agents capable of active environmental interaction (e.g., via tool use) to solve complex problems. Reinforcement Learning…
arxiv.org
December 24, 2025 at 8:10 PM
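The extended-MDP framing reduces to a familiar loop: the state is the dialogue plus tool results so far, actions are model outputs (including tool calls), and reward arrives at the end. A generic sketch with hypothetical stubs, not Agent-R1's actual stack.

```python
class ToyEnv:                       # hypothetical multi-hop QA environment
    def reset(self):
        return "Q: Who directed the film its star later remade?"
    def step(self, action):
        # execute tool calls, append observation, score final answers
        done = action.startswith("ANSWER:")
        return "...tool result appended...", (1.0 if done else 0.0), done

def episode(policy, env, max_turns=8):
    state, total = env.reset(), 0.0
    for _ in range(max_turns):
        action = policy(state)                 # text or tool call
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total                               # signal for RL updates

print(episode(lambda s: "ANSWER: ...", ToyEnv()))
```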
SAGE: an RL framework that teaches LLM agents to create & reuse executable skills via Sequential Rollout + Skill-integrated Reward. On AppWorld it boosts SGC and slashes tokens vs GRPO. Paper: arxiv.org/abs/2512.17102 #ReinforcementLearning #LLMAgents #SkillLibrary
Reinforcement Learning for Self-Improving Agent with Skill Library
Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions but struggle to continuously improve and adapt when deployed in new…
arxiv.org
December 24, 2025 at 10:06 AM
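The skill-library half of the idea, as a toy: persist executable skills discovered during rollouts and retrieve them by description, so later episodes reuse instead of rederive. Naive keyword retrieval stands in for whatever the paper actually uses.

```python
class SkillLibrary:
    def __init__(self):
        self.skills = {}                       # name -> (fn, description)

    def add(self, name, fn, description):
        self.skills[name] = (fn, description)

    def retrieve(self, query):
        q = query.lower()
        return [name for name, (_, desc) in self.skills.items()
                if q in desc.lower()]

lib = SkillLibrary()
lib.add("sum_bill", lambda items: sum(items), "total a list of expenses")
print(lib.retrieve("expenses"))                # ['sum_bill']
```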
LaMer brings meta-RL to LLM agents: cross-episode credit + in-context reflection = stronger exploration, better pass@3 & OOD generalization across Sokoban, Minesweeper, Webshop, ALFWorld. Paper: arxiv.org/abs/2512.16848 #MetaRL #LLMAgents #ReinforcementLearning
Meta-RL Induces Exploration in Language Agents
Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn long-horizon tasks. However, the RL-trained agents…
arxiv.org
December 23, 2025 at 2:15 AM
JustRL shows a single-stage, fixed-hyperparam RL recipe can push 1.5B math LLMs to SOTA with ~½ the compute—no fancy schedules needed. Smooth training, transferable across backbones, code+models released. www.alphaxiv.org/abs/2512.16649 #ReinforcementLearning #LLMs #NLP
JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
View recent discussion. Abstract: Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules,...
www.alphaxiv.org
December 22, 2025 at 10:30 PM
DeepCode turns papers into production-grade repos via blueprint distillation, code memory, RAG, and closed-loop fixes—posting SOTA on PaperBench and even topping PhD experts on a 3-paper subset. Paper: arxiv.org/abs/2512.07921 #AI #SoftwareEngineering #LLMAgents
DeepCode: Open Agentic Coding
Recent advances in large language models (LLMs) have given rise to powerful coding agents, making it possible for code assistants to evolve into code engineers. However, existing methods still face…
arxiv.org
December 21, 2025 at 7:48 PM
New preprint: Evaluating LLMs in Scientific Discovery introduces SDE—expert-grounded scenarios + project-level tasks (hypotheses, experiments, interpretation). Big gap vs. generic QA; scaling helps less than hoped. Read: arxiv.org/abs/2512.15567 #AI #Science #LLMs
Evaluating Large Language Models in Scientific Discovery
Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis…
arxiv.org
December 21, 2025 at 12:41 PM
New on arXiv: “Learning Dynamics of LLM Finetuning.” A unified view of SFT & DPO reveals a squeezing effect driving confidence decay in off-policy DPO—and a simple SFT tweak that boosts downstream wins. arxiv.org/abs/2407.10490 #LLM #RLHF #MLResearch @arxiv
Learning Dynamics of LLM Finetuning
Learning dynamics, which describes how the learning of specific training examples influences the model's predictions on other examples, gives us a powerful tool for understanding the behavior of deep…
arxiv.org
December 21, 2025 at 8:42 AM
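For context, the standard DPO objective the paper analyzes, with policy $\pi_\theta$, frozen reference $\pi_{\mathrm{ref}}$, and preferred/dispreferred responses $y_w, y_l$; the "squeezing effect" concerns how its gradient reallocates probability mass when training off-policy.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```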
A clean framework for adapting agentic AI: adapt the agent or the tools, with signals from execution or outputs—yielding four practical paradigms + design guidance. Read the survey: huggingface.co/papers/2512.... #AIagents #LLM #MLresearch
Paper page - Adaptation of Agentic AI
Join the discussion on this paper page
huggingface.co
December 20, 2025 at 9:14 AM
Can AI scale by building teams instead of just bigger models? This concept paper maps regimes (debate/collab/coordination), proposes collective scaling laws, and calls for multi-agent pretraining & benchmarks. www.preprints.org/manuscript/2... #LLM #MultiAgent #AIResearch
www.preprints.org
December 18, 2025 at 12:16 AM
ReFusion = diffusion planner + parallel AR infiller at slot level with full KV-cache reuse. On 7 benches: +34% vs prior MDMs, >18× faster; 2.33× faster than strong ARMs while narrowing the gap. arxiv.org/abs/2512.13586 #LLM #Diffusion #NLP
ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational…
arxiv.org
December 16, 2025 at 10:36 AM
3 AM. Jetlag. An idea that wouldn't let me sleep.

What happens when you combine Claude Code Mobile with Gemini 3 Pro Image in a hotel room before sunrise?

Spoiler: You don't just build an app. You build a time machine.

Full story of what emerged from those pre-coffee hours → bit.ly/48zl89U

#AI
3 AM, A Phone, and a Time Machine
Building The Chronoscope Before Coffee
bit.ly
December 13, 2025 at 12:15 PM
SPICE proposes corpus-grounded self-play: one LLM plays Challenger (with docs) and Reasoner (without) to auto-curriculum its way to better reasoning—showing +8.9% (math) and +9.8% (general) gains across models. Read: arxiv.org/abs/2510.24684 #LLM #ReinforcementLearning #NLP
SPICE: Self-Play In Corpus Environments Improves Reasoning
Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts…
arxiv.org
December 10, 2025 at 12:28 AM
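The self-play loop, schematically (all stubs hypothetical): the same model, as Challenger with a corpus document, poses a question it can ground; as Reasoner without the document, it answers; the checked reward trains both roles.

```python
def challenger(doc):                 # stand-in: question + grounded answer
    return {"question": "What year was X founded?", "answer": "1999"}

def reasoner(question):              # stand-in: answers with no document
    return "1999"

def judge(pred, gold):               # e.g., exact match; could be LM-graded
    return float(pred.strip() == gold.strip())

def spice_round(doc):
    qa = challenger(doc)             # Challenger sees the corpus document
    pred = reasoner(qa["question"])  # Reasoner does not
    return judge(pred, qa["answer"]) # reward for RL updates on both roles

print(spice_round("...corpus document text..."))   # 1.0
```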
I'm bullish on AI in oncology. But we need facts, not feelings.

Be critical of implementation.
Demand transparency.
Engage with the science.

The future isn't choosing between power OR transparency.

It's building both.

Full post: bit.ly/4nYU34d
November 14, 2025 at 10:22 AM
A 2024 study tested 32 AI models for protein interactions. All claimed 90-99% accuracy.

When properly validated? All dropped to 50% (random chance).

They learned shortcuts, not biology.

Interpretability catches this BEFORE deployment. That's why it matters.
November 14, 2025 at 10:22 AM
The capabilities are real:

• AlphaFold: 214M protein structures (Nobel Prize 2024)
• Insilico Medicine: AI drug to Phase IIa in 30 months vs 4-6 years
• IDx-DR: First FDA-approved autonomous AI diagnostic

But here's what matters more: implementation.
November 14, 2025 at 10:22 AM
We're having the wrong conversation about AI in cancer research.

One side: "Trust the black box, it works"
Other side: "Can't trust what we don't understand"

Both miss the point.

The real question: How do we build AI we can understand, validate, and improve?

New post 🧵
November 14, 2025 at 10:22 AM