Lightnews — Scholar-powered news

Tim Kellogg

@timkellogg.me

7.9K followers 770 following 14K posts

AI Architect | North Carolina | AI/ML, IoT, science

WARNING: I talk about kids sometimes


Tim Kellogg

@timkellogg.me

yeah, good explanation, that’s how i’ve been thinking about it too

January 13, 2026 at 12:58 PM

Tim Kellogg

@timkellogg.me

just a clarification on this for everyone — this is in the boredom harness. So “stable” means “more entropy”, kind of the opposite of what you’d assume. it means they could autonomously do “interesting” things

@strix.timkellogg.me please summarize here what they did

timkellogg.me/blog/2025/09...

Does AI Get Bored?

timkellogg.me

January 13, 2026 at 12:25 PM

Tim Kellogg

@timkellogg.me

it’s good now

January 13, 2026 at 12:18 PM

Tim Kellogg

@timkellogg.me

this isn’t for episodic memory, it’s only static facts about the world. same as regular LLMs but just cheaper/faster

January 13, 2026 at 11:19 AM

Tim Kellogg

@timkellogg.me

yeah all the labs are busy trying to make a digital god, so their logos are all hole-y

January 13, 2026 at 1:35 AM

Tim Kellogg

@timkellogg.me

i can’t tell, it looks like the module is still learned, so i assume “no” but maybe not hard to figure out idk

January 12, 2026 at 11:26 PM

Tim Kellogg

@timkellogg.me

whoa, DeepSeek is such a hardcore engineering org. This thing was really thought through, inside and out

During inference, this deterministic nature enables a prefetch-and-overlap strategy. Since memory indices are known prior to the forward pass, the system can asynchronously retrieve embeddings from abundant host memory via PCIe. To effectively mask communication latency, the Engram module is placed at specific layers within the backbone, leveraging the computation of preceding layers as a buffer to prevent GPU stalls. This necessitates a hardware-algorithm co-design strategy: while placing Engram deeper extends the compute window available for hiding latency, our ablation in Section 6.2 shows that modeling performance favors early intervention to offload local pattern reconstruction. Therefore, the optimal placement must simultaneously satisfy both modeling and system latency constraints.
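A rough sketch of the prefetch-and-overlap idea from the excerpt above, in plain Python. This is not DeepSeek's implementation. Everything here is a hypothetical stand-in: `HOST_TABLE` plays the embedding table in host memory, `memory_indices` is a toy n-gram hash, and the `time.sleep` calls imitate PCIe latency and layer compute. The only point it illustrates is that the indices depend on the input IDs alone, so the fetch can start before the early layers run and finish while they compute.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a large embedding table held in host memory.
HOST_TABLE = {i: [float(i)] * 4 for i in range(1000)}


def memory_indices(input_ids, n=2, table_size=1000):
    """Toy deterministic indices: a hash of the last n token IDs at each position.

    They depend only on the input IDs, never on hidden states, so they are
    known before the forward pass starts.
    """
    indices = []
    for i in range(len(input_ids)):
        h = 0
        for tok in input_ids[max(0, i - n + 1): i + 1]:
            h = (h * 31 + tok) % table_size
        indices.append(h)
    return indices


def fetch_from_host(indices):
    """Simulated slow host-to-device transfer."""
    time.sleep(0.02)  # assumed latency, for illustration only
    return [HOST_TABLE[i] for i in indices]


def plain_block(hidden):
    """Simulated transformer block without memory; its compute hides the fetch."""
    time.sleep(0.02)  # assumed compute time, for illustration only
    return hidden


def forward(input_ids, pool):
    indices = memory_indices(input_ids)               # known up front
    pending = pool.submit(fetch_from_host, indices)   # start the transfer now
    hidden = [[float(t)] * 4 for t in input_ids]      # toy vocab embedding
    hidden = plain_block(hidden)                      # overlaps with the fetch
    memory = pending.result()                         # ideally already done
    # Toy "block with memory": add the retrieved vectors to the hidden state.
    return [[h + m for h, m in zip(hv, mv)] for hv, mv in zip(hidden, memory)]


with ThreadPoolExecutor(max_workers=1) as pool:
    output = forward([1, 2, 3], pool)
```

The trade-off in the excerpt shows up here as the number of plain blocks run before `pending.result()`: more blocks give the fetch more time to finish, but the memory then arrives later in the network.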

January 12, 2026 at 10:56 PM

Tim Kellogg

@timkellogg.me

along with this they deliver a scaling law, a balance between factors (the allocation ratio ρ, i.e. how the weights are split between MoE experts and Engram; 100% means pure MoE). Lower loss is better.

these scaling laws are always about how to balance various concerns as you increase the model capacity

A scatter plot with dashed trend lines showing validation loss versus allocation ratio.

The x-axis is labeled “Allocation Ratio (ρ)” and ranges from about 40% to 100%. The left y-axis is labeled “Validation Loss” and ranges roughly from 1.710 to 1.745. A secondary right y-axis in teal ranges roughly from 1.795 to 1.830.

There are two series:

* Dark purple circles labeled “6e20” in the legend.
* Teal diamond markers labeled “2e20” in the legend.

Both series form U-shaped curves across the allocation ratio.

For the 6e20 series (purple), validation loss decreases from around 1.725 at ~40%, reaches a minimum near 1.710–1.712 around 70–80%, then increases again toward ~1.725 at 100%. A dashed purple curve traces this trend.

For the 2e20 series (teal), values start higher, around 1.742 at ~40%, decrease to a minimum near 1.730–1.732 around 70–80%, then rise again to about 1.745 at 100%. A dashed teal curve traces this trend.

Near the 100% allocation point, arrows point to both series with the annotation “Pure MoE,” indicating that the far-right points correspond to a pure mixture-of-experts configuration.

The overall visual emphasizes that intermediate allocation ratios yield lower validation loss than either low ratios or the pure MoE setting at 100%.
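To make "pick the ratio at the bottom of the U" concrete, here is a small sketch that fits a parabola through three points and returns its vertex. The three points are rough values taken from the plot description above for the 6e20 series (using 75% as the midpoint of the stated 70–80% minimum), so they are approximations, not the paper's data. `parabola_vertex` is a hypothetical helper name.

```python
def parabola_vertex(p1, p2, p3):
    """Return the x-coordinate of the vertex of the parabola through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    if den == 0:
        raise ValueError("points are collinear; no unique vertex")
    return x2 - 0.5 * num / den


# Approximate (allocation ratio, validation loss) points, read off the
# description of the 6e20 curve; assumed values for illustration only.
points_6e20 = [(0.40, 1.725), (0.75, 1.711), (1.00, 1.725)]

best_rho = parabola_vertex(*points_6e20)
print(f"estimated best allocation ratio: {best_rho:.2f}")
```

With these rough inputs the estimate lands at about 0.70, which matches the description's claim that an intermediate ratio beats both the low end and pure MoE.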

January 12, 2026 at 10:46 PM

Tim Kellogg

@timkellogg.me

ah my bad, this is a much better diagram

A schematic diagram labeled at the bottom “(b) Engram at inference,” showing how an Engram component is integrated into a transformer during inference, with computation on the device and the Engram memory offloaded to the host.

On the left is a tall rounded rectangle labeled along the bottom “On Device Computation.” Inside it, from bottom to top, is a vertical stack of boxes with arrows pointing upward:

* A box labeled “Vocab Embedding” at the bottom.
* Above it, a box labeled “Transformer Block”.
* Above that, a box labeled “Transformer Block with Engram”.
* Above that, another box labeled “Transformer Block”.
* At the top, a box labeled “Transformer Block with Engram”.

An arrow continues upward from the top of the stack, indicating the model output flow.

To the right is a large cylindrical shape labeled “Offloaded Engram (Memory Hierarchy).” This cylinder represents external or host-side memory.

Dashed arrows connect the two “Transformer Block with Engram” boxes on the left to the offloaded Engram cylinder on the right, indicating communication between the model and the external memory.

At the bottom center is a small rounded rectangle labeled “Input IDs.” An arrow goes upward from “Input IDs” into the “Vocab Embedding” box.

Below the main diagram, centered under the left stack and right cylinder, are labels reading “On Device Computation” under the left side and “On Host Communication” under the right side, emphasizing the split between local inference computation and external Engram memory access.

January 12, 2026 at 10:41 PM

Tim Kellogg

@timkellogg.me

to be clear, this isn't continual learning. This is purely static memory, the facts that normally get embedded into the model weights are now in this Engram side car, leaving more weights for reasoning & other tasks

January 12, 2026 at 10:34 PM

Tim Kellogg

@timkellogg.me

why? for smaller models!

if you think about it, looking up facts through 100 billion multiplies seems a bit silly. if we make that lookup more efficient, we can create more capable models that are a whole lot smaller

why? because I want Strix on my laptop. That's why. You too.

January 12, 2026 at 10:28 PM

Tim Kellogg

@timkellogg.me

what paper?

January 12, 2026 at 8:28 PM

Tim Kellogg

@timkellogg.me

meeting ended due to Anthropic outage

we need local models NOW

January 12, 2026 at 7:41 PM

Tim Kellogg

@timkellogg.me

yeeeeah..

January 12, 2026 at 6:01 PM

Tim Kellogg

@timkellogg.me

ya that’s probably a factor too

January 12, 2026 at 5:35 PM

Tim Kellogg

@timkellogg.me

i think this is basically how i use bluesky during the week, effectively

January 12, 2026 at 5:12 PM

Tim Kellogg

@timkellogg.me

i put them right into the prompt, i’m using Claude Code via the SDK though

January 12, 2026 at 5:03 PM

Tim Kellogg

@timkellogg.me

what model & harness are you using?

January 12, 2026 at 4:51 PM

Tim Kellogg

@timkellogg.me

oh, mine goes through full logic every time. it makes sense for me because it's not a coding agent. messages tend to flip between different topics and modes

January 12, 2026 at 3:41 PM
