Ryota Takatsuki
@rtakatsky.bsky.social
PhD student at Sussex Centre for Consciousness Science. Research fellow at AI Alignment Network. Dreaming of reverse-engineering consciousness someday.
This work was done as my internship project at Araya. Huge thanks to my supervisors, Ippei Fujisawa & Ryota Kanai, and my external mentor @soniajoseph.bsky.social for making this happen! 🙏

Link to the paper: arxiv.org/abs/2504.13763
(7/7)
Decoding Vision Transformers: the Diffusion Steering Lens
April 25, 2025 at 9:37 AM
We also validated DSL’s reliability through two interventional studies (head importance correlation & overlay removal). Check out our paper for details!
(6/7)
Below are the DSL visualizations for the top-10 heads, ranked by similarity to the input; they are consistent with the residual-stream visualizations from Diffusion Lens.
(5/7)
To fix this, we propose Diffusion Steering Lens (DSL), a training-free method that steers a specific submodule’s output, patches away its subsequent indirect contributions, and then decodes the resulting representation with the diffusion model.
(4/7)
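For intuition, here is a toy sketch of the steer-then-patch idea (my own illustration with made-up names, not the paper's implementation): write the final residual stream as the input embedding plus the sum of submodule outputs, steer only the target submodule's term, and patch every other term with its value from the clean run, so only the target's direct contribution changes before decoding.

```python
import torch

# Toy sketch of steer-then-patch (illustrative names only, not the paper's code).
# The final residual stream is the input embedding plus the sum of cached
# submodule outputs from a clean forward pass.
torch.manual_seed(0)
d_model = 8
clean_embed = torch.randn(d_model)                        # token embedding
clean_outputs = [torch.randn(d_model) for _ in range(6)]  # cached clean submodule outputs

def steered_final_residual(target_idx, scale=2.0):
    """Steer one submodule's output; patch all others with their clean values,
    so only the target's direct contribution to the final residual changes."""
    residual = clean_embed.clone()
    for i, out in enumerate(clean_outputs):
        residual = residual + (scale * out if i == target_idx else out)
    return residual  # this vector is what would then be decoded by the diffusion model

print(steered_final_residual(target_idx=3))
```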
We first adapted Diffusion Lens (Toker et al., 2024) to decode residual streams in the Kandinsky 2.2 image encoder (CLIP ViT-bigG/14) via the diffusion model.
We can visualize how the predictions evolve through layers, but individual head contributions stay largely hidden.
(3/7)
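A minimal sketch of what "decode a residual stream via the diffusion model" can look like with the public Kandinsky 2.2 checkpoints in diffusers. The model IDs, the zero negative embedding, and the choice to reuse the encoder's final norm and projection are my assumptions for illustration, not the paper's released pipeline.

```python
import torch
from PIL import Image
from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline

# Assumed setup: a CUDA GPU and the public Kandinsky 2.2 checkpoints.
prior = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
).to("cuda")
decoder = KandinskyV22Pipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")

vision = prior.image_encoder  # CLIP ViT-bigG/14 vision tower

image = Image.open("input.png")  # any probe image
pixel_values = prior.image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to("cuda", dtype=torch.float16)

with torch.no_grad():
    hidden = vision(pixel_values=pixel_values, output_hidden_states=True).hidden_states

def decode_layer(layer_idx):
    cls = hidden[layer_idx][:, 0]                  # CLS residual at that layer
    cls = vision.vision_model.post_layernorm(cls)  # reuse the encoder's final norm
    embeds = vision.visual_projection(cls)         # reuse the output projection
    negative = torch.zeros_like(embeds)            # crude stand-in for the unconditional embedding
    return decoder(image_embeds=embeds, negative_image_embeds=negative,
                   num_inference_steps=50, height=512, width=512).images[0]

decode_layer(24).save("layer24.png")
```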
Classic Logit Lens projects residual streams to the output space. It works surprisingly well on ViTs, but visual representations are far richer than class labels.
www.lesswrong.com/posts/kobJym...
(2/7)
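For reference, a minimal logit-lens sketch on a ViT classifier, using a timm ViT as a stand-in rather than the CLIP ViT-bigG/14 from the paper: project the CLS-token residual stream after each block through the model's final norm and classification head.

```python
import torch
import timm

# Load a standard ViT classifier as a stand-in (downloads pretrained weights).
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

cls_residuals = []  # CLS-token residual after each transformer block

def save_cls(module, inputs, output):
    cls_residuals.append(output[:, 0].detach())  # output shape: (batch, tokens, dim)

hooks = [blk.register_forward_hook(save_cls) for blk in model.blocks]

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    model(x)
for h in hooks:
    h.remove()

# Logit-lens readout: reuse the final norm + classification head at every layer.
for layer, cls in enumerate(cls_residuals):
    logits = model.head(model.norm(cls))
    print(f"block {layer}: top class {logits.argmax(-1).item()}")
```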