AutoSteer 🕵️‍♂️

The Feature Hunter Agent for LLM Interpretability

AutoSteer is a mechanistic interpretability agent designed to extract, validate, and steer high-level features in Large Language Models (LLMs) using Activation Addition.

"We don't just find the feature; we prove it's a vector."

🧬 The Philosophy (Rigorous Interrogation)

Unlike simple steering scripts, AutoSteer is built on a foundation of rigorous mechanistic inquiry. We do not blindly add vectors; we interrogate the model's internal physics.

1. The Mechanics

We intervene in the Residual Stream (hook_resid_pre).

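Why? The residual stream is the shared channel that every attention and MLP layer reads from and writes to. A vector added at hook_resid_pre (the input to a block) is therefore visible to every downstream layer, whereas patching a single attention head changes only that head's contribution to token mixing.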
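To make the "global state change" concrete, here is a minimal sketch on a toy residual network. It is not AutoSteer's code and does not use TransformerLens: the toy blocks, `run`, `steer_layer`, and `steer_vec` are hypothetical names chosen for illustration. It only shows that a vector added at one block's input leaves earlier blocks untouched and propagates to everything after it.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS = 8, 4

# Toy "blocks": each one reads the residual stream and writes an update back into it.
weights = [rng.normal(scale=0.5, size=(D_MODEL, D_MODEL)) for _ in range(N_LAYERS)]

def run(x, steer_layer=None, steer_vec=None):
    """Return the stream seen at the input of every block, plus the final state."""
    resid_pre = []
    for layer, w in enumerate(weights):
        if layer == steer_layer:
            x = x + steer_vec       # intervention at this block's input
        resid_pre.append(x.copy())
        x = x + np.tanh(w @ x)      # residual update: x_{l+1} = x_l + f_l(x_l)
    return resid_pre, x

x0 = rng.normal(size=D_MODEL)
steer = np.zeros(D_MODEL)
steer[0] = 1.0                      # unit vector along the first dimension

clean_pre, clean_out = run(x0)
steered_pre, steered_out = run(x0, steer_layer=2, steer_vec=steer)

for layer in range(N_LAYERS):
    delta = np.linalg.norm(steered_pre[layer] - clean_pre[layer])
    print(f"layer {layer}: |delta resid_pre| = {delta:.3f}")
```

In a real model the same idea would be applied through a hook on the chosen block's residual input rather than an explicit loop.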

2. The Linearity Assumption

We operate under the Linear Representation Hypothesis (Elhage et al., 2022).

Verification: We assume features like "Pirate Persona" exist as linear directions in activation space. AutoSteer tests this by computing the difference of means, Mean(Positive) - Mean(Neutral), over the activations of contrasting prompts, and then validating the norm of the resulting vector.
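The difference-of-means step can be sketched as follows. This is an illustration on synthetic data, not the code in src/main.py: `mean_difference` is a hypothetical helper, the arrays stand in for residual-stream activations of shape (n_prompts, d_model), and the planted direction exists only so the result can be checked.

```python
import numpy as np

def mean_difference(positive, neutral):
    """Difference-of-means direction; inputs have shape (n_prompts, d_model)."""
    positive = np.asarray(positive, dtype=float)
    neutral = np.asarray(neutral, dtype=float)
    return positive.mean(axis=0) - neutral.mean(axis=0)

# Synthetic stand-ins for activations: "positive" prompts are shifted
# along one planted direction, "neutral" prompts are not.
rng = np.random.default_rng(0)
d_model = 16
true_direction = np.zeros(d_model)
true_direction[3] = 1.0
neutral_acts = rng.normal(size=(200, d_model))
positive_acts = rng.normal(size=(200, d_model)) + 4.0 * true_direction

v = mean_difference(positive_acts, neutral_acts)
cosine = v @ true_direction / np.linalg.norm(v)
print(f"norm = {np.linalg.norm(v):.2f}, cosine to planted direction = {cosine:.2f}")
```

A norm close to zero would suggest the two prompt sets are not separated along any single direction at that layer, which is one plausible reading of "validating the vector's norm".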

3. Safety via Normalization

We enforce Unit Normalization.

The Problem: Raw activation differences can have arbitrary magnitudes. Adding a vector with norm 100 to a residual stream with norm 10 swamps the original activations and destroys the signal.
The Solution: We normalize the steering vector, $\hat{v} = v / \|v\|_2$, and introduce an explicit STEERING_COEFF (default: 5.0) so that injection strength is controlled by a single scalar.
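As a worked example of the normalization, the sketch below uses the magnitudes from the problem statement (a raw vector of norm 100 and a stream of norm 10). `unit_normalize` and the zero-norm guard are assumptions added for illustration; only STEERING_COEFF and its default of 5.0 come from the text above.

```python
import numpy as np

STEERING_COEFF = 5.0  # default injection strength stated in the README

def unit_normalize(v, eps=1e-8):
    """Scale v to unit L2 norm; reject degenerate (near-zero) vectors."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm < eps:
        raise ValueError("steering vector has (near-)zero norm")
    return v / norm

raw = np.array([60.0, 80.0, 0.0])    # raw difference vector, norm 100
resid = np.array([6.0, 0.0, 8.0])    # residual-stream activation, norm 10

v_hat = unit_normalize(raw)                  # [0.6, 0.8, 0.0]
steered = resid + STEERING_COEFF * v_hat     # [9.0, 4.0, 8.0]

print("injected norm:", np.linalg.norm(steered - resid))
```

After normalization the injected update always has norm STEERING_COEFF, regardless of how large the raw difference was.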

⚡ Quick Start

Prerequisites

Python 3.10+
uv package manager (recommended for speed) or pip.
Apple Silicon (MPS) or NVIDIA GPU (CUDA) recommended.

Installation

Clone & Setup:

git clone https://github.com/your-username/autosteer.git
cd autosteer
make setup

Running the Agent

The command below runs src/main.py, which loads gpt2-small, computes a "Pirate" steering vector, and generates steered text.

make run

Expected Output:

--> [Init] Loading gpt2-small on mps...
--> [Analysis] Extracting vectors from 3 pairs...
--> [Success] Steering vector isolated and normalized.

--> [Gen] Prompt: 'I went to the grocery store and' | Steered: True
    Output: I went to the grocery store and... Arr matey! I plundered the snacks!

🔬 Interactive Deep Dive (Marimo)

For a visual interrogation of the model, run our local Marimo notebook. This is where you can "touch" the math.

uv run marimo edit notebook/deep_dive.py

What you can explore:

Mechanics: Probe the residual stream magnitude.
Linearity: Visualize "Pirate" vs "Neutral" clusters using PCA.
Normalization Slider: Adjust the injection strength in real time.
Layer Sweep: Automatically find which layer has the strongest feature separation.
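One way a layer sweep could work is sketched below. The README does not say which separation metric the notebook uses, so `separation_score` (distance between class means divided by the average per-dimension spread) is an assumption, as are `layer_sweep` and the synthetic per-layer activations.

```python
import numpy as np

def separation_score(positive, neutral):
    """Distance between class means, in units of the average per-dimension spread."""
    gap = np.linalg.norm(positive.mean(axis=0) - neutral.mean(axis=0))
    spread = 0.5 * (positive.std(axis=0).mean() + neutral.std(axis=0).mean())
    return gap / spread

def layer_sweep(positive_by_layer, neutral_by_layer):
    """Score every layer and return (index of best layer, list of scores)."""
    scores = [separation_score(p, n)
              for p, n in zip(positive_by_layer, neutral_by_layer)]
    return int(np.argmax(scores)), scores

# Synthetic activations: the "positive" class is shifted by a different
# amount at each layer, with the largest shift planted at layer 2.
rng = np.random.default_rng(0)
d_model, n_prompts = 16, 100
shifts = [0.5, 1.0, 3.0, 1.5]
neutral_by_layer = [rng.normal(size=(n_prompts, d_model)) for _ in shifts]
positive_by_layer = []
for shift in shifts:
    acts = rng.normal(size=(n_prompts, d_model))
    acts[:, 0] += shift
    positive_by_layer.append(acts)

best, scores = layer_sweep(positive_by_layer, neutral_by_layer)
print("best layer:", best)
```

With real activations, each list entry would hold the residual-stream activations collected at one layer for the same set of prompts.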

📂 Project Structure

autosteer/
├── src/
│   └── main.py          # Core Agent (Production Logic)
├── notebook/
│   └── deep_dive.py     # Interactive visual interrogation (Marimo)
├── .context/            # Project history & tracking
│   ├── changelog.md
│   └── summary.md
├── tests/               # Consistency checks
├── Makefile             # Automation
├── pyproject.toml       # Modern dependency management
└── PRD.md               # Product Requirements & Research Goals

📚 Citations & Theory

Activation Addition: Turner et al. (2023) - Activation Addition: Steering Language Models Without Optimization.
Linear Representation: Elhage et al. (2022) - Toy Models of Superposition.
TransformerLens: Nanda (2022) - The library that makes this possible.

Built with ❤️ and scientific rigor by Uday Phalak
