AutoSteer is a mechanistic interpretability agent designed to extract, validate, and steer high-level features in Large Language Models (LLMs) using Activation Addition.
"We don't just find the feature; we prove it's a vector."
Unlike simple steering scripts, AutoSteer is built on a foundation of rigorous mechanistic inquiry. We do not blindly add vectors; we interrogate the model's internal physics.
We intervene in the Residual Stream (hook_resid_pre).
- Why? The residual stream is the "bandwidth" of the model. Injecting here effects a global state change, whereas injecting into attention heads only alters local token mixing.
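To make the "global state change" claim concrete, here is a dependency-free toy sketch rather than AutoSteer's actual code. It assumes a model whose blocks each read the running stream and add an update to it. `run_toy_model`, `blocks`, and the numbers are hypothetical illustrations. In the real agent the same idea would presumably be a hook registered at hook_resid_pre.

```python
from typing import Callable, List, Optional

Vector = List[float]
Block = Callable[[Vector], Vector]  # reads the stream, returns an additive update


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def run_toy_model(
    x: Vector,
    blocks: List[Block],
    inject_at: Optional[int] = None,
    steering: Optional[Vector] = None,
) -> List[Vector]:
    """Return the stream each layer reads (its 'resid_pre'), plus the final stream."""
    stream = list(x)
    snapshots = []
    for layer, block in enumerate(blocks):
        if layer == inject_at and steering is not None:
            # Hypothetical stand-in for a hook at this layer's resid_pre.
            stream = add(stream, steering)
        snapshots.append(stream)
        # Each block reads the whole stream and writes its update back into it.
        stream = add(stream, block(stream))
    snapshots.append(stream)
    return snapshots


# Three identical toy blocks that each add half of the stream back onto it.
blocks: List[Block] = [lambda s: [0.5 * v for v in s]] * 3

baseline = run_toy_model([1.0, 0.0], blocks)
steered = run_toy_model([1.0, 0.0], blocks, inject_at=1, steering=[0.0, 1.0])

for layer, (b, s) in enumerate(zip(baseline, steered)):
    print(layer, b, s)
```

Layers before the injection point see identical inputs, while every later layer reads the shifted stream, so the injected direction carries through to the final state.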
We operate on the Linear Representation Hypothesis (Elhage et al., 2022).
- Verification: We assume features like "Pirate Persona" exist as linear directions. AutoSteer verifies this by calculating Mean(Positive) - Mean(Neutral) and validating the vector's norm.
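The difference-of-means step can be sketched in plain Python as follows. This is a minimal illustration on made-up 3-dimensional "activations", not the code in src/main.py. The helper names and the specific norm check (finite and above a small threshold) are assumptions, since the text above does not say exactly what is validated.

```python
import math
from statistics import fmean
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a batch of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def difference_of_means(
    positive: Sequence[Sequence[float]], neutral: Sequence[Sequence[float]]
) -> List[float]:
    """Mean(Positive) - Mean(Neutral), computed per dimension."""
    return [p - n for p, n in zip(mean_vector(positive), mean_vector(neutral))]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def is_usable_direction(v: Sequence[float], min_norm: float = 1e-6) -> bool:
    """Assumed validation: the norm must be finite and not vanishingly small."""
    n = l2_norm(v)
    return math.isfinite(n) and n > min_norm


# Toy activations for 3 prompt pairs (invented numbers, not model output).
positive = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
neutral = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = difference_of_means(positive, neutral)
print(direction, l2_norm(direction), is_usable_direction(direction))
```

In this toy data the two sets differ only along the first dimension, so the difference of means isolates that axis and the shared component cancels.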
We enforce Unit Normalization.
- The Problem: Raw activation differences can have arbitrary magnitudes. Adding a vector with Norm=100 to a stream with Norm=10 destroys the signal.
- The Solution: We normalize our steering vector, $\hat{v} = v / \|v\|$, and introduce an explicit STEERING_COEFF (default: 5.0) to control injection strength precisely.
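The Norm=100 versus Norm=10 example above can be worked through numerically. The sketch below uses invented 2-dimensional vectors. Only STEERING_COEFF and its default of 5.0 come from this README; `unit_normalize`, `inject`, and `cosine` are hypothetical helper names.

```python
import math
from typing import List, Sequence

STEERING_COEFF = 5.0  # default stated in this README


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def unit_normalize(v: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Return v / ||v||, refusing to divide by a near-zero norm."""
    n = l2_norm(v)
    if n < eps:
        raise ValueError("steering vector has (near-)zero norm")
    return [x / n for x in v]


def inject(
    resid: Sequence[float], v: Sequence[float], coeff: float = STEERING_COEFF
) -> List[float]:
    """resid + coeff * v_hat, so coeff alone sets the injection strength."""
    v_hat = unit_normalize(v)
    return [r + coeff * d for r, d in zip(resid, v_hat)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


resid = [10.0, 0.0]   # stream with norm 10
raw = [0.0, 100.0]    # raw difference vector with norm 100

naive = [r + d for r, d in zip(resid, raw)]  # raw vector added directly
steered = inject(resid, raw)                 # normalized, then scaled by 5.0

print("naive:  ", naive, "cosine to original:", round(cosine(resid, naive), 3))
print("steered:", steered, "cosine to original:", round(cosine(resid, steered), 3))
```

Adding the raw vector leaves the result almost orthogonal to the original stream, whereas the normalized injection shifts it by exactly STEERING_COEFF units along the feature direction and keeps most of the original signal.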
- Python 3.10+
- uv package manager (recommended for speed) or pip.
- Apple Silicon (MPS) or NVIDIA GPU (CUDA) recommended.
- Clone & Setup:
    git clone https://github.com/your-username/autosteer.git
    cd autosteer
    make setup
The command below runs src/main.py, which loads gpt2-small, calculates a "Pirate" steering vector, and generates steered text:
    make run
Expected Output:
--> [Init] Loading gpt2-small on mps...
--> [Analysis] Extracting vectors from 3 pairs...
--> [Success] Steering vector isolated and normalized.
--> [Gen] Prompt: 'I went to the grocery store and' | Steered: True
Output: I went to the grocery store and... Arr matey! I plundered the snacks!
For a visual interrogation of the model, run our local Marimo notebook. This is where you can "touch" the math.
    uv run marimo edit notebook/deep_dive.py
What you can explore:
- Mechanics: Probe the residual stream magnitude.
- Linearity: Visualize "Pirate" vs "Neutral" clusters using PCA.
- Normalization slider: Real-time adjustment of standard deviation injection.
- Layer Sweep: Automatically find which layer has the strongest feature separation.
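One way a layer sweep could score "feature separation" is sketched below. The metric is an assumption, not necessarily what the notebook computes: the distance between the Positive and Neutral means, divided by the average activation norm at that layer so that layers with larger activations are not favored automatically. The toy activations and function names are hypothetical.

```python
import math
from statistics import fmean
from typing import Dict, List, Sequence, Tuple

Batch = Sequence[Sequence[float]]


def mean_vector(rows: Batch) -> List[float]:
    return [fmean(col) for col in zip(*rows)]


def l2_norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def separation(positive: Batch, neutral: Batch) -> float:
    """Assumed metric: ||mean(pos) - mean(neu)|| / mean activation norm."""
    diff = [p - n for p, n in zip(mean_vector(positive), mean_vector(neutral))]
    scale = fmean([l2_norm(row) for row in list(positive) + list(neutral)])
    return l2_norm(diff) / scale if scale > 0 else 0.0


def layer_sweep(acts: Dict[int, Tuple[Batch, Batch]]) -> Tuple[int, Dict[int, float]]:
    """Score every layer and return the best layer index with all scores."""
    scores = {layer: separation(pos, neu) for layer, (pos, neu) in acts.items()}
    best = max(scores, key=lambda layer: scores[layer])
    return best, scores


# Invented per-layer activations: layer -> (positive batch, neutral batch).
toy_acts = {
    0: ([[1.0, 0.0], [1.0, 0.2]], [[1.0, 0.1], [1.0, 0.1]]),
    1: ([[1.0, 2.0], [1.0, 2.2]], [[1.0, 0.0], [1.0, 0.2]]),
    2: ([[5.0, 1.0], [5.0, 1.2]], [[5.0, 0.0], [5.0, 0.2]]),
}

best_layer, scores = layer_sweep(toy_acts)
print("best layer:", best_layer)
for layer, score in sorted(scores.items()):
    print(layer, round(score, 3))
```

In this toy data layer 2 has the largest activations but layer 1 wins, because the gap between the two means is large relative to the activation scale there.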
autosteer/
├── src/
│ └── main.py # Core Agent (Production Logic)
├── notebook/
│ └── deep_dive.py # Interactive Visual interrogation (Marimo)
├── .context/ # Project history & tracking
│ ├── changelog.md
│ └── summary.md
├── tests/ # Consistency checks
├── Makefile # Automation
├── pyproject.toml # Modern dependency management
└── PRD.md # Product Requirements & Research Goals
- Activation Addition: Turner et al. (2023) - Activation Addition: Steering Language Models Without Optimization.
- Linear Representation: Elhage et al. (2022) - Toy Models of Superposition.
- TransformerLens: Nanda (2022) - The library that makes this possible.
_Built with ❤️ and scientific rigor by Uday Phalak_