Qwen-Scope: Opening the Black Box of LLMs with Sparse Autoencoders
A deep technical analysis of Qwen-Scope: how sparse autoencoders (SAEs) enable interpretability, feature extraction, and inference-time control in large language models.

Background
Behind the powerful capabilities of Large Language Models (LLMs) lies a massive "black box" — we can see inputs and outputs, but struggle to understand how the model internally represents and processes information. Qwen-Scope is a tool released by the Qwen team that integrates Sparse Autoencoders (SAEs) into the model, allowing us to "see" internal feature activations and even directly steer the model's inference behavior.
In this analysis, we used QevosAgent to perform a comprehensive technical study of Qwen-Scope, diving into its architecture design, feature extraction mechanisms, and feature steering principles.
What is Qwen-Scope?
Qwen-Scope is a Gradio-based web application with three core functionalities:
- Feature Analysis (Analyze): Input text and see which internal features are activated
- Feature Comparison (Compare): Compare feature differences between two texts to find discriminative features
- Feature Steering (Steer): Control generation behavior by modifying hidden states
Its core philosophy: Transform SAE from a "post-hoc inspection tool" into a "practical interface for building and fixing language models".
Technical Architecture
Overall Architecture
┌─────────────────────────────────────────────────────┐
│ Gradio Web UI │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Analyze │ │ Compare │ │ Steer │ │
│ └─────┬────┘ └─────┬────┘ └─────┬────┘ │
│ │ │ │ │
│ ┌─────┴──────────────┴──────────────┴──────────┐ │
│ │ Core Engine │ │
│ │ ┌─────────────┐ ┌──────────────────────┐ │ │
│ │ │ SAE Loader │ │ Feature Calculator │ │ │
│ │ │ (LRU Cache) │ │ (compute_sae_features)│ │ │
│ │ └──────┬──────┘ └──────────┬───────────┘ │ │
│ │ │ │ │ │
│ │ ┌──────┴────────────────────┴──────────────┐ │ │
│ │ │ Visualization Layer │ │ │
│ │ │ (Heatmaps/Probability Distributions) │ │ │
│ │ └──────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
Key Technical Parameters
| Parameter | Default | Description |
|---|---|---|
| Base Model | Qwen/Qwen3.5-2B | Target language model |
| SAE Width | 32,768 | Dictionary size (number of features) |
| Model Dim | 2,048 | Hidden layer dimension |
| Top-K | 100 | Top K features to display |
Core Mechanism: How Are Features Extracted?
1. Hidden State Capture
Qwen-Scope uses PyTorch's forward-hook mechanism to capture the hidden states output by a specified Transformer layer:

```python
import torch

def capture_hidden(model, input_ids, layer):
    """Capture hidden states at the specified layer."""
    hidden_state = None

    def hook(module, inp, out):
        nonlocal hidden_state
        hidden_state = out[0]  # [batch, seq_len, d_model]

    hook_handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(input_ids)
    hook_handle.remove()  # always detach the hook to avoid leaking it into later runs
    return hidden_state
```
2. SAE Feature Encoding
After capturing hidden states, encode them through the SAE:
```python
def compute_sae_features(hidden, sae, top_k=100):
    """Encode hidden states into sparse SAE feature activations."""
    x = hidden - sae["b_enc"]              # subtract encoder bias
    pre_acts = x @ sae["W_enc"]            # linear encoding -> [seq_len, sae_width]
    features = topk_relu(pre_acts, top_k)  # Top-K ReLU sparsification
    return features
```
Key Innovation: Top-K ReLU sparsification ensures only the K strongest activated features are retained at each position, with others zeroed out, making feature representations clearer and more interpretable.
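As a rough illustration of the Top-K ReLU step (not the project's actual code — the function name `topk_relu` and the tie-breaking behavior are assumptions), a NumPy sketch:

```python
import numpy as np

def topk_relu(pre_acts, k):
    """Keep the k largest positive pre-activations per row; zero everything else."""
    acts = np.maximum(pre_acts, 0.0)  # ReLU: negative pre-activations vanish
    # indices of the k largest values in each row
    idx = np.argpartition(acts, -k, axis=-1)[..., -k:]
    mask = np.zeros_like(acts, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, acts, 0.0)

feats = topk_relu(np.array([[3.0, -1.0, 2.0, 0.5]]), k=2)
# only the two strongest positive activations survive
```

Everything outside the top K is zeroed per token position, which is what keeps the feature vectors sparse enough to read.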
3. Heatmap Visualization
Feature activations are rendered as HTML heatmaps:
- Rows = Features (Top-K features sorted by mean activation)
- Columns = Token positions
- Color = Activation intensity (white → red gradient)
This allows developers to intuitively see which tokens activate which features.
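A minimal sketch of that rendering idea (the real app's HTML/CSS details are assumptions here): rows are features, columns are tokens, and each cell's red opacity tracks activation strength.

```python
import numpy as np

def activation_heatmap(acts, tokens, feature_ids):
    """Render an [n_features x n_tokens] activation matrix as an HTML table,
    colouring each cell on a white -> red gradient by activation strength."""
    peak = max(float(acts.max()), 1e-9)  # avoid division by zero on all-zero input
    header = "".join(f"<th>{t}</th>" for t in tokens)
    rows = []
    for fid, row in zip(feature_ids, acts):
        cells = "".join(
            f'<td style="background:rgba(255,0,0,{v / peak:.2f})">{v:.1f}</td>'
            for v in row
        )
        rows.append(f"<tr><th>feat {fid}</th>{cells}</tr>")
    return f"<table><tr><th></th>{header}</tr>{''.join(rows)}</table>"

html = activation_heatmap(np.array([[0.0, 2.0], [1.0, 0.5]]),
                          tokens=["Hello", "world"],
                          feature_ids=[17, 42])
```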
Core Mechanism: How Are Features Controlled?
Steering Principle
Qwen-Scope's most powerful feature is Steering — controlling generation behavior by modifying model hidden states:
Original: Input → Transformer → Hidden State → LM Head → Output
Steering: Input → Transformer → Hidden State → [+Feature Injection] → LM Head → Output
Technical Implementation
```python
def _steer_hook(module, inp, out):
    """Inject feature activations into the layer output."""
    # steered_positions, target_features, strength, and sae are captured
    # from the enclosing scope when the hook is registered
    hidden = out[0]  # [batch, seq_len, d_model]
    for pos in steered_positions:
        for feat_idx in target_features:
            # each decoder row is the feature's direction in hidden-state space
            direction = sae["W_dec"][feat_idx]
            # inject into the hidden state, scaled by the steering strength
            hidden[:, pos, :] += strength * direction
    return (hidden,) + out[1:]
```
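Stripped of the hook machinery, the injection itself is just adding a scaled decoder row to one position's hidden vector. A toy NumPy version (shapes and names are illustrative, not the project's):

```python
import numpy as np

d_model, sae_width = 4, 8
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(sae_width, d_model))  # one direction per feature
hidden = np.zeros((1, 3, d_model))             # [batch, seq_len, d_model]

feat_idx, pos, strength = 5, 2, 1.5
hidden[:, pos, :] += strength * W_dec[feat_idx]

# only the steered position moves, and it moves along the feature's direction
```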
Steering Strength Modes
| Mode | Strength | Use Case |
|---|---|---|
| Light | 25% | Fine-tune generation style |
| Medium | 50% | Noticeably change output |
| Strong | 100% | Force specific features |
| Custom | User-defined | Precise control |
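One plausible reading of these modes, assuming the percentages scale some base injection magnitude (the `BASE_SCALE` value and function name below are made up for illustration):

```python
BASE_SCALE = 8.0  # hypothetical base injection magnitude

STRENGTH_MODES = {"light": 0.25, "medium": 0.50, "strong": 1.00}

def injection_strength(mode, custom=None):
    """Resolve a steering mode to an absolute injection strength."""
    if mode == "custom":
        return custom
    return STRENGTH_MODES[mode] * BASE_SCALE
```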
Application Scenarios
1. Steerable Inference Control
Steering lets you control model generation behavior at inference time, without modifying model weights:
- Enhance "creativity" features for more creative content
- Suppress "repetition" features to reduce repetitive generation
- Guide the model to focus on specific topics
2. Evaluation Sample Distribution Analysis
Analyze feature activation distributions across different sample categories to understand how the model distinguishes between tasks.
3. Data Classification and Synthesis
Use feature activation patterns to classify data or synthesize new samples with specific feature activations.
4. Model Training and Optimization
Identified problematic features can be used for targeted fine-tuning to solve issues like repetitive generation or hallucinations.
Summary
Qwen-Scope's core innovations:
- Complete SAE Toolchain: Feature extraction → visualization → control, forming a closed loop
- Efficient Engineering: LRU caching, pre-transposed weights, Top-K ReLU sparsification
- Intuitive Visualization: HTML heatmaps + interactive probability panels
- Practical Control: Feature steering through hidden state modification via Hooks
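The "LRU caching" point, for instance, can be sketched with `functools.lru_cache`, so repeated layer selections do not re-read SAE weights from disk (the loader name, signature, and stubbed return value are assumptions):

```python
from functools import lru_cache

@lru_cache(maxsize=4)  # keep the last few layers' SAEs resident in memory
def load_sae(layer: int):
    """Hypothetical loader: read the SAE weights for one layer (stubbed here)."""
    # in the real app this would load W_enc / W_dec / b_enc from disk
    return {"layer": layer, "width": 32768}

a = load_sae(10)
b = load_sae(10)  # cache hit: same object, no reload
```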
Qwen-Scope transforms interpretability from a "research toy" into an "engineering tool", enabling developers to:
- Understand how models internally represent different concepts
- Control model behavior during inference
- Optimize model performance during training
💡 Interactive Demo: Want to explore how semantics map to feature activations? Check out our interactive demo page to see how large language models internally represent and process information.
This article is based on QevosAgent's deep technical analysis of the Qwen-Scope project. Full analysis code and results are available at Qwen-Scope GitHub.