
Qwen-Scope: Opening the Black Box of LLMs with Sparse Autoencoders

A deep technical analysis of Qwen-Scope: how sparse autoencoders (SAEs) enable interpretability, feature extraction, and inference control in large language models.

[Figure: Qwen-Scope architecture]

Background

Behind the powerful capabilities of Large Language Models (LLMs) lies a massive "black box" — we can see inputs and outputs, but struggle to understand how the model internally represents and processes information. Qwen-Scope is a breakthrough tool released by the Qwen team that integrates Sparse Autoencoders (SAEs) into the model, allowing us to "see" internal feature activations and even directly control the model's inference behavior.

In this analysis, we used QevosAgent to perform a comprehensive technical study of Qwen-Scope, diving into its architecture design, feature extraction mechanisms, and feature steering principles.

What is Qwen-Scope?

Qwen-Scope is a Gradio-based web application with three core functionalities:

  1. Feature Analysis (Analyze): Input text and see which internal features are activated
  2. Feature Comparison (Compare): Compare feature differences between two texts to find discriminative features
  3. Feature Steering (Steer): Control generation behavior by modifying hidden states

Its core philosophy: Transform SAE from a "post-hoc inspection tool" into a "practical interface for building and fixing language models".

Technical Architecture

Overall Architecture

┌──────────────────────────────────────────────────────┐
│                    Gradio Web UI                     │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐          │
│   │ Analyze │    │ Compare │    │  Steer  │          │
│   └────┬────┘    └────┬────┘    └────┬────┘          │
│        │              │              │               │
│  ┌─────┴──────────────┴──────────────┴────────────┐  │
│  │                  Core Engine                   │  │
│  │  ┌──────────────┐  ┌────────────────────────┐  │  │
│  │  │  SAE Loader  │  │   Feature Calculator   │  │  │
│  │  │  (LRU Cache) │  │ (compute_sae_features) │  │  │
│  │  └──────┬───────┘  └───────────┬────────────┘  │  │
│  │         │                      │               │  │
│  │  ┌──────┴──────────────────────┴────────────┐  │  │
│  │  │           Visualization Layer            │  │  │
│  │  │  (Heatmaps / Probability Distributions)  │  │  │
│  │  └───────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
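
The "SAE Loader (LRU Cache)" box suggests per-layer SAE weights are loaded lazily and cached, and the summary below also mentions pre-transposed weights. A minimal sketch of such a loader, assuming a hypothetical load_sae_from_disk helper (the cache size is also an assumption, not the project's actual API):

from functools import lru_cache

@lru_cache(maxsize=4)  # keep a handful of per-layer SAEs resident
def load_sae(layer: int):
    """Load and cache the SAE weight dict for one layer."""
    sae = load_sae_from_disk(layer)  # hypothetical I/O helper
    # Pre-transpose once so the hot path can compute x @ W_enc directly.
    sae["W_enc"] = sae["W_enc"].T.contiguous()
    return sae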

Key Technical Parameters

Parameter  | Default         | Description
-----------|-----------------|--------------------------------------
Base Model | Qwen/Qwen3.5-2B | Target language model
SAE Width  | 32,768          | Dictionary size (number of features)
Model Dim  | 2,048           | Hidden-layer dimension
Top-K      | 100             | Number of top features to display
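
Given these defaults, the shapes of the SAE parameters used by the snippets below can be inferred (the sae dict layout is an inference from the code, not a confirmed API):

import torch

D_MODEL, SAE_WIDTH = 2048, 32768  # defaults from the table above

# Illustrative SAE parameter dict matching the later snippets.
sae = {
    "W_enc": torch.randn(D_MODEL, SAE_WIDTH),  # hidden state -> feature space
    "W_dec": torch.randn(SAE_WIDTH, D_MODEL),  # feature space -> hidden state
    "b_enc": torch.randn(D_MODEL),             # bias subtracted before encoding
}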

Core Mechanism: How Are Features Extracted?

1. Hidden State Capture

Qwen-Scope uses PyTorch's forward-hook mechanism to capture the hidden states at the output of a specified Transformer layer:

import torch

def capture_hidden(model, input_ids, layer):
    """Capture hidden states at the output of the specified layer."""
    hidden_state = None

    def hook(module, inp, out):
        nonlocal hidden_state
        hidden_state = out[0]  # [batch, seq_len, d_model]

    # Register a forward hook on the target decoder layer, run one
    # forward pass without gradients, then detach the hook.
    hook_handle = model.model.layers[layer].register_forward_hook(hook)
    with torch.no_grad():
        model(input_ids)
    hook_handle.remove()
    return hidden_state
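
A possible invocation, assuming a Hugging Face model and tokenizer have already been loaded (the layer index and prompt are illustrative):

inputs = tokenizer("The quick brown fox", return_tensors="pt")
hidden = capture_hidden(model, inputs["input_ids"], layer=12)
print(hidden.shape)  # torch.Size([1, seq_len, 2048])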

2. SAE Feature Encoding

After capturing hidden states, encode them through the SAE:

def compute_sae_features(hidden, sae, top_k=100):
    """Encode hidden states into sparse SAE feature activations."""
    x = hidden - sae["b_enc"]              # subtract the encoder bias
    pre_acts = x @ sae["W_enc"]            # linear encoding [seq_len, sae_width]
    features = topk_relu(pre_acts, top_k)  # Top-K ReLU sparsification (see below)
    return features
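
topk_relu is not defined in the snippet; a minimal sketch of what it plausibly does, zeroing everything except the K largest post-ReLU activations at each position:

import torch

def topk_relu(pre_acts, k):
    """Keep the k largest ReLU activations per position; zero the rest."""
    acts = torch.relu(pre_acts)
    vals, idx = acts.topk(k, dim=-1)
    return torch.zeros_like(acts).scatter_(-1, idx, vals)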

Key Innovation: Top-K ReLU sparsification ensures that only the K most strongly activated features are retained at each position, with all others zeroed out, making feature representations clearer and more interpretable.

3. Heatmap Visualization

Feature activations are rendered as HTML heatmaps, with each token colored by its activation strength.
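
The rendering code itself is not reproduced here; a minimal sketch of the idea, tinting each token by its normalized activation for a single feature (render_heatmap_html is a hypothetical name):

import html

def render_heatmap_html(tokens, activations):
    """Wrap each token in a <span> tinted by its normalized activation."""
    max_act = max(activations) or 1.0  # avoid division by zero
    spans = []
    for tok, act in zip(tokens, activations):
        alpha = act / max_act  # normalize to [0, 1]
        spans.append(
            f'<span style="background: rgba(255, 99, 71, {alpha:.2f})">'
            f"{html.escape(tok)}</span>"
        )
    return "".join(spans)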

This allows developers to intuitively see which tokens activate which features.

Core Mechanism: How Are Features Controlled?

Steering Principle

Qwen-Scope's most powerful capability is Steering, which controls generation behavior by modifying the model's hidden states:

Original: Input → Transformer → Hidden State → LM Head → Output

Steering: Input → Transformer → Hidden State → [+Feature Injection] → LM Head → Output

Technical Implementation

def _steer_hook(module, inp, out):
    """Inject feature directions into the layer output.

    steered_positions, target_features, sae, and strength are
    captured from the enclosing scope when the hook is created.
    """
    hidden = out[0]  # [batch, seq_len, d_model]

    for pos in steered_positions:
        for feat_idx in target_features:
            # A feature's decoder row is its direction in hidden space.
            direction = sae["W_dec"][feat_idx]
            # Add that direction, scaled by the steering strength.
            hidden[:, pos, :] += strength * direction

    return (hidden,) + out[1:]
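
A sketch of how such a hook might be attached for a single generation call (the layer index and generation arguments are assumptions):

handle = model.model.layers[layer].register_forward_hook(_steer_hook)
try:
    output_ids = model.generate(input_ids, max_new_tokens=50)
finally:
    handle.remove()  # always detach, so later calls run unsteered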

Steering Strength Modes

Mode   | Strength     | Use Case
-------|--------------|----------------------------
Light  | 25%          | Fine-tune generation style
Medium | 50%          | Noticeably change output
Strong | 100%         | Force specific features
Custom | User-defined | Precise control
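
The table leaves the reference point for these percentages unspecified; this sketch assumes they scale a per-feature reference activation (STEER_MODES and max_activation are assumptions):

# Hypothetical mapping from the named modes above to scaling factors.
STEER_MODES = {"light": 0.25, "medium": 0.50, "strong": 1.00}

def resolve_strength(mode, max_activation, custom=None):
    """Turn a named mode (or a custom value) into an injection strength."""
    if mode == "custom":
        return custom
    return STEER_MODES[mode] * max_activation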

Application Scenarios

1. Steerable Inference Control

Through Steering, you can control model generation behavior in real time without modifying model weights.

2. Evaluation Sample Distribution Analysis

Analyze feature activation distributions across different sample categories to understand how the model distinguishes between tasks.

3. Data Classification and Synthesis

Use feature activation patterns to classify data or synthesize new samples with specific feature activations.

4. Model Training and Optimization

Identified problematic features can be used for targeted fine-tuning to solve issues like repetitive generation or hallucinations.

Summary

Qwen-Scope's core innovations:

  1. Complete SAE Toolchain: Feature extraction → visualization → control, forming a closed loop
  2. Efficient Engineering: LRU caching, pre-transposed weights, Top-K ReLU sparsification
  3. Intuitive Visualization: HTML heatmaps + interactive probability panels
  4. Practical Control: Feature steering through hidden state modification via Hooks

Qwen-Scope transforms interpretability from a "research toy" into an "engineering tool", enabling developers to inspect which features a text activates, compare feature profiles across inputs, and steer generation without retraining.


💡 Interactive Demo: Want to explore how semantics map to feature activations? Check out our interactive demo page to see how large language models internally represent and process information.


This article is based on QevosAgent's deep technical analysis of the Qwen-Scope project. Full analysis code and results are available at Qwen-Scope GitHub.