πŸ”¬ How an SAE Feature Index Acquires Semantics

Core Question

The feature index (0-32767) produced by an SAE (Sparse Autoencoder) is just a number.
How do we know what semantics it represents?

Answer: SAE features have no predefined semantic labels!

Semantics are not "looked up", but "discovered" through experiments.

πŸ“Š Four Methods to Discover Semantics

| # | Method | How It Works | Characteristic |
|---|--------|--------------|----------------|
| 1 | Activation Pattern Analysis (Top Activations) | Observe which tokens activate the feature to infer its semantics | Most common |
| 2 | Feature Comparison | Compare activation differences between two texts to find significantly different features | Comparative analysis |
| 3 | Steering Intervention (Causal Intervention) | Enhance/suppress a feature and observe changes in model output | Strongest verification |
| 4 | Automated Naming | Use an LLM to summarize Top Activations and automatically name features | Domain-general |

πŸ”₯ Method 1: Activation Pattern Analysis (Heatmap)

Principle

Each SAE feature activates on specific inputs. By observing which tokens activate it most strongly, we can infer its semantics.

Code Implementation

```python
# qwen_scope_app.py, lines 295-413
def feature_heatmap_to_html(tokens, features, top_k, skip_first):
    """Heatmap: rows = features, columns = tokens, color = activation strength."""
    seq_len, sae_width = features.shape

    # Average activation of each feature across all token positions
    mean_activation = features.mean(dim=0)              # [sae_width]
    topk_indices = mean_activation.topk(top_k).indices  # [top_k]

    # For each Top-K feature, show its activation at every token position
    for feat_idx in topk_indices:
        row_activations = features[:, feat_idx]         # [seq_len]
        # Color mapping: whiteβ†’red, normalized by the row maximum
        ...
```
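
For context, the `features` tensor here comes from the SAE encoder applied to one layer's hidden states. The app's actual encoder is `compute_sae_features` (lines 252-265, not reproduced in this document); the sketch below is a hypothetical stand-in showing the standard SAE encoding step, using a plain ReLU where the real SAE may use a JumpReLU or TopK nonlinearity:

```python
import torch

def compute_sae_features_sketch(hidden: torch.Tensor, sae: dict) -> torch.Tensor:
    # hidden: [seq_len, d_model] residual-stream activations from one layer
    # sae: dict with 'W_enc' [d_model, sae_width] and 'b_enc' [sae_width]
    pre = hidden @ sae['W_enc'] + sae['b_enc']  # project into the SAE basis
    return torch.relu(pre)  # [seq_len, sae_width]; sparse and non-negative
```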

Example: Activation heatmap of feature #1234

| Token | The | capital | of | China | is | Beijing |
|---------------|-----|---------|-----|-------|-----|---------|
| Feature #1234 | 0.1 | 0.3 | 0.2 | 0.9 | 0.1 | 0.85 |

β†’ This feature activates strongest on "China" and "Beijing" β†’ Inference: it may represent "China-related concepts"

βš–οΈ Method 2: Feature Comparison

Principle

Compare SAE activation differences between two texts to find features that activate strongly in one text but weakly in the other.

Example

Text 1: "I like programming, writing code is fun"
Text 2: "I like swimming, exercise is healthy"

| Feature Index | Text 1 Activation | Text 2 Activation | Difference | Inferred Semantics |
|---------------|-------------------|-------------------|------------|--------------------|
| #5678 | 0.85 | 0.12 | +0.73 | May be related to "programming" |
| #9012 | 0.10 | 0.78 | -0.68 | May be related to "sports" |
| #3456 | 0.65 | 0.62 | +0.03 | General concept (e.g., "like") |
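
Qwen-Scope implements this in `cb_compare` (see the code index below). As a hypothetical illustration of the same arithmetic, the helper `compare_features` below ranks features by mean-activation difference between two `[seq_len, sae_width]` tensors such as those returned by the SAE encoder:

```python
import torch

def compare_features(features_a: torch.Tensor, features_b: torch.Tensor, top_k: int = 10):
    # Mean activation per feature, averaged over token positions
    mean_a = features_a.mean(dim=0)  # [sae_width]
    mean_b = features_b.mean(dim=0)  # [sae_width]
    diff = mean_a - mean_b           # positive => stronger in text 1

    # Rank features by absolute difference, as in the table above
    top = diff.abs().topk(top_k).indices
    for idx in top.tolist():
        print(f"#{idx}  text1={mean_a[idx]:.2f}  text2={mean_b[idx]:.2f}  diff={diff[idx]:+.2f}")
    return top
```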

🎯 Method 3: Steering Intervention (Causal Verification)

Principle

This is the strongest semantic verification method: artificially enhance or suppress a feature's activation and observe how the model's output changes.

Mathematical Essence

```python
# SAE decoder weights W_dec: [sae_width, d_model]
# W_dec[feat_idx] is this feature's "semantic direction" in the hidden layer
#
# Steering operation:
#   hidden_new = hidden_old + strength * W_dec[feat_idx]
#
# This is equivalent to moving along this feature's direction
# in the hidden layer's "semantic space"
```

Code Implementation

```python
# qwen_scope_app.py, _steer_hook function
def _steer_hook(module, inp, out):
    hidden = out[0].clone()
    # Key operation: add this feature's SAE decoder direction to the hidden state
    hidden[:, pos, :] += strength * sae['W_dec'][feat_idx]
    return (hidden,) + out[1:]
```
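
For the hook to take effect it must be registered on the target decoder layer. Qwen-Scope wires this up internally; the snippet below is a hypothetical sketch of typical wiring, assuming `model` follows the usual Hugging Face Qwen layout (`model.model.layers`) and that `layer_idx`, `strength`, `feat_idx`, `sae`, and `inputs` are already defined:

```python
# Attach the steering hook, generate, then always detach it
layer = model.model.layers[layer_idx]
handle = layer.register_forward_hook(_steer_hook)
try:
    output_ids = model.generate(**inputs, max_new_tokens=20)
finally:
    handle.remove()  # avoid steering every later forward pass
```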

Example

| | Prompt | Output |
|----------|----------------------------|------------------------------------------|
| Original | "The capital of France is" | "Paris" |
| After enhancing feature #1234 | "The capital of France is" | "Beijing" or other China-related content |

β†’ Confirms feature #1234 is related to "China" semantics

πŸ€– Method 4: Automated Naming

⚠️ Qwen-Scope itself does not implement this, but it is a common method in SAE interpretability research.

Standard Process

β‘  Run the model on a large corpus and collect Top Activations for each feature
↓
β‘‘ For each feature, collect the 100-1000 tokens/sentences with strongest activations
↓
β‘’ Use an LLM to name it automatically: "This feature activates strongest on the following texts: [list], please give it a semantic name"
↓
β‘£ Manually review naming results

Example

Top Activations of feature #5678: "dog" (0.92), "cat" (0.89), "pet" (0.85), "animal" (0.82), "fish" (0.78)...
LLM-proposed name: "Pet/animal-related concepts"
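
As a hypothetical sketch of steps β‘’-β‘£, the helper `name_feature` below builds the naming prompt; `llm` stands for any callable that maps a prompt string to a completion string:

```python
def name_feature(feat_idx: int, top_activations: list[tuple[str, float]], llm) -> str:
    # top_activations: [(text, activation), ...] sorted strongest-first (step β‘‘)
    listing = "\n".join(f'"{text}" ({act:.2f})' for text, act in top_activations)
    prompt = (
        f"This feature (#{feat_idx}) activates strongest on the following texts:\n"
        f"{listing}\n"
        "Please give it a short semantic name."
    )
    return llm(prompt)  # step β‘£: a human should still review the proposed name

# Example with the feature above:
# name_feature(5678, [("dog", 0.92), ("cat", 0.89), ("pet", 0.85)], llm)
# -> 'Pet/animal-related concepts'
```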

πŸ“‹ Complete Workflow

Step 1: Select feature index (e.g., #12345)
↓
Step 2: Activation pattern analysis β†’ What text triggers it?
↓
Step 3: Feature comparison β†’ In which scenarios does it activate more strongly?
↓
Step 4: Steering intervention β†’ Enhance/suppress it; how does model behavior change?
↓
Step 5: Semantic naming β†’ Describe the feature's semantics in natural language

πŸ“ Key Code Location Index

| Purpose | Function | Lines | Description |
|---------|----------|-------|-------------|
| SAE feature calculation | compute_sae_features | 252-265 | Hidden state β†’ SAE features |
| Heatmap generation | feature_heatmap_to_html | 295-413 | FeatureΓ—Token activation heatmap |
| Feature analysis callback | cb_analyze | 506-517 | Entry point: analyze a text's feature activations |
| Steering strength | _steering_strength_from_mode | 521-545 | Map Light/Medium/Strong to a numeric strength |
| Steering hook | _steer_hook | ~820 | Modify the hidden state to inject a feature |
| Feature comparison | cb_compare | 1082+ | Compare feature differences between two texts |
| Comparison rendering | compare_to_html | 859+ | Render the difference-feature table |

πŸ’‘ Analogy

The correspondence from SAE feature index to semantics is like discovering new elements in chemistry: you can't look up the properties of element 118 in a dictionary; you have to discover them through experiments.

Feature index #12345 itself is just a number; its semantics are defined by what inputs it activates on and what impact it has on model output.

Based on Qwen-Scope source code analysis | SAE feature interpretation mechanism research