πŸ”¬ How an SAE Feature Index Acquires Semantics

Core Question

The feature index (0-32767) produced by an SAE (Sparse Autoencoder) is just a number.
How do we know what semantics it represents?

Answer: SAE features have no predefined semantic labels!

Semantics are not "looked up", but "discovered" through experiments.

πŸ“Š Four Methods to Discover Semantics

| # | Method | How It Works | Characteristic |
|---|--------|--------------|----------------|
| 1 | Activation Pattern Analysis (Top Activations) | Observe which tokens activate the feature to infer its semantics | Most common |
| 2 | Feature Comparison | Compare activation differences between two texts to find significantly different features | Comparative analysis |
| 3 | Steering Intervention (Causal Intervention) | Enhance/suppress a feature and observe changes in model output | Strongest verification |
| 4 | Automated Naming | Use an LLM to summarize Top Activations and automatically name features | Domain-general |

πŸ”₯ Method 1: Activation Pattern Analysis (Heatmap)

Principle

Each SAE feature activates on specific inputs. By observing which tokens activate it most strongly, we can infer its semantics.

Code Implementation

```python
# qwen_scope_app.py, lines 295-413
def feature_heatmap_to_html(tokens, features, top_k, skip_first):
    """Heatmap: rows = features, columns = tokens, color = activation strength."""
    seq_len, sae_width = features.shape

    # Average activation of each feature across all token positions
    mean_activation = features.mean(dim=0)              # [sae_width]
    topk_indices = mean_activation.topk(top_k).indices  # [top_k]

    # For each Top-K feature, show its activation at every token position
    for feat_idx in topk_indices:
        row_activations = features[:, feat_idx]         # [seq_len]
        # Color mapping: whiteβ†’red, normalized by the row maximum
        ...
```
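
For context, the `features` tensor here comes from the SAE encoder applied to one layer's hidden states. The app's actual encoder is `compute_sae_features` (lines 252-265, not reproduced in this document); the sketch below is a hypothetical stand-in showing the standard SAE encoding step, using a plain ReLU where the real SAE may use a JumpReLU or TopK nonlinearity:

```python
import torch

def compute_sae_features_sketch(hidden: torch.Tensor, sae: dict) -> torch.Tensor:
    # hidden: [seq_len, d_model] residual-stream activations from one layer
    # sae: dict with 'W_enc' [d_model, sae_width] and 'b_enc' [sae_width]
    pre = hidden @ sae['W_enc'] + sae['b_enc']  # project into the SAE basis
    return torch.relu(pre)  # [seq_len, sae_width]; sparse and non-negative
```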

Example: Activation heatmap of feature #1234

| Token | The | capital | of | China | is | Beijing |
|---------------|-----|---------|-----|-------|-----|---------|
| Feature #1234 | 0.1 | 0.3 | 0.2 | 0.9 | 0.1 | 0.85 |

β†’ This feature activates strongest on "China" and "Beijing" β†’ Inference: it may represent "China-related concepts"

βš–οΈ Method 2: Feature Comparison

Principle

Compare SAE activation differences between two texts to find features that activate strongly in one text but weakly in the other.

Example

Text 1: "I like programming, writing code is fun"
Text 2: "I like swimming, exercise is healthy"

| Feature Index | Text 1 Activation | Text 2 Activation | Difference | Inferred Semantics |
|---------------|-------------------|-------------------|------------|--------------------|
| #5678 | 0.85 | 0.12 | +0.73 | May be related to "programming" |
| #9012 | 0.10 | 0.78 | -0.68 | May be related to "sports" |
| #3456 | 0.65 | 0.62 | +0.03 | General concept (e.g., "like") |
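
Qwen-Scope implements this in `cb_compare` (see the code index below). As a hypothetical illustration of the same arithmetic, the helper `compare_features` below ranks features by mean-activation difference between two `[seq_len, sae_width]` tensors such as those returned by the SAE encoder:

```python
import torch

def compare_features(features_a: torch.Tensor, features_b: torch.Tensor, top_k: int = 10):
    # Mean activation per feature, averaged over token positions
    mean_a = features_a.mean(dim=0)  # [sae_width]
    mean_b = features_b.mean(dim=0)  # [sae_width]
    diff = mean_a - mean_b           # positive => stronger in text 1

    # Rank features by absolute difference, as in the table above
    top = diff.abs().topk(top_k).indices
    for idx in top.tolist():
        print(f"#{idx}  text1={mean_a[idx]:.2f}  text2={mean_b[idx]:.2f}  diff={diff[idx]:+.2f}")
    return top
```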

🎯 Method 3: Steering Intervention (Causal Verification)

Principle

This is the strongest semantic verification method: artificially enhance or suppress a feature's activation and observe how the model's output changes.

Mathematical Essence

```python
# SAE decoder weights W_dec: [sae_width, d_model]
# W_dec[feat_idx] is this feature's "semantic direction" in the hidden layer
#
# Steering operation:
#   hidden_new = hidden_old + strength * W_dec[feat_idx]
#
# This is equivalent to moving along this feature's direction
# in the hidden layer's "semantic space"
```

Code Implementation

```python
# qwen_scope_app.py, _steer_hook function
def _steer_hook(module, inp, out):
    hidden = out[0].clone()
    # Key operation: add this feature's SAE decoder direction to the hidden state
    hidden[:, pos, :] += strength * sae['W_dec'][feat_idx]
    return (hidden,) + out[1:]
```
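
For the hook to take effect it must be registered on the target decoder layer. Qwen-Scope wires this up internally; the snippet below is a hypothetical sketch of typical wiring, assuming `model` follows the usual Hugging Face Qwen layout (`model.model.layers`) and that `layer_idx`, `strength`, `feat_idx`, `sae`, and `inputs` are already defined:

```python
# Attach the steering hook, generate, then always detach it
layer = model.model.layers[layer_idx]
handle = layer.register_forward_hook(_steer_hook)
try:
    output_ids = model.generate(**inputs, max_new_tokens=20)
finally:
    handle.remove()  # avoid steering every later forward pass
```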

Example

| | Prompt | Output |
|----------|----------------------------|------------------------------------------|
| Original | "The capital of France is" | "Paris" |
| After enhancing feature #1234 | "The capital of France is" | "Beijing" or other China-related content |

β†’ Confirms feature #1234 is related to "China" semantics

πŸ€– Method 4: Automated Naming

⚠️ Qwen-Scope itself does not implement this, but it is a common method in SAE interpretability research.

Standard Process

β‘  Run the model on a large corpus and collect Top Activations for each feature
↓
β‘‘ For each feature, collect the 100-1000 tokens/sentences with strongest activations
↓
β‘’ Use an LLM to name it automatically: "This feature activates strongest on the following texts: [list], please give it a semantic name"
↓
β‘£ Manually review naming results

Example

Top Activations of feature #5678: "dog" (0.92), "cat" (0.89), "pet" (0.85), "animal" (0.82), "fish" (0.78)...
LLM-proposed name: "Pet/animal-related concepts"
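
As a hypothetical sketch of steps β‘’-β‘£, the helper `name_feature` below builds the naming prompt; `llm` stands for any callable that maps a prompt string to a completion string:

```python
def name_feature(feat_idx: int, top_activations: list[tuple[str, float]], llm) -> str:
    # top_activations: [(text, activation), ...] sorted strongest-first (step β‘‘)
    listing = "\n".join(f'"{text}" ({act:.2f})' for text, act in top_activations)
    prompt = (
        f"This feature (#{feat_idx}) activates strongest on the following texts:\n"
        f"{listing}\n"
        "Please give it a short semantic name."
    )
    return llm(prompt)  # step β‘£: a human should still review the proposed name

# Example with the feature above:
# name_feature(5678, [("dog", 0.92), ("cat", 0.89), ("pet", 0.85)], llm)
# -> 'Pet/animal-related concepts'
```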

πŸ“‹ Complete Workflow

Step 1: Select feature index (e.g., #12345)
↓
Step 2: Activation pattern analysis β†’ What text triggers it?
↓
Step 3: Feature comparison β†’ In which scenarios does it activate more strongly?
↓
Step 4: Steering intervention β†’ Enhance/suppress it; how does model behavior change?
↓
Step 5: Semantic naming β†’ Describe the feature's semantics in natural language

πŸ“ Key Code Location Index

| Purpose | Function | Lines | Description |
|---------|----------|-------|-------------|
| SAE feature calculation | compute_sae_features | 252-265 | Hidden state β†’ SAE features |
| Heatmap generation | feature_heatmap_to_html | 295-413 | FeatureΓ—Token activation heatmap |
| Feature analysis callback | cb_analyze | 506-517 | Entry point: analyze a text's feature activations |
| Steering strength | _steering_strength_from_mode | 521-545 | Map Light/Medium/Strong to a numeric strength |
| Steering hook | _steer_hook | ~820 | Modify the hidden state to inject a feature |
| Feature comparison | cb_compare | 1082+ | Compare feature differences between two texts |
| Comparison rendering | compare_to_html | 859+ | Render the difference-feature table |

πŸ’‘ Analogy

The correspondence from SAE feature index to semantics is like discovering new elements in chemistry: you can't look up the properties of element 118 in a dictionary; you have to discover them through experiments.

Feature index #12345 itself is just a number; its semantics are defined by what inputs it activates on and what impact it has on model output.

Based on Qwen-Scope source code analysis | SAE feature interpretation mechanism research