🔬 Mechanism of SAE Feature Index to Semantics
Core Question
The feature index (0-32767) produced by SAE (Sparse Autoencoder) is just a number.
How do we know what semantics it represents?
Answer: SAE features have no predefined semantic labels!
Semantics are not "looked up", but "discovered" through experiments.
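For context, a feature index is just one column of the SAE encoder's output. A minimal pure-Python sketch of the standard encoding step, features = ReLU(hidden @ W_enc + b_enc) (the names W_enc/b_enc follow common SAE convention; Qwen-Scope's compute_sae_features plays this role, though its exact signature may differ):

```python
def sae_encode(hidden, W_enc, b_enc):
    """Standard SAE encoder: features = ReLU(hidden @ W_enc + b_enc).

    hidden: list[float] of length d_model
    W_enc:  d_model x sae_width matrix (list of rows)
    b_enc:  list[float] of length sae_width
    Feature index i is simply column i of this output: a number with no
    built-in meaning until we observe when it fires.
    """
    width = len(b_enc)
    pre = [sum(h * W_enc[j][i] for j, h in enumerate(hidden)) + b_enc[i]
           for i in range(width)]
    return [max(0.0, p) for p in pre]  # ReLU keeps activations sparse and non-negative

# Toy 2-dim hidden state, 2-feature SAE:
print(sae_encode([1.0, -2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.5]))
# → [1.0, 0.0]
```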
🔍 Four Methods to Discover Semantics
1. Activation Pattern Analysis (Top Activations) [most common]
Observe "which tokens activate this feature" to infer its semantics.
2. Feature Comparison [comparative analysis]
Compare activation differences between two texts to find significantly different features.
3. Steering Intervention (Causal Intervention) [strongest verification]
Enhance or suppress features and observe changes in model output.
4. Automated Naming [domain-general]
Use an LLM to summarize Top Activations and automatically name features.
🔥 Method 1: Activation Pattern Analysis (Heatmap)
Principle
Each SAE feature is activated on specific inputs. By observing "which tokens activate this feature", we can infer its semantics.
Code Implementation
def feature_heatmap_to_html(tokens, features, top_k, skip_first):
    # features: [seq_len, sae_width] SAE activations for each token
    seq_len, sae_width = features.shape
    # Rank features by their mean activation over the sequence
    mean_activation = features.mean(dim=0)
    topk_indices = mean_activation.topk(top_k).indices
    for feat_idx in topk_indices:
        # Per-token activations for this feature form one heatmap row
        row_activations = features[:, feat_idx]
        # ... render row_activations as colored HTML cells ...
Example: activation heatmap of feature #1234 (per-token activations: 0.1, 0.3, 0.2, 0.9, 0.1, 0.85)
→ This feature activates strongest on "China" (0.9) and "Beijing" (0.85) → Inference: it may represent "China-related concepts"
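The ranking step behind this inference can be sketched in plain Python (top_activating_tokens is a hypothetical helper; the tokens and activation values below are illustrative, not from a real model run):

```python
def top_activating_tokens(tokens, activations, k=3):
    """Return the k tokens with the highest activation for one SAE feature."""
    # Pair each token with its activation and sort strongest-first
    pairs = sorted(zip(tokens, activations), key=lambda p: p[1], reverse=True)
    return pairs[:k]

tokens = ["I", "love", "visiting", "China", "and", "Beijing"]
acts = [0.1, 0.3, 0.2, 0.9, 0.1, 0.85]  # hypothetical activations of feature #1234

print(top_activating_tokens(tokens, acts))
# → [('China', 0.9), ('Beijing', 0.85), ('love', 0.3)]
```

Reading off the top entries is exactly the "which tokens activate this feature" inference described above.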
⚖️ Method 2: Feature Comparison
Principle
Compare SAE activation differences between two texts to find features that activate strongly in one text but weakly in another.
Example
Text 1
"I like programming, writing code is fun"
Text 2
"I like swimming, exercise is healthy"
| Feature Index | Text 1 Activation | Text 2 Activation | Difference | Inferred Semantics |
|---|---|---|---|---|
| #5678 | 0.85 | 0.12 | +0.73 | May be related to "programming" |
| #9012 | 0.10 | 0.78 | -0.68 | May be related to "sports" |
| #3456 | 0.65 | 0.62 | +0.03 | General concept (e.g., "like") |
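A minimal sketch of this comparison (compare_features is a hypothetical helper, not Qwen-Scope's cb_compare; the per-feature mean activations mirror the table above):

```python
def compare_features(acts_a, acts_b, top_k=2):
    """Rank features by |mean activation difference| between two texts.

    acts_a, acts_b: per-feature mean activations (equal-length lists).
    Returns (feature_position, signed difference) pairs, largest gap first.
    """
    diffs = [(i, round(a - b, 2)) for i, (a, b) in enumerate(zip(acts_a, acts_b))]
    diffs.sort(key=lambda d: abs(d[1]), reverse=True)
    return diffs[:top_k]

# Positions 0/1/2 stand in for features #5678, #9012, #3456:
text1 = [0.85, 0.10, 0.65]
text2 = [0.12, 0.78, 0.62]

print(compare_features(text1, text2))
# → [(0, 0.73), (1, -0.68)]
```

The near-zero feature (#3456, +0.03) drops out of the top-k list, exactly as a shared concept like "like" should.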
🎯 Method 3: Steering Intervention (Causal Verification)
Principle
This is the strongest semantic verification method. Artificially enhance or suppress a feature's activation and observe changes in the model's output.
Mathematical Essence
hidden_new = hidden_old + strength × W_dec[feat_idx]
Code Implementation
def _steer_hook(module, inp, out):
    # out[0]: hidden states of shape [batch, seq_len, d_model]
    hidden = out[0].clone()
    # Push the hidden state at position `pos` along the feature's decoder direction
    hidden[:, pos, :] += strength * sae['W_dec'][feat_idx]
    return (hidden,) + out[1:]
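The update rule can also be sketched in isolation with plain Python lists (steer is a hypothetical helper; in the real hook, hidden is a torch tensor and strength, pos, sae, and feat_idx come from the enclosing scope):

```python
def steer(hidden, w_dec_row, strength):
    """Shift a hidden-state vector along one SAE decoder direction.

    hidden:    list[float], the residual-stream vector at one position
    w_dec_row: list[float], the feature's decoder row W_dec[feat_idx]
    strength:  float; negative values suppress the feature instead
    """
    return [h + strength * w for h, w in zip(hidden, w_dec_row)]

hidden = [1.0, 0.0, -0.5]
direction = [0.0, 1.0, 0.0]  # toy decoder row for feature #1234

print(steer(hidden, direction, strength=4.0))
# → [1.0, 4.0, -0.5]
```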
Example
Original
Prompt: "The capital of France is"
Output: "Paris"
After enhancing feature #1234
Prompt: "The capital of France is"
Output: "Beijing" or China-related content
→ Confirms feature #1234 is related to "China" semantics
🤖 Method 4: Automated Naming
⚠️ Qwen-Scope itself does not implement this, but it is a common method in the SAE interpretation field.
Standard Process
① Run the model on a large corpus and collect Top Activations for each feature
↓
② For each feature, collect the 100-1000 tokens/sentences with the strongest activations
↓
③ Use an LLM for automatic naming: "This feature activates strongest on the following texts: [list], please give it a semantic name"
↓
④ Manually review the naming results
Example
"dog" (0.92), "cat" (0.89), "pet" (0.85), "animal" (0.82), "fish" (0.78)...
📋 Complete Workflow
Step 1: Select a feature index (e.g., #12345)
↓
Step 2: Activation pattern analysis → What text triggers it?
↓
Step 3: Feature comparison → In what scenarios does it activate more strongly?
↓
Step 4: Steering intervention → Enhance/suppress it; how does model behavior change?
↓
Step 5: Semantic naming → Describe the feature's semantics in natural language
🔍 Key Code Location Index
| Function | Function Name | Line Number | Description |
|---|---|---|---|
| SAE Feature Calculation | compute_sae_features | 252-265 | Hidden State → SAE Feature |
| Heatmap Generation | feature_heatmap_to_html | 295-413 | Feature×Token Activation Heatmap |
| Feature Analysis Callback | cb_analyze | 506-517 | Entry: analyze feature activations of a text |
| Steering Strength | _steering_strength_from_mode | 521-545 | Map Light/Medium/Strong |
| Steering Hook | _steer_hook | ~820 | Modify hidden state to inject a feature |
| Feature Comparison | cb_compare | 1082+ | Compare feature differences between two texts |
| Comparison Result Rendering | compare_to_html | 859+ | Render the difference-feature table |
💡 Analogy
The correspondence from SAE feature index to semantics is like discovering new elements in chemistry: you can't look up a dictionary to learn the properties of element 118; you have to discover them through experiments.
Feature index #12345 itself is just a number; its semantics are defined by what inputs it activates on and what impact it has on model output.
Based on Qwen-Scope source code analysis | SAE feature interpretation mechanism research