Abstract
This project demonstrates how feature steering and activation clamping can be used to causally influence the outputs of a transformer language model. High-level semantic directions, such as sentiment, are identified by contrasting activations from positive and negative prompts and injected into the residual stream to steer generation. Activation clamping is also applied, fixing specific neuron activations to provide precise control. These techniques support interpretable, fine-grained behavior modulation, contributing to more productive model use as well as to AI safety and ethical alignment by making model behavior more transparent and controllable.
Submitted to Stanford University in Summer 2025.
I. Introduction
Transformer language models, such as GPT-2, represent language in high-dimensional spaces using a sequence of learned transformations. While these models produce state-of-the-art results, their internal reasoning remains largely opaque. This project explores two techniques—feature steering and activation clamping—to probe and influence model behavior. These interventions aim to make the models more interpretable, controllable, and productive.
II. Feature Steering
Feature steering is the process of biasing the model's internal representation along a specific semantic direction. In this experiment, we focus on sentiment by constructing a direction vector in the residual stream space using two contrasting prompts: "I love this" (positive) and "I hate this" (negative).
# Setup (assumes the TransformerLens library, whose hook names are used below)
import torch
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2").to(device)
tokenizer, layer = model.tokenizer, 8  # layer choice discussed below

# Compute sentiment direction from contrasting prompts
pos_ids = tokenizer("I love this", return_tensors="pt")["input_ids"].to(device)
neg_ids = tokenizer("I hate this", return_tensors="pt")["input_ids"].to(device)
_, pos_cache = model.run_with_cache(pos_ids)
_, neg_cache = model.run_with_cache(neg_ids)
v_sentiment = (pos_cache[f"blocks.{layer}.hook_resid_post"][0, -1] -
               neg_cache[f"blocks.{layer}.hook_resid_post"][0, -1]).detach()
This vector is then added to the residual stream during text generation to steer the output in a positive direction:
alpha = 5.0  # steering strength (illustrative; the source does not specify a value)
def steering_hook(value, hook):
    # Shift the final token's residual stream along the sentiment direction
    value[:, -1, :] += alpha * v_sentiment
    return value
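To apply this hook during generation, it is registered at the same residual-stream hook point used to compute the direction. The following is a minimal usage sketch assuming the TransformerLens hooks and generate APIs; the prompt is taken from the Results section, while the token count is an illustrative choice:

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", steering_hook)]):
    steered = model.generate("The food at the restaurant was", max_new_tokens=20)
print(steered)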
Why Layer 8?
GPT-2 (small) has 12 transformer layers. Prior interpretability work suggests that early layers capture mostly lexical and syntactic properties, while middle-to-late layers (roughly 6–10) are semantically rich. Layer 8 is selected because it balances abstraction against remaining influence on the output, making it an effective target for semantic manipulation.
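One way to sanity-check this layer choice is to measure how strongly the two contrast prompts separate at each layer. The sketch below is an illustrative heuristic, not part of the original experiment: it reuses the caches computed above and prints, per layer, the cosine similarity between the two prompts' final-token residuals, where lower similarity indicates a sharper sentiment contrast.

import torch.nn.functional as F

# Cosine similarity of positive vs. negative final-token residuals, per layer
for l in range(model.cfg.n_layers):
    pos = pos_cache[f"blocks.{l}.hook_resid_post"][0, -1]
    neg = neg_cache[f"blocks.{l}.hook_resid_post"][0, -1]
    print(f"layer {l}: cos = {F.cosine_similarity(pos, neg, dim=0).item():.3f}")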
III. Activation Clamping
Clamping involves overriding the activation of a specific neuron to test its causal effect. Unlike steering, which is directional, clamping fixes the value regardless of context.
neuron_index, override_value = 100, 10.0  # values described below
def clamping_hook(value, hook):
    # Fix one neuron's activation at the final position, regardless of context
    value[:, -1, neuron_index] = override_value
    return value
In this experiment, a random neuron (index 100) in layer 8 is clamped to a high value (10.0). This demonstrates how a single neuron can disproportionately influence model output.
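As with steering, the clamp can be applied as a forward hook during generation. The sketch below assumes the same residual-stream hook point used in the steering example, since the source does not name the exact hook site for clamping; the token count is illustrative:

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", clamping_hook)]):
    clamped = model.generate("The food at the restaurant was", max_new_tokens=20)
print(clamped)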
IV. Results
Both techniques were applied to the same prompt:
- Prompt: "The food at the restaurant was"
- Output with feature steering: More positively worded completions
- Output with clamping: Strong stylistic or semantic shift, sometimes unrelated
V. Conclusion
Feature steering and clamping provide interpretable and causal ways to modify transformer behavior. Steering enables smooth control over semantics, while clamping helps identify influential neurons. Together, these methods contribute to model transparency, productivity, and safe alignment with human intent.
Keywords
Transformer, Feature Steering, Activation Clamping, Interpretability, GPT-2, Residual Stream, AI Safety