Abstract
This project demonstrates how feature steering and activation clamping can be used to causally influence the outputs of a transformer language model. High-level semantic directions, such as sentiment, are identified by contrasting activations from positive and negative prompts and injected into the residual stream to steer generation. Activation clamping is also applied, fixing specific neuron activations to provide precise control. These techniques support interpretable, fine-grained behavior modulation, contributing to more productive model use as well as to AI safety and ethical alignment by making model behavior more transparent and controllable.
Submitted to Stanford University in Summer 2025.
I. Introduction
Transformer language models, such as GPT-2, represent language in high-dimensional spaces using a sequence of learned transformations. While these models produce state-of-the-art results, their internal reasoning remains largely opaque. This project explores two techniques—feature steering and activation clamping—to probe and influence model behavior. These interventions aim to make the models more interpretable, controllable, and productive.
II. Feature Steering
Feature steering is the process of biasing the model's internal representation along a specific semantic direction. In this experiment, we focus on sentiment by constructing a direction vector in the residual stream space using two contrasting prompts: "I love this" (positive) and "I hate this" (negative).
# Setup (assumes the TransformerLens library, whose hook names are used below)
import torch
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2").to(device)
tokenizer, layer = model.tokenizer, 8  # layer choice discussed below

# Compute sentiment direction from contrasting prompts
pos_ids = tokenizer("I love this", return_tensors="pt")["input_ids"].to(device)
neg_ids = tokenizer("I hate this", return_tensors="pt")["input_ids"].to(device)
_, pos_cache = model.run_with_cache(pos_ids)
_, neg_cache = model.run_with_cache(neg_ids)
v_sentiment = (pos_cache[f"blocks.{layer}.hook_resid_post"][0, -1] -
               neg_cache[f"blocks.{layer}.hook_resid_post"][0, -1]).detach()
This vector is then added to the residual stream during text generation to steer the output in a positive direction:
alpha = 5.0  # steering strength (illustrative; the source does not specify a value)
def steering_hook(value, hook):
    # Shift the final token's residual stream along the sentiment direction
    value[:, -1, :] += alpha * v_sentiment
    return value
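To apply this hook during generation, it is registered at the same residual-stream hook point used to compute the direction. The following is a minimal usage sketch assuming the TransformerLens hooks and generate APIs; the prompt is taken from the Results section, while the token count is an illustrative choice:

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", steering_hook)]):
    steered = model.generate("The food at the restaurant was", max_new_tokens=20)
print(steered)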
Why Layer 8?
GPT-2 (small) has 12 transformer layers. Prior interpretability work suggests that early layers capture mostly lexical and syntactic properties, while middle-to-late layers (roughly 6–10) are semantically rich. Layer 8 is selected because it balances abstraction against remaining influence on the output, making it an effective target for semantic manipulation.
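One way to sanity-check this layer choice is to measure how strongly the two contrast prompts separate at each layer. The sketch below is an illustrative heuristic, not part of the original experiment: it reuses the caches computed above and prints, per layer, the cosine similarity between the two prompts' final-token residuals, where lower similarity indicates a sharper sentiment contrast.

import torch.nn.functional as F

# Cosine similarity of positive vs. negative final-token residuals, per layer
for l in range(model.cfg.n_layers):
    pos = pos_cache[f"blocks.{l}.hook_resid_post"][0, -1]
    neg = neg_cache[f"blocks.{l}.hook_resid_post"][0, -1]
    print(f"layer {l}: cos = {F.cosine_similarity(pos, neg, dim=0).item():.3f}")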
III. Activation Clamping
Clamping involves overriding the activation of a specific neuron to test its causal effect. Unlike steering, which is directional, clamping fixes the value regardless of context.
neuron_index, override_value = 100, 10.0  # values described below
def clamping_hook(value, hook):
    # Fix one neuron's activation at the final position, regardless of context
    value[:, -1, neuron_index] = override_value
    return value
In this experiment, a random neuron (index 100) in layer 8 is clamped to a high value (10.0). This demonstrates how a single neuron can disproportionately influence model output.
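As with steering, the clamp can be applied as a forward hook during generation. The sketch below assumes the same residual-stream hook point used in the steering example, since the source does not name the exact hook site for clamping; the token count is illustrative:

with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", clamping_hook)]):
    clamped = model.generate("The food at the restaurant was", max_new_tokens=20)
print(clamped)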
IV. Results
Both techniques were applied to the same prompt:
- Prompt: "The food at the restaurant was"
- Output with feature steering: More positively worded completions
- Output with clamping: Strong stylistic or semantic shift, sometimes unrelated
V. Conclusion
Feature steering and clamping provide interpretable and causal ways to modify transformer behavior. Steering enables smooth control over semantics, while clamping helps identify influential neurons. Together, these methods contribute to model transparency, productivity, and safe alignment with human intent.
Keywords
Transformer, Feature Steering, Activation Clamping, Interpretability, GPT-2, Residual Stream, AI Safety