hpatel.dev


Steering vectors are a technique for controlling the behavior of language models by adding specific directions to the model's internal representations. This allows you to encourage certain behaviors (like being more humorous) without fine-tuning!

Disclaimer

Steering vectors don't perfectly capture behaviors and are not guaranteed to make the model act exactly as intended. They may also have unintended effects due to variability across inputs, potential biases from the contrasting prompts, and limited generalization across different scenarios. Another cause is something called superposition. Basically, a single target behavior is probably not represented by one simple linear direction in the model's activation space, so it cannot be perfectly captured by a steering vector.

First, come up with user prompts and corresponding positive and negative assistant responses. The positive responses should be examples of how you want the model to behave, and the negative responses should be examples of either the opposite behavior or ordinary responses. These are used to calculate one steering vector per layer by taking the mean difference between the residual stream activations of the positive and negative responses after that layer.
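The mean-difference calculation described above can be sketched in a few lines. This is only an illustration, not the demo's actual code: it uses plain Python lists in place of real model tensors, the names `mean_vector` and `steering_vectors` are hypothetical, and it assumes each response has already been reduced to a single activation vector per layer (the page does not say how token positions are pooled).

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vectors(pos_acts, neg_acts):
    """Return one steering vector per layer.

    pos_acts[layer][example] and neg_acts[layer][example] are activation
    vectors (lists of floats). Assumption: one vector per response per layer.
    """
    vectors = []
    for pos_layer, neg_layer in zip(pos_acts, neg_acts):
        pos_mean = mean_vector(pos_layer)
        neg_mean = mean_vector(neg_layer)
        # Steering vector = mean(positive) - mean(negative) for this layer.
        vectors.append([p - n for p, n in zip(pos_mean, neg_mean)])
    return vectors


# Toy data: 2 layers, 2 examples per side, hidden size 3.
pos = [[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
       [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]]
neg = [[[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
       [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]

vecs = steering_vectors(pos, neg)
print(vecs)
```

With a real model, the toy lists would be replaced by residual stream activations captured while the model processes each prompt-response pair.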

Then, enter a user prompt and see how the model responds to it. You can also change the layer and scaling factor to see how the steering vectors affect the model's responses.
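To make the layer and scaling factor concrete, here is a minimal sketch of the steering step itself: the chosen layer's steering vector, multiplied by the scaling factor, is added to the activation at every token position. The function name `apply_steering` and the choice to steer all token positions are assumptions for illustration; the demo may apply the vector differently.

```python
def apply_steering(hidden, steering_vector, scale):
    """Add scale * steering_vector to each token's activation vector.

    hidden: list of per-token activation vectors at the chosen layer.
    Assumption: every token position is steered by the same amount.
    """
    return [
        [h + scale * v for h, v in zip(token, steering_vector)]
        for token in hidden
    ]


# Toy activations for two token positions with hidden size 3.
hidden = [[0.5, 0.0, -1.0], [1.0, 1.0, 1.0]]
vector = [1.0, 2.0, 2.0]

steered = apply_steering(hidden, vector, scale=2.0)
unchanged = apply_steering(hidden, vector, scale=0.0)
print(steered)
```

A scaling factor of 0 leaves the activations untouched, larger values push the model harder toward the positive examples, and a negative value would push it toward the negative examples instead.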

Model Name

1. Generate Steering Vector

Create a steering vector by providing user prompts with corresponding positive and negative assistant responses. The positive responses should exemplify the desired behavior, and the negative responses should exemplify the opposite or undesired behavior. Check the presets for some examples!

Preset

Prompt-Response Pairs (1/10)

User Prompt 1

Positive Assistant Response

Negative Assistant Response

Steering Vectors Demo


2. Run Model with Steering

Generate steering vectors first to unlock this section.
