Steering vectors are a technique for controlling the behavior of language models by adding specific directions to the model's internal representations. This allows you to encourage certain behaviors (like being more humorous) without fine-tuning!
First, come up with user prompts and corresponding positive and negative assistant resposnes. The positive responses should be examples of how you want the model to behave, and the negative responses should be examples of either the opposite of how you want the model to behave or examples of normal responses. These will be used to calculate the steering vectors by taking the mean difference between the residual stream activations of the positive and negative responses after each layer.
Then, enter a user prompt and see how the model responds to it. You can also change the layer and scaling factor to see how the steering vectors affect the model's responses.
Create a steering vector by providing user prompts with corresponding positive and negative assistant responses. The positive responses should exemplify the desired behavior, and the negative responses should exemplify the opposite or undesired behavior. Check the presets for some examples!