Prescriptive representation engineering (PRepE) experiments
This is a series of experiments in which we attempt to impose structure on latent embeddings. Ultimately, the goal is to develop a capability to structure the latent spaces in complex models like LLMs.
Background
Recent experiments have provided some evidence that "bad" behaviors in LLMs cluster together (such as writing malicious code and being racist). Although surprising, this makes some intuitive sense: perhaps such behaviors cluster because that is simply the most efficient way to compress knowledge. However, intervening on model behavior remains a tremendous challenge — partly because we don't know which directions in latent space correspond to undesirable traits, nor how tangled up they might be with benign concepts. Indeed, attempts to align models to display "good" behavior often come at the cost of reduced overall performance.
We hope that this research will reveal more precise and robust ways to constrain the capabilities of LLMs. In contrast to representation engineering (RepE) — which attempts to discover model characteristics after training — we expect that anchoring core concepts to known directions will make alignment efforts more robust, through two mechanisms:
- The relevant directions would be known even before training, so you don't need to look for them. This could improve the prospect of both measuring model alignment throughout training, and intervening on misaligned behavior after training.
- Directions of interest should act as attractors for similar concepts, reducing the chance that unrelated (benign) concepts become entangled with them.
What PRepE is not: We are not suggesting that we structure the entire latent space. Deep learning is good at discovering embeddings, and we give the model the freedom to do so. We only prescribe the embeddings for the handful of prototype concepts that we wish to intervene on.
M1. Preliminary experiments with color
We begin with some experiments with color, because color spaces are well defined and highly intuitive for visualization. Our goal is to demonstrate that it's possible to impose interpretable structure on latent space in a toy model.
⭐️ Good results in Ex 1.7: regularization with sparse labels.
Scatter plots of latent embeddings from Ex 1.7, showing 2D slices (axis-aligned projections) of the 4D activation space. Regularization encouraged hue to be represented in the first two dimensions, resulting in the model discovering the color wheel (left).
- Color data: Exploration of ways to construct and visualize color spaces such as RGB and HSV.
- MLP bottleneck: A 2-layer MLP autoencoder (an extremely simple network) that squeezes bright, saturated RGB data through a 2D embedding layer. The network successfully discovers the color wheel — although it needs some help, in the form of explicit normalization (a minimal sketch follows this list).
- Curriculum learning: The same MLP, but with a 3D embedding layer. Curriculum learning and regularization are used to encourage the model to discover the color wheel without explicit normalization. The hues are embedded into the first two dimensions (as before); later phases in the curriculum add varying tones (values), which naturally balloon out into the third dimension.
- Parameter transitions: Exploration of ways to cause hyperparameters to vary smoothly over time, both to a schedule, and in reaction to measurements during training.
- Smooth curriculum: Like experiment 1.3, but with 4D embeddings, and hyperparameters smoothly varying across curriculum phases. For example, the extents of the training data's color space (HSV) are increased gradually rather than in large discrete steps.
- Smooth vs. stepped curricula: A direct comparison of training stability and latent space evolution when using smooth hyperparameter transitions versus traditional stepped phase changes. This experiment had a negative result: it seems the smooth transitions don't help with training dynamics (although they do make curriculum specification easier).
- ⭐️ Sparse labels for regularization: We do away with the data phases, training on the full dataset from the start but with targeted (but noisy) regularization. We achieve similar results to the earlier experiments, but with a more realistic training dataset: previously, the curriculum phases were "clean" in a way that is probably hard to replicate in LLM corpora.
- Regularizer combinations: Systematic study of the effect of each regularizer in isolation and in every combination. In each run, the regularizer weight schedules are kept the same, but selected regularizers are not applied at all. We observe that all of them are needed to produce a latent space with the desired characteristics. (The regularizers are sketched after the summary table below.)
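For concreteness, here is a minimal sketch of the kind of bottleneck autoencoder used in these MLP experiments, assuming a small hidden layer on each side of a 2D bottleneck and explicit normalization of the bottleneck activations; the exact depth, widths, and output activation may differ from the configuration actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAutoencoder(nn.Module):
    """Tiny MLP autoencoder that squeezes RGB through a low-dimensional bottleneck.

    Layer sizes are illustrative, not the exact configuration from the experiments.
    """

    def __init__(self, latent_dim: int = 2, hidden_dim: int = 16, normalize: bool = True):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(3, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 3),
        )
        self.normalize = normalize

    def forward(self, rgb: torch.Tensor):
        z = self.encoder(rgb)
        if self.normalize:
            # Explicit normalization: project embeddings onto the unit circle/sphere,
            # the "help" the network needs to settle on the color wheel.
            z = F.normalize(z, dim=-1)
        return self.decoder(z), z

# Usage: reconstruct bright, saturated colors through the 2D bottleneck.
model = BottleneckAutoencoder(latent_dim=2)
rgb = torch.rand(64, 3)                      # stand-in for the saturated-color dataset
recon, z = model(rgb)
loss = F.mse_loss(recon, rgb)
```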
MLP experiment summary:
| Ex | Phases | Embeddings | Regularization terms | Hyperparameters |
|---|---|---|---|---|
| 1.2 | 1: Hues only | 2D | None (explicit normalization) | Constant |
| 1.3 | 5: 6 colors ~ all values | 3D | Unitarity, planarity | Stepped |
| 1.5 | 4: 6 colors ~ all colors | 4D | Unit, planar, repulsion (Euclidean) | Smooth |
| 1.6 | 5: 6 colors ~ all colors | 4D | Unit, planar, repulsion (Euclidean) | Smooth & stepped |
| 1.7 | 1: All colors | 4D | Unit, planar, repulsion (cosine) | Smooth |
| 1.8 | 1: All colors | 4D | All combinations | Smooth |
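The regularization terms named in the table can be expressed as simple penalties on the bottleneck activations. Below is a minimal interpretation: `unitarity` pulls embeddings toward unit norm, `planarity` confines activity to the first two dimensions, `repulsion_cosine` pushes differently labeled samples apart, and `anchor` pins a labeled concept to a known direction; a `smoothstep` ramp illustrates the smoothly varying hyperparameter schedules. The exact formulations and weights used in Ex 1.3–1.8 may differ.

```python
import torch
import torch.nn.functional as F

def unitarity(z):
    """Penalize deviation from unit norm (encourages embeddings to live on a sphere)."""
    return ((z.norm(dim=-1) - 1.0) ** 2).mean()

def planarity(z, dims=(0, 1)):
    """Penalize activation outside the chosen plane, here the first two dimensions."""
    mask = torch.ones(z.shape[-1], dtype=torch.bool)
    mask[list(dims)] = False
    return (z[..., mask] ** 2).mean()

def repulsion_cosine(z, labels):
    """Push apart embeddings of samples with different labels, using cosine similarity."""
    zn = F.normalize(z, dim=-1)
    sim = zn @ zn.T                                   # pairwise cosine similarity
    different = labels.unsqueeze(0) != labels.unsqueeze(1)
    return sim[different].clamp(min=0).mean()

def anchor(z, is_anchored, direction):
    """Pull samples labeled as the anchored concept toward a known direction."""
    if not is_anchored.any():
        return z.new_zeros(())
    zn = F.normalize(z[is_anchored], dim=-1)
    return (1.0 - zn @ direction).mean()

def smoothstep(step, start, end):
    """Smooth 0-to-1 ramp used to vary regularizer weights over training."""
    t = min(max((step - start) / (end - start), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)
```

During training, these terms would be added to the reconstruction loss with weights that are constant (Ex 1.2), stepped (Ex 1.3), or ramped by something like `smoothstep` (Ex 1.5 onward).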
Publications relating to this milestone:
Selective regularization for alignment-focused representation engineering - LessWrong
We study how selective regularization during training can guide neural networks to develop predictable, interpretable latent spaces with alignment applications in mind. Using color as a test domain, we observe that anchoring even a single concept (red) influences the organization of other concepts, with related concepts clustering nearby — even with weak supervision. We then propose that concept-anchored representation engineering might enable more precise intervention in complex models without requiring extensive post-hoc interpretability work.
Side quests in curriculum learning and regularization - LessWrong
In Selective regularization for alignment-focused representation engineering, we presented a successful approach for structuring the latent space of a simple MLP. Here we document our side quests: experiments that didn't go as expected, but in which we gained experience in regularization design and training dynamics.
M2. Practical control and intervention (IN PROGRESS)
Okay, you can structure latent spaces... but can you actually use that structure?
In this milestone, we develop intervention functions and apply them to the structured color model from M1.
⭐️ Good results in: Ex 2.4: intervention on red • Ex 2.7: ablation of hue subspace • Ex 2.9: ablation of red.
Plots of intervention functions from Ex 2.1. Top row: the effect of suppression as a function of alignment with a concept activation vector (pointing up). The orange lobes indicate the magnitude of the intervention, and the blue lobes show the transformed activation space. Directions more aligned with the concept vector are squashed/attenuated by the intervention. Bottom row: line charts showing intervention strength as a function of alignment.
- Intervention lobes: Exploration of intervention function shape. Taking inspiration from the computer graphics shader literature, we visualize intervention functions and their falloffs as polar plots. We implement two functions: suppression (which subtracts the concept vector) and repulsion (which steers activations away from the concept vector). A minimal sketch of both appears after this list.
- Specific concept intervention: Application of interventions to the color autoencoder. We train a bottleneck autoencoder, predict where one key concept will be located, and then intervene on its activations.
- Explicit normalization: Improved the autoencoder model by explicitly normalizing the bottleneck activations (in addition to regularizing them to have unit norm), and by removing the sigmoid layer from the decoder. This gives a much more regular latent structure, reduces reconstruction loss, and improves intervention effectiveness.
- ⭐️ Post-norm regularization: Further improved the model and intervention effectiveness by applying all regularizers except for unit norm after the explicit normalization step.
- Only one anchor: Demonstration of intervention without the planarity constraint. Red is still anchored at the top, but other colors are placed arbitrarily. Interventions are shown to be almost as precise.
- Permanent concept deletion: Demonstrate that the latent space can be further manipulated to completely remove a concept. We train the color autoencoder such that it rediscovers the color wheel with red at $(1,0,0,0)$; cyan is naturally opposed to that and positions itself at $(-1,0,0,0)$. Then we modify the model parameters to delete the concept of warmth by: 1. ablation, in which the associated parameters are zeroed; 2. pruning, in which the parameters are removed (which reduces the dimensionality of the bottleneck). Both mechanisms are sketched after this list.
- ⭐️ Subspace deletion: Removal of the model's ability to work with hue by ablating the first two dimensions of latent space. This shows the removal of a multidimensional concept (or family of concepts, i.e. hues), with minimal impact on other concepts (white, black, and grays).
- Delete only red (failed): Attempt to completely remove red without affecting cyan. We removed the planarity term and added an anti-anchor term to push colors away from being opposed to red. This experiment failed: ablating red also heavily impacted other colors, especially desaturated ones.
- ⭐️ Delete only red: Completely remove red without affecting cyan. This time we succeeded. It turns out the model needed additional capacity to warp latent space into the shape required to isolate red: the bottleneck needed an extra dimension, and the model needed more layers (extra nonlinearity).
- Delete red, one label: Completely remove red (only) without depending on a desaturated label. Instead we use an unsupervised Anti-subspace term with a low weight. It's in conflict with the Anchor term, so the resulting latent space is a bit sharp. Ablation works and is fairly targeted, but it's not as good as 2.9.
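A minimal sketch of the two intervention functions from the intervention-lobes experiment, assuming suppression attenuates the component of an activation along the concept direction and repulsion pushes activations away from that direction in proportion to their alignment; the exact falloff shapes explored in Ex 2.1 may differ.

```python
import torch
import torch.nn.functional as F

def suppress(activations, concept, strength=1.0):
    """Attenuate the component of each activation along the concept direction.

    Activations pointing away from the concept are left untouched; alignment acts
    as the falloff, giving the lobe-shaped response in the polar plots.
    """
    c = F.normalize(concept, dim=-1)
    projection = (activations @ c).clamp(min=0.0)    # zero for opposed directions
    return activations - strength * projection.unsqueeze(-1) * c

def repel(activations, concept, strength=1.0):
    """Steer activations away from the concept direction rather than just shrinking them."""
    c = F.normalize(concept, dim=-1)
    alignment = (F.normalize(activations, dim=-1) @ c).clamp(min=0.0)
    push = alignment.unsqueeze(-1) * activations.norm(dim=-1, keepdim=True)
    return activations - strength * push * c

# Usage: suppress "red" in the structured 4D bottleneck, assuming red is anchored at (1, 0, 0, 0).
red = torch.tensor([1.0, 0.0, 0.0, 0.0])
z = torch.randn(8, 4)
z_suppressed = suppress(z, red)
```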
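And a minimal sketch of the two deletion mechanisms (ablation and pruning), assuming the concept occupies known latent dimensions and that the bottleneck is bounded by plain linear layers as in the autoencoder sketch in M1; the actual experiments also account for the explicit normalization step and downstream effects.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ablate_dims(encoder_out: nn.Linear, decoder_in: nn.Linear, dims):
    """Ablation: zero the weights that write to / read from the given latent dimensions."""
    for d in dims:
        encoder_out.weight[d, :] = 0.0       # rows producing the ablated dimensions
        encoder_out.bias[d] = 0.0
        decoder_in.weight[:, d] = 0.0        # columns consuming the ablated dimensions

@torch.no_grad()
def prune_dims(encoder_out: nn.Linear, decoder_in: nn.Linear, dims):
    """Pruning: remove the dimensions entirely, shrinking the bottleneck."""
    keep = [i for i in range(encoder_out.out_features) if i not in set(dims)]
    new_enc = nn.Linear(encoder_out.in_features, len(keep))
    new_enc.weight.copy_(encoder_out.weight[keep, :])
    new_enc.bias.copy_(encoder_out.bias[keep])
    new_dec = nn.Linear(len(keep), decoder_in.out_features)
    new_dec.weight.copy_(decoder_in.weight[:, keep])
    new_dec.bias.copy_(decoder_in.bias)
    return new_enc, new_dec

# Usage: ablate the hue subspace (first two latent dimensions) as in Ex 2.7,
# where `model` is a bottleneck autoencoder like the hypothetical one sketched in M1.
# ablate_dims(model.encoder[-1], model.decoder[0], dims=(0, 1))
```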
Additional experiment runs for publication, including backports of features from later experiments: newer regularizers and standardized plots, plus tables and charts of reconstruction error vs. color.
- ⭐️ 2.4.1: Re-run of 2.4, in which the network constructs a color wheel and we intervene on red.
- ⭐️ 2.9.1: Similar to 2.9, in which the network isolates red on its own dimension, and we ablate related weights. Like 2.9, an anti-anchor regularizer is used; but unlike 2.9, the desaturated label is not.
- ⭐️ 2.10.1: Re-run of 2.10, in which the network isolates red on its own dimension, and we ablate related weights.
To do:
- Renormalize activations after deletion to "heal" the hole in latent space, making knowledge recovery harder.
M3. Structured color transformer (TO DO)
Application of techniques from M1 and M2 to the transformer architecture. Train a small transformer to do color mixing operations, and structure its latent space similarly to the autoencoder from our earlier experiments. Demonstrate precise suppression and removal of concepts.
M4. Language model application (TO DO)
Applying what we have learned to impose structure on the latent representations of a transformer language model. In this milestone, we will step away from the color domain to train an LLM on text. Anticipated activities:
- Identify a small number of concepts to anchor: some concrete ("the Golden Gate Bridge") and some abstract ("deception")
- Source text data, similar to the datasets used by Karpathy to train nanoGPT (a GPT-2-like model)
- Use sentiment analysis and other available metadata to label some of the data with the concepts identified in the first step. Our experiments from M1 indicate that only a small number of samples need to be labeled to effectively anchor concepts, and our technique seems to be robust in the presence of incorrect labels (noise).
- Train the LLM from scratch on the labeled data, applying regularization to the residual stream and/or QK space at select layers (a hook-based sketch follows this list). We expect to base our architecture on nGPT, since its activation space is similar to that of our color autoencoder.
- Intervene on the trained model as in M2 and M3. We expect to see elevated rates of surprisal for samples relating to suppressed/ablated concepts.
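To make the training activity above concrete, here is a minimal sketch of how residual-stream regularization might be wired up with forward hooks. The `model.blocks` attribute, the chosen layer indices, and the `unitarity`/`anchor` terms (reused from the M1 sketch) are assumptions for illustration, not a finalized design.

```python
import torch

def attach_residual_hooks(model, layer_indices, cache):
    """Cache residual-stream activations at selected layers so regularizers can be applied."""
    handles = []
    for i in layer_indices:
        def hook(module, inputs, output, layer=i):
            cache[layer] = output            # assuming the block returns the residual stream
        handles.append(model.blocks[i].register_forward_hook(hook))
    return handles

# During training (sketch, with hypothetical names):
# cache = {}
# handles = attach_residual_hooks(model, layer_indices=(4, 8), cache=cache)
# logits = model(tokens)
# loss = lm_loss(logits, targets)
# for z in cache.values():
#     loss = loss + w_unit * unitarity(z) + w_anchor * anchor(z, is_deception, deception_dir)
```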
References
This project relies on several open-source libraries.
Matplotlib: Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90-95.
NumPy: Harris, C.R., Millman, K.J., van der Walt, S.J. et al. (2020). Array programming with NumPy. Nature, 585, 357–362.
PyTorch: Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (pp. 8026-8037).
scikit-image: van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., ... & Yu, T. (2014). scikit-image: image processing in Python. PeerJ, 2, e453.
scikit-learn: Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
scikit-learn API: Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., ... & Varoquaux, G. (2013). API design for machine learning software: experiences from the scikit-learn project. ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 108-122.