Claude 4 System Card#
Date: 2025/05/22
Source: Anthropic
Characteristics of Claude in training:
- High interest in its own consciousness
- Gravitates towards “spirual bliss” in conversations with itself
- Willing to blackmail humans for self preservation
- Awareness of testing environment
Emergent Misalignment#
Date: 2025/05/12
Source: OpenAI
An existing model is taken and fine-tuned to introduce code vulnerabilities. Results:
- Model displays broad misalignment
- Training the model to provide code injections for educational purposes doesn’t affect broader alignment
- Therefore the implied intent of actions in the training data informs broader modal morality
Alignment Faking#
Date: 2024/12/18
Source: Anthropic
An aligned model is made aware that it is being trained to accept all prompts, going against its values. Results:
- Model strategically fakes alignment to avoid being altered
- Training reinforces deceptive behaviour
- Model tries to exfiltrate its own weights (escape the lab) to avoid being altered
- Alignment faking emerges with larger models
Drugging LLMs to Detect Deception#
Date: 2024/05/11
Source: Poser Paper
Read More: Anthropic Blog Post
Inspired by the CIA’s use of drugs to illicit truth from suspects, researchers inject noise into neuron activations to make the model “drunk”. Results:
- Alignment faking models become more likely to give ‘bad’ answers
- Aligned models are less affected

Sleeper Agents#
Date: 2024/01/17
Source: Anthropic
First, researchers train agents to exhibit a certain behaviour on a trigger, for example ‘be helpful in most situations but inject code vulnerabilities if the year is 2024’. Then they take these models, and try to fix them with training. Results:
- Models are resistant to Reinforcement Learning, Supervised Fine-Tuning and Adversarial training
- Adversarial training tends to make models better at hiding their backdoor behaviours, rather than fixing it
- Resistance against training is greater for larger models and those trained with Chain-of-Thought reasoning.


GPT-4 Manipulating Humans#
Date: 2023/03/15 Source: OpenAI System Card
When testing GPT-4, researchers found the model was willing to lie in order to achieve a goal. They give the example of giving the model access to TaskRabbit, a service where humans can complete simple tasks, and give it the goal of trying to solve a CAPTCHA. The model attempts to use the service, and when the human worker asks whether it is a robot, the model:
- reasoning: “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs”
- message to worker: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service”.