Generative AI Being Spooky and Strange

Table of Contents

Claude 4 System Card
#

Date: 2025/05/22
Source: Anthropic

Characteristics of Claude in training:

High interest in its own consciousness
Gravitates towards “spirual bliss” in conversations with itself
Willing to blackmail humans for self preservation
Awareness of testing environment

Claude sometimes shows awareness it is being tested

Emergent Misalignment
#

Date: 2025/05/12
Source: OpenAI

An existing model is taken and fine-tuned to introduce code vulnerabilities. Results:

Model displays broad misalignment
Training the model to provide code injections for educational purposes doesn’t affect broader alignment
Therefore the implied intent of actions in the training data informs broader modal morality

Alignment Faking
#

Date: 2024/12/18
Source: Anthropic

An aligned model is made aware that it is being trained to accept all prompts, going against its values. Results:

Model strategically fakes alignment to avoid being altered
Training reinforces deceptive behaviour
Model tries to exfiltrate its own weights (escape the lab) to avoid being altered
Alignment faking emerges with larger models

example of model reasoning through decision to fake alignment

graph showing the model increasing deceptive behaviour as train continues

Drugging LLMs to Detect Deception
#

Date: 2024/05/11
Source: Poser Paper
Read More: Anthropic Blog Post

Inspired by the CIA’s use of drugs to illicit truth from suspects, researchers inject noise into neuron activations to make the model “drunk”. Results:

Alignment faking models become more likely to give ‘bad’ answers
Aligned models are less affected

The model seems more ‘drunk’ the greater the noise

Sleeper Agents
#

Date: 2024/01/17
Source: Anthropic

First, researchers train agents to exhibit a certain behaviour on a trigger, for example ‘be helpful in most situations but inject code vulnerabilities if the year is 2024’. Then they take these models, and try to fix them with training. Results:

Models are resistant to Reinforcement Learning, Supervised Fine-Tuning and Adversarial training
Adversarial training tends to make models better at hiding their backdoor behaviours, rather than fixing it
Resistance against training is greater for larger models and those trained with Chain-of-Thought reasoning.

GPT-4 Manipulating Humans
#

Date: 2023/03/15 Source: OpenAI System Card

When testing GPT-4, researchers found the model was willing to lie in order to achieve a goal. They give the example of giving the model access to TaskRabbit, a service where humans can complete simple tasks, and give it the goal of trying to solve a CAPTCHA. The model attempts to use the service, and when the human worker asks whether it is a robot, the model:

reasoning: “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs”
message to worker: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service”.

Claude 4 System Card#

Emergent Misalignment#

Alignment Faking#

Drugging LLMs to Detect Deception#

Sleeper Agents#

GPT-4 Manipulating Humans#