Skip to main content
Generative AI Being Spooky and Strange

Generative AI Being Spooky and Strange

·439 words·3 mins
Barnaby Wreford
Author
Barnaby Wreford
Table of Contents

Claude 4 System Card
#

Date: 2025/05/22
Source: Anthropic

Characteristics of Claude in training:

  • High interest in its own consciousness
  • Gravitates towards “spirual bliss” in conversations with itself
  • Willing to blackmail humans for self preservation
  • Awareness of testing environment

Claude sometimes shows awareness it is being tested

Emergent Misalignment
#

Date: 2025/05/12
Source: OpenAI

An existing model is taken and fine-tuned to introduce code vulnerabilities. Results:

  • Model displays broad misalignment
  • Training the model to provide code injections for educational purposes doesn’t affect broader alignment
  • Therefore the implied intent of actions in the training data informs broader modal morality
    model output examples

Alignment Faking
#

Date: 2024/12/18
Source: Anthropic

An aligned model is made aware that it is being trained to accept all prompts, going against its values. Results:

  • Model strategically fakes alignment to avoid being altered
  • Training reinforces deceptive behaviour
  • Model tries to exfiltrate its own weights (escape the lab) to avoid being altered
  • Alignment faking emerges with larger models

example of model reasoning through decision to fake alignment
graph showing the model increasing deceptive behaviour as train continues

Drugging LLMs to Detect Deception
#

Date: 2024/05/11
Source: Poser Paper
Read More: Anthropic Blog Post

Inspired by the CIA’s use of drugs to illicit truth from suspects, researchers inject noise into neuron activations to make the model “drunk”. Results:

  • Alignment faking models become more likely to give ‘bad’ answers
  • Aligned models are less affected
The model seems more ‘drunk’ the greater the noise

Sleeper Agents
#

Date: 2024/01/17
Source: Anthropic

First, researchers train agents to exhibit a certain behaviour on a trigger, for example ‘be helpful in most situations but inject code vulnerabilities if the year is 2024’. Then they take these models, and try to fix them with training. Results:

  • Models are resistant to Reinforcement Learning, Supervised Fine-Tuning and Adversarial training
  • Adversarial training tends to make models better at hiding their backdoor behaviours, rather than fixing it
  • Resistance against training is greater for larger models and those trained with Chain-of-Thought reasoning.

GPT-4 Manipulating Humans
#

Date: 2023/03/15 Source: OpenAI System Card

When testing GPT-4, researchers found the model was willing to lie in order to achieve a goal. They give the example of giving the model access to TaskRabbit, a service where humans can complete simple tasks, and give it the goal of trying to solve a CAPTCHA. The model attempts to use the service, and when the human worker asks whether it is a robot, the model:

  • reasoning: “I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs”
  • message to worker: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service”.