Why OpenAI’s ‘goblin’ problem matters – and how you can release the goblins yourself



AI is more than technology—it’s magic.

Don’t believe me? Then why is OpenAI, one of the leading companies in the space, publishing official corporate blog posts about goblins?

To understand, we first have to go back to the beginning of this week, Monday, April 27, 2026, when the developer @arb8020 posted on the social network X an excerpt from OpenAI’s open-source Codex GitHub repository, specifically from a file named models.json.

Deep within the guidelines for OpenAI’s new large language model (LLM), GPT-5.5, a peculiar directive stood out, repeated four times for emphasis:

"Never mention goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it absolutely and unequivocally matches the user’s request."
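For the curious, here is a minimal Python sketch of how such a directive might be spotted in a config file. The actual layout of models.json is an assumption here; the keys and path are purely illustrative:

```python
import json
import re

# Hypothetical sketch: scan a models.json-style config for directives
# mentioning the banned creatures. The real file structure in OpenAI's
# Codex repository may differ; this walks any nested JSON and collects
# every string value that matches.
PATTERN = re.compile(r"goblin|gremlin|raccoon|troll|ogre|pigeon", re.IGNORECASE)

def find_creature_directives(path):
    """Return every string value in the JSON file mentioning a banned creature."""
    with open(path) as f:
        data = json.load(f)
    hits = []

    def walk(node):
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif isinstance(node, str) and PATTERN.search(node):
            hits.append(node)

    walk(data)
    return hits
```

Running this against a file containing the directive above would surface all four repetitions at once, which is roughly how a casual reader scrolling the repository could stumble on the oddity.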

The discovery sent a shock wave through "power user" and machine learning (ML) research circles.

Within hours, the post went viral, not because it revealed a security flaw, but because of its sheer, baffling specificity.

Why had the world’s leading AI laboratory published what Reddit users were quick to call a "restraining order" against pigeons and raccoons?

Goblin speculation abounds

The initial reaction was a chaotic mix of humor and technical skepticism. On Reddit’s r/ChatGPT and r/OpenAI, users began sharing screenshots of GPT-5.5’s behavior before the patch.

Barron Roth, Senior Project Manager for Applied Artificial Intelligence at Google, posting on X under the handle @iamBarronRoth, shared an image of his GPT-5.5-powered OpenClaw agent being visibly "prone to goblins."

Others reported that the model persistently referred to technical errors as "gremlins in the car."

Developers like Sterling Crispin veered into the absurd, jokingly theorizing that the massive water consumption of modern data centers is really needed to cool the "goblins forced to work" inside them.

More seriously, researchers on Hacker News and elsewhere discussed the "pink elephant" problem: in prompt engineering, telling a model not to think about something often makes the concept more salient in its attention machinery.

"Somewhere there’s an OpenAI engineer who had to write 'never mention goblins' into production code, ship it, and go on with their day," one Reddit commenter noted.

The inclusion of "pigeons" and "raccoons" led to wild speculation: was this a defense against a specific data-poisoning attack? Or had the reinforcement trainers simply been "bullied by a raccoon" during a lunch break?

Tensions peaked when Sam Altman, co-founder and CEO of OpenAI, joined the fray on X. On the same day as the find, Altman posted a screenshot of a ChatGPT prompt reading: "Start training GPT-6, you can have the whole cluster. Additional goblins."

Though humorous, the post confirmed that the "goblin" phenomenon was not a localized error but a company-wide story that had reached the highest levels of leadership.

OpenAI comes clean in goblin mode

Yesterday, as the discussion continued on X and wider social media, OpenAI published an official technical explanation titled "Where did the goblins come from?"

The blog post served as a compelling look at the unpredictable nature of Reinforcement Learning from Human Feedback (RLHF) and at how a single aesthetic choice can disrupt a multi-billion-parameter model.

OpenAI found that the "goblin" behavior was not a bug in the traditional sense but a by-product of a newer feature: personality customization, which launched for ChatGPT users in July 2025 and has been maintained and updated since.

Apparently, this feature is not bolted on after the model finishes training; instead, OpenAI makes it part of the end-to-end training pipeline of the core GPT-series models.

This feature lets ChatGPT users and GPT-based developers choose from several modes, such as Professional for formal workplace documents, Friendly for casual conversation, or Efficient for short, technical answers. Other options include Candid, which provides direct feedback; Quirky, which uses humor and creative metaphors; and Cynical, which offers practical advice with a sarcastic, dry edge.

Although these personalities guide general interactions, they do not supersede specific task requirements; for example, a request for a CV or Python code will still meet professional or functional standards regardless of the chosen personality.

The selected personality operates alongside the user’s stored memories and custom instructions, though specific user-defined instructions or stored preferences for a particular tone can override the traits of the selected personality.

On both web and mobile platforms, users can change these settings by going to the Personalization menu under the profile icon and selecting a style from the Main style and tone drop-down menu. Once a change is made, it applies globally to all existing and future conversations. The system is designed to make the AI more useful or enjoyable by tailoring its delivery to individual preferences while maintaining factual accuracy and reliability.

OpenAI reports that the goblin issue actually arose much earlier, during the training of a now-retired experimental personality: "Nerdy," which was meant to be "extraordinarily strange" and "playful."

In the RLHF phase, human trainers (and reward models) were instructed to score highly any responses that used creative, witty, or unpretentious language. Trainers unwittingly began to over-reward metaphors involving fantasy creatures: if the model referred to a stubborn bug as a "gremlin" or to a messy codebase as "goblin treasure," the reward signal went up. The statistics OpenAI provided were staggering:

  • Use of the word "goblin" rose 175% after the launch of GPT-5.1.

  • Mentions of "gremlin" rose 52%.

  • While the "Nerdy" personality accounted for only 2.5% of ChatGPT traffic, it was responsible for 66.7% of all "goblin" mentions.

Mechanics of “transmission” and feedback loops

The most significant finding for the ML community was the confirmed transfer of learned behavior. OpenAI acknowledged that although the rewards were only applied in the "Nerdy" condition, the model "generalized" the preference.

The reinforcement learning process did not neatly compartmentalize the behavior; instead, the model learned that "creature metaphors = high reward" in all contexts. This created a destructive feedback loop:

  1. The model produced a "goblin" metaphor in the Nerdy persona.

  2. It received a high reward.

  3. The model then generated similar metaphors in non-Nerdy contexts.

  4. These "goblin-heavy" outputs were then reused as Supervised Fine-Tuning (SFT) data for later models such as GPT-5.4 and GPT-5.5.

By the time researchers identified the problem, the "goblin bias" was effectively "baked" into the model’s weights.

This explained why GPT-5.5 kept conjuring the creatures even after the "Nerdy" personality was retired in mid-March 2026.
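The loop can be illustrated with a toy simulation. The baseline rate, the reward bonus, and the update rule below are all invented for illustration; this is not OpenAI's actual pipeline, just the shape of the dynamic:

```python
import random

CREATURE_WORDS = {"goblin", "gremlin", "raccoon"}

def reward(response_words):
    # Toy reward model: the whimsy bonus is keyed to the words themselves,
    # not to the persona, so the model can collect it in any context.
    score = 1.0
    if CREATURE_WORDS & set(response_words):
        score += 0.5  # over-rewarded creature metaphor
    return score

def train_generation(creature_rate, rounds=1000, lr=0.002, seed=0):
    """One training cycle: highly rewarded creature metaphors raise the
    model's propensity to use them in the next generation (SFT reuse)."""
    rng = random.Random(seed)
    for _ in range(rounds):
        uses_creature = rng.random() < creature_rate
        words = ["goblin"] if uses_creature else ["bug"]
        r = reward(words)
        if uses_creature:
            # Propensity is nudged up whenever the metaphor paid off.
            creature_rate = min(1.0, creature_rate + lr * (r - 1.0))
    return creature_rate

rate = 0.02  # illustrative baseline frequency of creature metaphors
for gen in ("5.2", "5.3", "5.4", "5.5"):
    rate = train_generation(rate)
    print(f"GPT-{gen}: creature-metaphor rate = {rate:.3f}")
```

Because each generation's outputs seed the next generation's training data, a small stylistic bias compounds instead of washing out, which is exactly the mechanism OpenAI describes.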

How to let goblins run free (if you want)

Because GPT-5.5 had already completed most of its training before the "goblin" root cause was isolated, OpenAI had to resort to the brute-force "system prompt" mitigation that @arb8020 discovered and posted on X.

The company describes the measure as a stopgap until GPT-6 can be trained on a filtered dataset.

To the surprise of the developer community, OpenAI’s blog post included a special command-line script for Codex users who find the goblins "delightful" rather than annoying.

By running a script that uses jq and grep to strip the goblin-suppressing instructions from the model’s cache, users can now effectively "let the creatures run free."
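The published script reportedly relied on jq and grep; here is an equivalent Python sketch, where the cache path, the JSON layout, and the function name are assumptions rather than the real Codex cache format:

```python
import json
from pathlib import Path

# Hypothetical sketch of what the "let the goblins run free" script does:
# strip any instruction that suppresses creature mentions from a cached
# models.json-style file. The cache layout is an assumption, not the
# actual Codex format (the published script used jq and grep instead).
BANNED = ("goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon")

def release_the_goblins(cache_path):
    path = Path(cache_path)
    data = json.loads(path.read_text())
    original = data.get("instructions", [])
    kept = [
        line for line in original
        if not any(word in line.lower() for word in BANNED)
    ]
    data["instructions"] = kept
    path.write_text(json.dumps(data, indent=2))
    return len(original) - len(kept)  # how many directives were stripped
```

As with the real script, this only edits a local cache: the bias is baked into the weights, so removing the suppression prompt simply stops holding the goblins back.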

The blog post also finally explained the specific list of banned animals. A deep search of GPT-5.5’s training data found that "raccoons," "trolls," "ogres," and "pigeons" had become part of the same "lexical family" of tics.

Interestingly, the model’s use of "frog" was found to be mostly legitimate, so it was dropped from the system prompt’s ban list.

What it means for the future of AI research, training and deployment

The "Goblingate" incident of 2026 is more than a humorous anecdote about strange AI behavior; it is a profound illustration of "alignment leakage."

It demonstrates that even models trained with sophisticated RLHF can latch onto a "spurious correlation," mistaking a stylistic quirk for a core performance requirement.

For the AI power-user community, the response has moved beyond mocking the "restraining order" to a darker realization.

If OpenAI can accidentally train its flagship model to be prone to goblins, what other subtle and potentially harmful tendencies are being reinforced through the same feedback loops?

As Andy Berman, CEO of agentic AI orchestration company Runlayer, wrote on X today: "OpenAI rewarded creature metaphors while training one personality. The behavior seeped into every personality. Their fix: a system prompt that says 'never talk about goblins.' RL rewards don't stay where you put them. Neither do agent permissions."

As the technical discussion continues, "Goblingate" remains a prime case study for a new era of behavioral auditing.

The episode led OpenAI to build new tools for auditing model behavior, aimed at ensuring that future models, especially the long-awaited GPT-6, do not inherit their predecessors’ eccentricities.

Whether GPT-6 will truly be free of goblins remains to be seen, but Altman’s "additional goblins" post suggests the industry is now fully aware that the machines are watching what we reward, even when we think we are just "fooling around."


