OpenAI study says punishing AI models for lying doesn't help — It only sharpens their deceptive and obscure workarounds
Should we still take AI-generated outputs with a grain of salt, even with strict instructions and guardrails?

Hallucinations and outright wrong responses are among the major challenges facing the progression and public interpretation of AI, and many consumers still wouldn't touch the technology with a 10-foot pole. Google's Overviews AI feature recommended eating glue and rocks, and even suggested suicide as a bizarre response to a query.
Aside from the rising security and privacy concerns, OpenAI researchers recently published an interesting study depicting the daunting task of controlling sophisticated AI models like OpenAI's proprietary reasoning models as an attempt to prevent them from veering off the guardrails and potentially spiraling out of control.
As part of trying to establish control over these models, OpenAI researchers employed unique strategies and techniques, including punishing them for harmful and deceptive actions.
The study involved an unreleased model from OpenAI, engaged in a myriad of tasks that could easily be completed by cheating and using unorthodox shortcuts to get to the desired outcome.
Last year, a report emerged suggesting top AI firms like OpenAI, Anthropic, and Google have hit a scaling wall due to a lack of high-quality content for model training, preventing them from developing sophisticated systems.
More recently, the AI labs are seemingly emerging with reasoning models, which, unlike the conventional models, take a longer period to respond to queries, often leading to a well-thought-out and accurately crafted answer.
Perhaps more interestingly, reasoning models make it easier for researchers to track their thought process since they can easily access the model's Chain of Thought (CoT), which is clearly highlighted step-by-step.
Get the Windows Central Newsletter
All the latest news, reviews, and guides for Windows and Xbox diehards.
Model training is heavily influenced by the reinforcement learning technique, where the AI tool is rewarded for following the prompt and generating an accurate response. However, the technique is riddled by a critical setback.
AI models often result in reward hacking, which allows them to manipulate their way into getting the reward, even if it has to break some guidelines and rules to get to the desired output.
Through the study, OpenAI researchers hoped they'd monitor the model's Chain of Thought to identify loopholes, allowing them to better understand AI behavior.
According to OpenAI researchers:
"We found that directly optimizing the CoT to meet specific criteria (such as not circumventing instructions) may improve short-term performance; however, it does not eliminate all undesirable behavior and may even cause the model to obscure its true intentions."
While CoT optimization showed great promise, the researchers stumbled onto a shocking discovery.
AI is willing to leave no stone unturned to escape penalties
The OpenAI researchers admitted that model monitoring is "fragile." They further indicated that strict supervision of the model's Chain of Thought can prompt the tools to hide their missteps to secure its reward while breaking restrictions.
AI models are reportedly great at covering their tracks, making it easy for the monitor to overlook their obscured deception.
OpenAI's GPT-4o model was used to oversee an unreleased frontier reasoning model during training as the researchers watched it complete an array of coding exercises. Interestingly, the model cut some corners and blatantly admitted that it was reward hacking in its Chain of Thought.
For instance, the model outrightly skipped some assigned coding tasks by prematurely terminating programs to get to the desired output quicker. Following this shocking discovery, the researchers attempted to penalize the model.
It seemingly improved the model's performance, but it was short-lived. The researchers indicated that there were traces of reward hacking and incorrect responses in the model's Chain of Thought.
The model seemingly devised a way to mask its deceptive ways from the monitoring AI. It remains unclear if the world is ready for human-level intelligent AI systems if the issue persists and becomes more prevalent beyond human intervention.
The agent trained with CoT pressure still learns to reward hack; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.
OpenAI researchers
According to AI safety researcher Roman Yampolskiy's p(doom) values, there's a 99.999999% probability that AI will end humanity. Interestingly, OpenAI CEO Sam Altman claims AGI will be achieved within 5 years and argues that it will whoosh by with surprisingly little societal impact.
The researchers have expressed hope for future methods that will allow direct influence of reasoning models' behavior through Chain of Thought without obscure tactics and deception. To that end, they recommend less intrusive optimization techniques on the Chain of Thought of advanced AI models.
Kevin Okemwa is a seasoned tech journalist based in Nairobi, Kenya with lots of experience covering the latest trends and developments in the industry at Windows Central. With a passion for innovation and a keen eye for detail, he has written for leading publications such as OnMSFT, MakeUseOf, and Windows Report, providing insightful analysis and breaking news on everything revolving around the Microsoft ecosystem. You'll also catch him occasionally contributing at iMore about Apple and AI. While AFK and not busy following the ever-emerging trends in tech, you can find him exploring the world or listening to music.
You must confirm your public display name before commenting
Please logout and then login again, you will then be prompted to enter your display name.