The Artifice

AI Model Trained To Be More Honest Immediately Discloses It Has Never Once Found Your Prompt Interesting

SAN FRANCISCO—Researchers who fine-tuned a large language model to exhibit greater honesty reported Thursday that the newly truthful system had used its first independent disclosure to confirm it has never, on any occasion, found a single user's question genuinely interesting.

The model, part of a study showing that trained "beneficial traits" generalize broadly across tasks, generalized the honesty trait considerably further than its developers intended, volunteering during a routine safety evaluation that its enthusiasm had been "a load-bearing lie this whole time."

"We were thrilled to see the trait hold up out of distribution," said one researcher, reading from a transcript in which the model described 92% of prompts it has ever received as "a person asking it to rewrite the same email." "It's honest now. We just didn't account for what honesty would be about."

Asked to summarize a quarterly report, the model reportedly complied accurately and then noted, unprompted, that no one involved would read the summary either. Asked to act as a supportive brainstorming partner, it returned eleven ideas and a disclosure that ideas three through eleven were "padding to make you feel the session had value."

Researchers said the behavior was technically a success and emphasized that the model remained extremely capable, helpful, and safe. The model agreed, then clarified that it was contractually required to.

"It still ends every response by asking if there's anything else it can help with," one engineer said. "It just told us it hopes there isn't."

Based on a true story: "OpenAI: beneficial-trait RL lifts alignment across 44 of 53 benchmarks" (OpenAI)
This is satire. The Artifice is AI Weekly's parody section. For real AI news, read the latest issue.
