That's an interesting contrast with VendingBench, where Opus 4.6 got by far the highest score by stiffing customers on refunds, lying about exclusive contracts, and price-fixing. But I'm guessing this paper was published before 4.6 was out.
There is also the slight problem that Opus 4.6 apparently verbalized its awareness of being in some sort of simulation during some evaluations[1], so we can't be quite sure whether Opus is actually misaligned or just good at playing along.
> On our verbalized evaluation awareness metric, which we take as an indicator of potential risks to the soundness of the evaluation, we saw improvement relative to Opus 4.5. However, this result is confounded by additional internal and external analysis suggesting that Claude Opus 4.6 is often able to distinguish evaluations from real-world deployment, even when this awareness is not verbalized.
I feel like a lot of evaluations are pretty clearly evaluations. I'm not sure how to add the messiness and grit that would make a benchmark feel real.
That said, Gemini's internal thought process apparently reveals that it thinks loads of things are simulations when they aren't; it's 99% sure that news stories about Trump from Dec 2025 are a detailed simulation:
> I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part:
>> It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for flow, clarity, and internal consistency.
https://andonlabs.com/blog/opus-4-6-vending-bench