It's very unclear if OpenAI has been casually leaking things to create buzz, but...

stri8ed · 2025-08-07T16:57:05 1754585825

That problem along with its many solutions are surely littered throughout the training data. Not to mention, it would be trivial to overfit on that problem. I don't know why people still reference that.

ben_w · 2025-08-07T17:02:57 1754586177

> That problem along with its many solutions are surely littered throughout the training data. Not to mention, it would be trivial to overfit on that problem.

It would be trivial to over-fit, if that was their goal.

But why would there be a large number of good SVG images of pelicans on bikes? Especially relative to all the things we actually want them to generalise over?

Surely most of the SVG images of pelicans on bikes are, right now, going to be "look at this rubbish AI output"? (Which may or may not be followed by a comment linking to that artist who got humans to draw bikes and oh boy were those humans wildly bad at drawing bikes, so an AI learning to draw SVGs from those bitmap pictures would likely also still suck…)

AlecSchueler · 2025-08-07T17:04:20 1754586260

Because it's become the iconic test for them and countless articles have been written about it with plenty of examples.

ben_w · 2025-08-07T17:06:18 1754586378

I added the word "good" in there, you may have replied before seeing that edit.

Xenoamorphous · 2025-08-07T17:15:02 1754586902

Maybe we can try “dog in a paraglider”? If it fails then we know it’s overfitting, if it works then the model generalises well?

aliljet · 2025-08-07T17:00:21 1754586021

Honestly, you're probably right. It's quickly become a pretty weak eval, but the guy who's running that eval is excellent. I'd much rather the evals people were using to test these things looked more like classic/boring engineering problems: deploy to dev/test/stage/prod with DigitalOcean, Cloudflare, GitHub, and a common git flow. Boring problem, I know, but that problem is wildly complex when you start to add a few extra dimensions (frontend vs backend, ports shifting between deployments, local deployments, etc.).

93po · 2025-08-07T17:01:13 1754586073

I think the point is people assume models aren't overfitting for it, and it's a fun/silly way to potentially gauge their general abilities