Holy misleading graph Batman/Altman!

The academic benchmark score improves by only 5%, but they drew the bar 50% taller.


Which graph? There are dozens of graphs on the page.

they had a bad graph on stream which they have since fixed. i wouldn't get too upset about a simple error

our hands on review: https://www.latent.space/p/gpt-5-review

basically, in my testing it really felt like gpt5 was "using tools to think" rather than just "using tools". it gets very powerful when coding long-horizon tasks (a separate post i'm publishing later).

to give one substantive example: in my developer beta (they will release the video in a bit) i put it on a task that claude code had been stuck on for the last week - same prompts - and it just added logging to instrument some of the failures we were seeing and, from the logs it added and asked me to rerun, figured out the fix.


I was just skimming that review while watching the livestream. Right as they mentioned how much better GPT-5 is at writing prose, I skimmed across:

> It’s actually worse at writing than GPT-4.5

Sounds like we need to wait a bit for the dust to settle before we can trust anything we hear/read :)


This is why you need to have your own set of personal benchmarks. I have a few short stories that I have models continue (ones I wrote ages ago in my youth) or refactor. Some models are fantastic at writing but miss key details or enmesh them (Claude). Some are terrible writers at higher reasoning (o3). Some are decent writers but tend to provide very short outputs (gpt-4o). For my personal benchmarks Gemini 2.5 Pro has always generated the most compelling writing that _also_ sticks to the world/script -- and sometimes surprises me by having characters react in ways that I hadn't considered but are consistent with their "worldview" as presented by the context (usually a world guide).
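For what it's worth, the harness doesn't need to be fancy. A minimal sketch of how the continuation runs could look, assuming an OpenAI-compatible Python client; the model names, file path, and prompt are placeholders, and the only "scoring" is reading the outputs side by side:

    # Run the same story-continuation prompt against several models
    # and save each output for manual side-by-side judging.
    from openai import OpenAI

    client = OpenAI()
    MODELS = ["gpt-5", "o3", "gpt-4o"]  # placeholder model list
    story = open("my_old_story.txt").read()  # placeholder path

    prompt = (
        "Continue this short story for about 500 words, staying "
        "consistent with the established characters and world:\n\n" + story
    )

    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # No automated metric: dump each continuation to a file.
        with open(f"continuation_{model}.txt", "w") as f:
            f.write(resp.choices[0].message.content)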

I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.


> I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.

Yeah, I also do my own benchmarks, but as you said, the coding ones are a bit harder. Currently I'm mostly benchmarking the accuracy of the tools I've written, which do bite-sized work: one tool edits parts of files, another rewrites files fully, and so on, and each is individually benchmarked. But they're very specific; I do no overall benchmark for "Change feature X to do Y" which would span a whole "session". I haven't found any good way of evaluating those results, just like you :)
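Concretely, each per-tool benchmark boils down to something like this minimal sketch, assuming a tool is a function from (file text, instruction) to new file text; the harness, case format, and example case are simplified stand-ins for what I actually run:

    from typing import Callable

    # A tool maps (file_text, instruction) -> new_file_text.
    Tool = Callable[[str, str], str]

    def benchmark_tool(tool: Tool, cases: list[dict]) -> float:
        """Run one tool over bite-sized cases; return exact-match accuracy."""
        passed = 0
        for case in cases:
            result = tool(case["input"], case["instruction"])
            if result.strip() == case["expected"].strip():
                passed += 1
        return passed / len(cases)

    # Example case for a hypothetical partial-file-edit tool:
    edit_cases = [
        {
            "input": "def add(a, b):\n    return a - b\n",
            "instruction": "Fix the bug in add().",
            "expected": "def add(a, b):\n    return a + b\n",
        },
    ]

Exact match is crude, but it works for bite-sized edits; anything fuzzier would need a softer comparison.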


Here’s one of my favourite questions. Every single model prior to this one, including 4o and o3, used to fumble it. https://photos.app.goo.gl/zzfKKqDYFtMAP9Vb8

I have yet to test this out end to end.


well, it's difficult to trust the people selling it in the first place. They're too biased not to lie

It's difficult to get a man to understand something when his salary depends on his not understanding it


I don’t think that’s a valid excuse. Yes marketing speak has always existed, but it’s not like companies have always been completely unreliable.

I found it strange that, even though my excitement for such an event is roughly equivalent to WWDC these days, I had zero desire to watch the live stream, for exactly this reason: it’s not like they’re going to give us anything straight.

Even this year’s WWDC I only skipped through the video afterwards, whereas before I used to host watch parties. Yes, they’re overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is that I get from these (applicable to OpenAI, Grok, Meta, etc.)

It’s been just a few years of a revolutionary technology, and already the livestreams are less appealing than the biggest corporations’ yearly events. Personally I find that sad.


It has been pointed out, but GPT-5 is really five different models with wildly different capabilities under the hood. Which model gets picked for the task at hand is, for now, not deterministic.

better than 4o but worse than 4.5 is internally consistent. and ofc writing is extremely multidimensional.

But that’s not what the review says:

“It’s actually worse at writing than GPT-4.5, and I think even 4o”

So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.


Hi Swyx, I always appreciate your insights. Something you wrote really resonated with a personal theory I've been developing:

>"While I never use AI for personal writing (because I have a strong belief in writing to think)"

The optimal AI productivity process is starting to look like:

AI Generates > Human Validates > Loop

Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.

Similar to how physical activity is what makes muscles/bone density/etc grow, and what maintains body tissues.

Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.

AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.

At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).

Soon we will need to be intentional about building/maintaining cognitive strength.

I suspect the workday/week of the future will be split between AI-on-a-leash work for optimal productivity and carve-outs dedicated to AI-enhanced learning, solely for building/maintaining cognitive health (where productivity is not the goal; building/maintaining cognition is). Similar to how we carve out time for working out.

What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?

Is there a name for this theory?

If not can you coin one? You're great at that :)


This is very interesting - I like the way you’ve explained this.

The parallel with “intentionally working out to maintain physical strength” is extremely helpful as an analogy to communicate this concept.

That’s exactly what we might be faced with… cognitive atrophy…

It’s arguably already started, and is accelerating!


thanks very much :)

problem with your theory is it bundles 2-3 steps, each of which could be its own thesis

suggest you nail those down before building up to a general bundle (or mental model/framework)


Ah, I probably should have listed some of the "assumptions" I'm developing it on top of:

1) Regarding the "generation is how learning occurs" claim, I'm going off of this:

https://www.learningscientists.org/blog/2024/3/7/how-does-re...

Granted, that article refers to retrieval specifically being one major way we learn, and of course learning incorporates many dimensions. But it seems a bit self-evident that retrieval occurs heavily during active problem solving (i.e. "generation"), and less so during passive learning (i.e. just reading/consuming info).

From personal experience, I always noticed I learned much more by doing than by consuming documentation alone.

But yes, I admit this assumption and my own personal experience/bias are doing a lot of the heavy lifting for me...

2) Regarding the "optimal AI productivity process" (AI Generates > Human Validates > Loop)

I'm using Karpathy's productivity loop described in his AI startup school talk last month here:

https://youtu.be/LCEmiRjPEtQ?t=1327

Does this help make it more concrete, Swyx? (Name-dropping you here since I'm pretty sure you've got a social listener set for your handle ;) I'd love to hear your thoughts straight from the hip, based on your own personal experiences.

Full disclosure: I'm not trying to get too academic about this. In all honesty, I'm really trying to get to an informal theory that's useful and practical enough that it can be turned into a regular business process for rapid professional development.


“I think GPT-5 is the closest to AGI we’ve ever been”

Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.


I found a Hacker News thread via Google a few days ago. One of the top comments was from someone describing their RAG architecture and a certain technique (my search term). The comment boasted that their system was so good that their team thought they had created something close to AGI.

Then I noticed the date on the comment: 2023.

Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. It’s technically correct, since we’re not moving backward. It’s just not a very meaningful statement.


> Technically, every advancement in the space is “the closest to AGI that we’ve ever been”

By that standard, Neolithic tool use was progress toward AGI.


Technically correct

"It's the best iPhone we ever made."

AGI, like AI before it, has been co-opted into a marketing term. Most of the time, outside of sci-fi, what people mean when they say AGI is "a profitable LLM".

In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”


SamA isn't an idiot. When he says AGI he wants you to think of Asimov style AI. Don't run defense for billionaire grifters.

I was not trying to defend him. I'm very annoyed at how these words are being intentionally abused; they chose to recycle the term rather than create a new one, precisely to create this confusion. It's still important to know what the grifters mean.

Same marketing BS as "the best iPhone ever!". Well, duh, if your new version (of hardware/software) isn't better, what's the deal then?

Great writeup. I particularly like the idea of splitting your tools into four buckets (quick sketch below):

1) Internal Retrieval
2) Web Search
3) Code Interpreter
4) Actions
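To make the buckets concrete, here's a minimal sketch of tagging tools by bucket; the enum and the tool names are purely illustrative, not something from the post:

    from enum import Enum

    class ToolBucket(Enum):
        INTERNAL_RETRIEVAL = "internal_retrieval"  # private docs, vector stores
        WEB_SEARCH = "web_search"                  # live internet queries
        CODE_INTERPRETER = "code_interpreter"      # sandboxed code execution
        ACTIONS = "actions"                        # side-effecting API calls

    # Illustrative registry tagging each tool with its bucket.
    TOOL_REGISTRY = {
        "search_company_docs": ToolBucket.INTERNAL_RETRIEVAL,
        "web_search": ToolBucket.WEB_SEARCH,
        "run_python": ToolBucket.CODE_INTERPRETER,
        "create_calendar_event": ToolBucket.ACTIONS,
    }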

How did you come up with this idea?


i've been pushing the idea of "the Big 3" tools (https://news.smol.ai/issues/25-05-27-mistral-agents and https://www.latent.space/p/agent) and Ben added a 4th

Re the substantive example, doesn't Claude Code already do this? When I'm vibe-coding scripts or even mobile apps, cc can be pretty aggressive about adding targeted logging to solve specific issues.

well idk, for me it didn't and gpt5 did haha


The coding examples link returns a 404.


Aaand hugged to death.

edit:

livestream here: https://www.youtube.com/live/0Uu_VJeVVfo



