Holy misleading graph Batman/Altman!

The academic benchmark score improves by only 5%, but they drew the bar 50% taller.


Which graph? There are dozens of graphs on the page.

they had a bad graph on stream which they have since fixed. i wouldn't get too upset about a simple error

our hands on review: https://www.latent.space/p/gpt-5-review

basically, in my testing it really felt like gpt5 was "using tools to think" rather than just "using tools". it gets very powerful when coding long-horizon tasks (a separate post i'm publishing later).

to give one substantive example: in my developer beta (they will release the video in a bit) i put it on a task that claude code had been stuck on for the last week - same prompts - and it just added logging to instrument some of the failures we were seeing and, from the logs it added and asked me to rerun, figured out the fix.


I was just skimming that review while watching the livestream. Right as they mentioned how much better GPT-5 is at writing prose, I skimmed across:

> It’s actually worse at writing than GPT-4.5

Sounds like we need to wait a bit for the dust to settle before we can trust anything we hear/read :)


This is why you need to have your own set of personal benchmarks. I have a few short stories that I have models continue (ones I wrote ages ago in my youth) or refactor. Some models are fantastic at writing but miss key details or enmesh them (Claude). Some are terrible writers at higher reasoning (o3). Some are decent writers but tend to provide very short outputs (gpt-4o). For my personal benchmarks Gemini 2.5 Pro has always generated the most compelling writing that _also_ sticks to the world/script -- and sometimes surprises me by having characters react in ways that I hadn't considered but are consistent with their "worldview" as presented by the context (usually a world guide).
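For what it's worth, the harness doesn't need to be fancy. A minimal sketch of how the continuation runs could look, assuming an OpenAI-compatible Python client; the model names, file path, and prompt are placeholders, and the only "scoring" is reading the outputs side by side:

    # Run the same story-continuation prompt against several models
    # and save each output for manual side-by-side judging.
    from openai import OpenAI

    client = OpenAI()
    MODELS = ["gpt-5", "o3", "gpt-4o"]  # placeholder model list
    story = open("my_old_story.txt").read()  # placeholder path

    prompt = (
        "Continue this short story for about 500 words, staying "
        "consistent with the established characters and world:\n\n" + story
    )

    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # No automated metric: dump each continuation to a file.
        with open(f"continuation_{model}.txt", "w") as f:
            f.write(resp.choices[0].message.content)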

I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.


> I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.

Yeah, I also do my own benchmarks, but as you said, the coding ones are a bit harder. Currently I'm mostly benchmarking the accuracy of the tools I've written, which do bite-sized work: one tool edits parts of files, another rewrites files fully, and so on, and each is individually benchmarked. But they're very specific; I do no overall benchmark for "Change feature X to do Y" which would span a whole "session". I haven't found any good way of evaluating those results, just like you :)
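Concretely, each per-tool benchmark boils down to something like this minimal sketch, assuming a tool is a function from (file text, instruction) to new file text; the harness, case format, and example case are simplified stand-ins for what I actually run:

    from typing import Callable

    # A tool maps (file_text, instruction) -> new_file_text.
    Tool = Callable[[str, str], str]

    def benchmark_tool(tool: Tool, cases: list[dict]) -> float:
        """Run one tool over bite-sized cases; return exact-match accuracy."""
        passed = 0
        for case in cases:
            result = tool(case["input"], case["instruction"])
            if result.strip() == case["expected"].strip():
                passed += 1
        return passed / len(cases)

    # Example case for a hypothetical partial-file-edit tool:
    edit_cases = [
        {
            "input": "def add(a, b):\n    return a - b\n",
            "instruction": "Fix the bug in add().",
            "expected": "def add(a, b):\n    return a + b\n",
        },
    ]

Exact match is crude, but it works for bite-sized edits; anything fuzzier would need a softer comparison.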


Here’s one of my favourite questions. Every single model prior to this one, including 4o and o3, used to fumble it. https://photos.app.goo.gl/zzfKKqDYFtMAP9Vb8

I have yet to test this out end to end.


well, it's difficult to trust the people selling it in the first place. They're too biased not to lie

It's difficult to get a man to understand something when his salary depends on his not understanding it


I don’t think that’s a valid excuse. Yes marketing speak has always existed, but it’s not like companies have always been completely unreliable.

I found it strange that, even though my excitement for such an event is roughly equivalent to WWDC these days, I had zero desire to watch the live stream, for exactly this reason: it’s not like they’re going to give us anything straight.

Even this year’s WWDC I only skipped through the video afterwards, whereas before I used to host watch parties. Yes, they’re overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is that I get from these (applicable to OpenAI, Grok, Meta, etc.)

It’s been just a few years of a revolutionary technology, and already the livestreams are less appealing than the biggest corporations’ yearly events. Personally I find that sad.


It has been pointed out, but GPT-5 is really five different models with wildly different capabilities under the hood. Which model gets picked for the task at hand is, for now, not deterministic.

better than 4o but worse than 4.5 is internally consistent. and ofc writing is extremely multidimensional.

But that’s not what the review says:

“It’s actually worse at writing than GPT-4.5, and I think even 4o”

So the review is not consistent with the PR, hence the commenter expressing preference for outside sources.


Hi Swyx, I always appreciate your insights. Something you wrote really resonated with a personal theory I've been developing:

>"While I never use AI for personal writing (because I have a strong belief in writing to think)"

The optimal AI productivity process is starting to look like:

AI Generates > Human Validates > Loop

Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.

Similar to how physical activity is what makes muscles/bone density/etc grow, and what maintains body tissues.

Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.

AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.

At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).

Soon we will need to be intentional about building/maintaining cognitive strength.

I suspect the workday/week of the future will be split between AI-on-a-leash work for optimal productivity and carve-outs dedicated to AI-enhanced learning, solely for building/maintaining cognitive health (where productivity is not the goal; building/maintaining cognition is). Similar to how we carve out time for working out.

What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?

Is there a name for this theory?

If not can you coin one? You're great at that :)


This is very interesting - I like the way you’ve explained this.

The parallel with “intentionally working out to maintain physical strength” is extremely helpful as an analogy to communicate this concept.

That’s exactly what we might be faced with… cognitive atrophy…

It’s arguably already started, and is accelerating!


thanks very much :)

problem with your theory is it bundles 2-3 steps, each of which could be its own thesis

suggest you nail those down before building up to a general bundle (or mental model/framework)


Ah, I probably should have listed some of the "assumptions" I'm developing it on top of:

1) Regarding the "generation is how learning occurs" claim, I'm going off of this:

https://www.learningscientists.org/blog/2024/3/7/how-does-re...

Granted, that article refers to retrieval specifically being one major way we learn, and of course learning incorporates many dimensions. But it seems a bit self-evident that retrieval occurs heavily during active problem solving (i.e. "generation"), and less so during passive learning (i.e. just reading/consuming info).

From personal experience, I always noticed I learned much more by doing than by consuming documentation alone.

But yes, I admit this assumption and my own personal experience/bias are doing a lot of the heavy lifting for me...

2) Regarding the "optimal AI productivity process" (AI Generates > Human Validates > Loop)

I'm using Karpathy's productivity loop described in his AI startup school talk last month here:

https://youtu.be/LCEmiRjPEtQ?t=1327

Does this help make it more concrete, Swyx? (Name-dropping you here since I'm pretty sure you've got a social listener set for your handle ;) I'd love to hear your thoughts straight from the hip, based on your own personal experiences.

Full disclosure: I'm not trying to get too academic about this. In all honesty, I'm really trying to get to an informal theory that's useful and practical enough that it can be turned into a regular business process for rapid professional development.


“I think GPT-5 is the closest to AGI we’ve ever been”

Sorry, but this sounds like overly sensational marketing speak and just leaves a bad taste in the mouth for me.


I found a Hacker News thread via Google a few days ago. One of the top comments was from someone describing their RAG architecture and a certain technique (my search term). The comment boasted that their system was so good that their team thought they had created something close to AGI.

Then I noticed the date on the comment: 2023.

Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. It’s technically correct, since we’re not moving backward. It’s just not a very meaningful statement.


> Technically, every advancement in the space is “the closest to AGI that we’ve ever been”

By that standard, Neolithic tool use was progress toward AGI.


Technically correct

"It's the best iPhone we ever made."

AGI, like AI before it, has been co-opted into a marketing term. Most of the time, outside of sci-fi, what people mean when they say AGI is "a profitable LLM".

In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”


SamA isn't an idiot. When he says AGI he wants you to think of Asimov style AI. Don't run defense for billionaire grifters.

I was not trying to defend him. I'm very annoyed at how these words are being intentionally abused; they chose to recycle the term rather than create a new one, precisely to create this confusion. It's still important to know what the grifters mean.

Same marketing BS as "the best iPhone ever!". Well, duh, if your new version (of hardware/software) isn't better, what's the deal then?

Great writeup. I particularly like the idea of splitting your tools into four buckets (quick sketch below):

1) Internal Retrieval
2) Web Search
3) Code Interpreter
4) Actions
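To make the buckets concrete, here's a minimal sketch of tagging tools by bucket; the enum and the tool names are purely illustrative, not something from the post:

    from enum import Enum

    class ToolBucket(Enum):
        INTERNAL_RETRIEVAL = "internal_retrieval"  # private docs, vector stores
        WEB_SEARCH = "web_search"                  # live internet queries
        CODE_INTERPRETER = "code_interpreter"      # sandboxed code execution
        ACTIONS = "actions"                        # side-effecting API calls

    # Illustrative registry tagging each tool with its bucket.
    TOOL_REGISTRY = {
        "search_company_docs": ToolBucket.INTERNAL_RETRIEVAL,
        "web_search": ToolBucket.WEB_SEARCH,
        "run_python": ToolBucket.CODE_INTERPRETER,
        "create_calendar_event": ToolBucket.ACTIONS,
    }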

How did you come up with this idea?


i've been pushing the idea of "the Big 3" tools (https://news.smol.ai/issues/25-05-27-mistral-agents and https://www.latent.space/p/agent) and Ben added a 4th

Re the substantive example, doesn't Claude Code already do this? When I'm vibe-coding scripts or even mobile apps, cc can be pretty aggressive about adding targeted logging to solve specific issues.

well idk, for me it didn't and gpt5 did haha


The coding examples link returns a 404.


Aaand hugged to death.

edit:

livestream here: https://www.youtube.com/live/0Uu_VJeVVfo



