basically in my testing i really felt that gpt-5 was "using tools to think" rather than just "using tools". it gets very powerful on long-horizon coding tasks (a separate post i'm publishing later).
to give one substantive example: in my developer beta (they will release the video in a bit), i put it on a task that claude code had been stuck on for the last week, with the same prompts. it just added logging to instrument some of the failures we were seeing, and then, from the logs it added and asked me to rerun, figured out the fix.
I was just skimming through that review while watching the livestream. Right as they mentioned how much better GPT-5 is at writing prose, I skimmed across:
> It’s actually worse at writing than GPT-4.5
Sounds like we need to wait a bit for the dust to settle before one can trust anything one hears/reads :)
This is why you need to have your own set of personal benchmarks. I have a few short stories that I have models continue (ones I wrote ages ago in my youth) or refactor. Some models are fantastic at writing but miss key details or conflate them (Claude). Some are terrible writers at higher reasoning effort (o3). Some are decent writers but tend to produce very short outputs (gpt-4o). For my personal benchmarks, Gemini 2.5 Pro has always generated the most compelling writing that _also_ sticks to the world/script -- and sometimes surprises me by having characters react in ways I hadn't considered but that are consistent with their "worldview" as presented by the context (usually a world guide).
I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.
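For anyone who wants to try the same thing, the writing harness is dead simple. A minimal sketch, assuming an OpenAI-compatible client; the model list, prompt, and file paths are placeholders for whatever you actually test:

    # Minimal personal writing benchmark: the same continuation prompt
    # across models, outputs saved for manual side-by-side reading.
    # Assumes the official openai package and an API key in the environment;
    # model names and paths are illustrative, not a recommendation.
    from openai import OpenAI

    client = OpenAI()
    story = open("my_story_fragment.txt").read()  # your old short story
    prompt = (
        "Continue this short story for about 500 words. Stay consistent "
        "with the characters and the attached world guide.\n\n" + story
    )

    for model in ["gpt-5", "o3", "gpt-4o"]:  # swap in whatever you compare
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Judging stays manual: read side by side for adherence to the
        # world/script, not just prose quality.
        with open(f"out_{model}.txt", "w") as f:
            f.write(resp.choices[0].message.content)

The point isn't automation; it's a fixed, personally meaningful prompt set, so comparisons between models aren't vibes-only.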
> I find coding to be harder to benchmark because there are so many ways to write the same solution. A "correct" solution may be terrible in another context due to loss of speed, security, etc.
Yeah, I also do my own benchmarks, but as you said, the coding ones are a bit harder. Currently I'm mostly benchmarking the accuracy of the tools I've written, which do bite-sized work: one tool edits parts of files, another rewrites files fully, and so on, and each is benchmarked individually. But they're very specific; I don't do an overall benchmark for "Change feature X to do Y" that would span a whole session. I haven't found any good way of evaluating those results, just like you :)
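Concretely, each per-tool benchmark is shaped roughly like this (a sketch: edit_file is a hypothetical stand-in for whichever tool is under test, and the fixtures and threshold are made up):

    # Per-tool accuracy benchmark: run the tool against hand-maintained
    # fixtures and diff the result against an expected output.
    # edit_file() is hypothetical; substitute your own tool entry point.
    import difflib

    CASES = [
        # (fixture path, instruction, expected output path)
        ("fixtures/config.py", "rename MAX_RETRIES to RETRY_LIMIT",
         "expected/config.py"),
    ]

    def similarity(actual: str, expected: str) -> float:
        # Ratio in [0, 1]; exact string equality also works for strict tools.
        return difflib.SequenceMatcher(None, actual, expected).ratio()

    passed = 0
    for fixture, instruction, expected_path in CASES:
        actual = edit_file(fixture, instruction)  # tool under test
        expected = open(expected_path).read()
        passed += similarity(actual, expected) > 0.99
    print(f"{passed}/{len(CASES)} cases passed")

It evaluates each bite-sized tool in isolation, which is exactly why it says nothing about a whole "Change feature X to do Y" session.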
I don’t think that’s a valid excuse. Yes marketing speak has always existed, but it’s not like companies have always been completely unreliable.
I found it strange that, even though my excitement for an event like this is roughly equivalent to WWDC these days, I had zero desire to watch the livestream, for exactly this reason: it’s not like they’re going to give it to us straight.
Even for this year’s WWDC I at least skipped through the video afterwards; before, I used to have watch parties. Yes, they’re overly positive and paint everything in a good light, but they never felt… idk, whatever the vibe is I get from these (applicable to OpenAI, Grok, Meta, etc.).
It’s been just a few years of a revolutionary technology, and already the livestreams are less appealing than the biggest corporations’ yearly events. Personally, I find that sad.
It has been pointed out, but GPT-5 is really five different models with wildly different capabilities under the hood. Which model gets picked for the task at hand is, for now, not deterministic.
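To be fair, the non-determinism is a ChatGPT-router thing; over the API you can pin a variant explicitly. A minimal sketch, assuming the model names from the launch docs (gpt-5 / gpt-5-mini / gpt-5-nano):

    # Bypass the ChatGPT router by naming a specific GPT-5 variant.
    # Assumes the openai Python package; model names are from the launch docs.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5-mini",  # or "gpt-5", "gpt-5-nano"
        messages=[{"role": "user", "content": "One-line summary of this PR."}],
    )
    print(resp.choices[0].message.content)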
Hi Swyx, I always appreciate your insights. Something you wrote really resonated with a personal theory I've been developing:
>"While I never use AI for personal writing (because I have a strong belief in writing to think)"
The optimal AI productivity process is starting to look like:
AI Generates > Human Validates > Loop
Yet cognitive generation is how humans learn and develop cognitive strength, as well as how they maintain such strength.
Similar to how physical activity is what grows muscles/bone density/etc., and what keeps body tissues maintained.
Physical technology freed us from hard physical labor that kept our bodies in shape -- at a cost of physical atrophy.
AI seems to have a similar effect for our minds. AI will accelerate our cognitive productivity, and allow for cognitive convenience -- at a cost of cognitive atrophy.
At present we must be intentional about building/maintaining physical strength (dedicated strength training, cardio, etc).
Soon we will need to be intentional about building/maintaining cognitive strength.
I suspect the workday/week of the future will be split between AI-on-a-leash work for optimal productivity and carve-outs dedicated to AI-enhanced learning solely for building/maintaining cognitive health (where productivity is not the goal; building/maintaining cognition is). Similar to how we carve out time for working out.
What are your thoughts on this? Based on what you wrote above, it seems you have similar feelings?
Granted, that article refers specifically to retrieval being one major way we learn, and of course learning incorporates many dimensions. But it seems a bit self-evident that retrieval occurs heavily during active problem solving (i.e., "generation") and less so during passive learning (i.e., just reading/consuming info).
From personal experience, I always noticed I learned much more by doing than by consuming documentation alone.
But yes, I admit this assumption and my own personal experience/bias are doing a lot of heavy lifting for me...
2) Regarding the "optimal AI productivity process" (AI Generates > Human Validates > Loop)
I'm using Karpathy's productivity loop described in his AI startup school talk last month here:
Does this help make it more concrete, Swyx? (Name-dropping you here since I'm pretty sure you've got a social listener set for your handle ;) I'd love to hear your thoughts straight from the hip, based on your own personal experiences.
Full disclosure: I'm not trying to get too academic about this. In all honesty, I'm really trying to get to an informal theory that's useful and practical enough to be turned into a regular business process for rapid professional development.
I found a Hacker News thread via Google a few days ago. One of the top comments was from someone describing their RAG architecture and a certain technique (my search term). The comment boasted that their system was so good that their team thought they had created something close to AGI.
Then I noticed the date on the comment: 2023.
Technically, every advancement in the space is “the closest to AGI that we’ve ever been”. That's correct, since we're not moving backward; it's just not a very meaningful statement.
AGI, like AI before it, has been coopted into a marketing term. Most of the time, outside of sci-fi, what people mean when they say AGI is "a profitable LLM".
In the words of OpenAI: “AGI is defined as highly autonomous systems that outperform humans at most economically valuable work”
I was not trying to defend him. I'm very annoyed at how these words are being intentionally abused; they chose to recycle the term rather than create a new one, precisely to create this confusion. It's still important to know what the grifters mean.
Re the substantive example: doesn't Claude Code already do this? When I'm vibe-coding scripts or even mobile apps, CC can be pretty aggressive about adding targeted logging to solve specific issues.
Livestream link: https://www.youtube.com/live/0Uu_VJeVVfo
Research blog post: https://openai.com/index/introducing-gpt-5/
Developer blog post: https://openai.com/index/introducing-gpt-5-for-developers
API Docs: https://platform.openai.com/docs/guides/latest-model
Note the free-form function calling documentation: https://platform.openai.com/docs/guides/function-calling#con...
GPT-5 prompting guide: https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_g...
GPT-5 new params and tools (a quick sketch of these after the list): https://cookbook.openai.com/examples/gpt-5/gpt-5_new_params_...
GPT-5 frontend cookbook: https://cookbook.openai.com/examples/gpt-5/gpt-5_frontend
Prompt migrator/optimizer: https://platform.openai.com/chat/edit?optimize=true
Enterprise blog post: https://openai.com/index/gpt-5-new-era-of-work
System Card: https://openai.com/index/gpt-5-system-card/
What would you say if you could talk to a future OpenAI model? https://progress.openai.com/
Coding examples: https://github.com/openai/gpt-5-coding-examples
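And a minimal sketch of the new params in action, based on my reading of the cookbook above (verify the exact parameter names against the guide):

    # GPT-5 via the Responses API with the two new controls: a "minimal"
    # reasoning effort level and a verbosity setting for output length.
    # Parameter shapes follow the cookbook; double-check there before use.
    from openai import OpenAI

    client = OpenAI()
    resp = client.responses.create(
        model="gpt-5",
        input="Explain what the GPT-5 router does, in one paragraph.",
        reasoning={"effort": "minimal"},  # new level below low/medium/high
        text={"verbosity": "low"},        # output-length control
    )
    print(resp.output_text)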