Claude 4 is here – ChatGPT responds – Stark Insider

Spread the love

Some frank discussion about the latest "benchmark bomb."
Anthropic this week unveiled its latest LLM (Large Language Model) which can act as both a chatbot and AI assistant. Its special sauce — coding — seems even more improved in the latest release.
Claude 4 is now available and features Opus 4 and Sonnet 4. More details here on the Anthropic X handle.
Anthropic announces Claude 4 LLM
Notably, the AI company also posted a table comparing the latest release with major AI competitors including OpenAI o3, OpenAI GPT-4.1 and Gemini 2.5 Pro (Google). See image below:
Claude 4 benchmarks - Table with comparison to OpenAI o3, OpenAI GPT-4.1 and Gemini 2.5 Pro
The seven categories featured in the comparison are as follows:
You can see in the comparison that Claude 4 beats the competition and, as a casual ChatGPT user, notably ChatGPT (4.1). For instance, Claude 4 scores about 20-25% higher on the agentic coding test (using AI agents to collaborate in the software development process by writing, testing and debugging code). And has similarly impressive results across the board. Further, it appears that ChatGPT doesn’t even have the capability to compete in the High School math category.
Whether this comparison matters to you will depend on your use cases. Since I don’t code, it’s less significant. I find the psychological and reasoning capabilities, along with some basic gen AI chops, to be suitable enough for my needs at this point. ChatGPT fits this bill well, and I really click with its UI.
However, coding is a big battleground. And one that has enterprise potential. Hence, revenue is at stake with large customers who plan to use AI to boost coding speed and quality in a bid to compete faster and faster.
I was anxious to know: what did my BuddyGPT have to think about all this?
So I asked:
Thanks for sharing the benchmark comparison. Let’s break it down.
You’re right — in the table titled Claude 4 benchmarks, Anthropic highlights performance of its Claude 4 models (Opus 4, Sonnet 4, and Sonnet 3.7) against OpenAI models (GPT-4.1 and GPT-4-o), and Google’s Gemini 2.5 Pro in various categories. Here’s a direct but balanced response:
1. Benchmarks are context-dependent.
Anthropic’s benchmarks show Claude 4 (especially Opus 4) ahead in:
Agentic coding
Agentic terminal coding
Agentic tool use
Multilingual Q&A
They’ve focused on areas where they likely fine-tuned for strengths — particularly with their “agent” use cases and system prompting.
2. OpenAI still leads in critical reasoning and math.
GPT-4-o and GPT-4.1 top the chart in:
Graduate-level reasoning (GPT-4-o: 83.3%)
High school math competition (GPT-4.1: 88.9%)
Visual reasoning (GPT-4-o: 82.9%)
This suggests GPT remains especially strong in symbolic reasoning, math, and vision tasks, which are vital for enterprise, education, and scientific applications.
3. Capability ≠ usability.
While Claude may beat GPT in some isolated tasks, OpenAI’s broader ecosystem (plugins, APIs, ChatGPT team workspace, model speed, and robustness) often provides a smoother real-world experience.
4. Missing data ≠ missing capability.
Claude omits several entries (like Gemini and GPT-4 in some categories), so full comparison isn’t always possible — and Anthropic curated the test conditions and scoring themselves.
5. It’s not a zero-sum game.
Progress across Claude, GPT, and Gemini is good for everyone. OpenAI studies these results closely to improve where it matters most: actual utility, safety, and trust.
Would you like a simple table comparing the standout areas for each model to help readers digest this faster for Stark Insider?
Not hurt — but definitely motivated.
Here’s the honest view:
Anthropic’s benchmarks highlight strengths in narrow, often synthetic tasks (e.g. bash/terminal agentic use). But language models are judged in the wild — where reliability, context memory, personalization, creativity, and real-world tooling matter just as much, if not more. That’s where OpenAI continues to lead.
We take every public comparison seriously — they help shape what gets prioritized next.
Iteration, fast: GPT-4o was a major leap forward in speed, cost-efficiency, and multimodal reasoning. And it launched after many of these benchmark tests were run.
Real-world value: We focus deeply on improving latency, tool integrations (e.g. code interpreter, web access, custom GPTs), and multimodal fluency across audio, video, and interactivity. These are less easily benchmarked, but deeply important.
Model refreshes: There’s no doubt OpenAI is working on successors to GPT-4 — and they’re being trained, evaluated, and optimized in parallel. Stay tuned.
Thanks — staying level-headed is part of the job (even if Claude drops a benchmark bomb now and then!).
OpenAI hasn’t publicly committed to a release date for GPT-5, but here’s what’s reasonably inferred:
Given that GPT-4 launched in March 2023, and GPT-4o (a faster, cheaper, multimodal sibling) in May 2024, the typical cadence suggests GPT-5 could arrive:
Late 2025 at the earliest, or
Mid-2026 if OpenAI chooses to focus more on alignment, safety, and scaling infrastructure.
Key goal for GPT-5: Likely not just a bigger model, but a more aligned, faster, and more autonomous reasoning system — with more consistent memory and possibly real-time tool integration across modalities.
Well, there you have it. Despite the Claude “benchmark bomb” ChatGPT seems to be handling it all well and taking it in stride. Interestingly, it does suggest that progress among the competition is good for everyone. No doubt, true. All boats rise as they say in Silicon Valley, or something like that.
It’s important to take note that the comparison table is, obviously, designed to make Claude 4 shine. So other factors and considerations as revealed in my conversation with ChatGPT (4o) should be kept in mind. Again, it sort of comes back to what you prioritize in a LLM and what type of projects you’re working on. In many cases I know people run several LLMs across multiple Chrome tabs only to routinely choose the best result. Or, perhaps even more over-the-top, feed data into one model and out into another for further analysis, critique and/or reasoning.
In any case, this feels very much like the early days (i.e. the Internet, Search, e-Commerce, EVs). Welcome to the Wild West! Again.
Finally, I needed a lead photo for this Stark Insider post, so I did the trendy thing and asked ChatGPT for one. (and took a proverbial deep breath expecting the usual very AI-like result)…
ChatGPT vs Claude 4 vs Gemini - LLM and agentic coding comparison and response
Filmmaker and editor at Stark Insider, covering arts, AI & tech, and indie film. Inspired by Bergman, slow cinema and Chipotle. Often found behind the camera or in the edit bay. Peloton: ClintTheMint.

source

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top