OpenAI’s blog post claims that GPT-5 beats its previous models on several coding benchmarks, including SWE-bench Verified (scoring 74.9 percent), SWE-Lancer (GPT-5 thinking scored 55 percent), and Aider polyglot (scoring 88 percent), which test the model’s ability to fix bugs, complete freelance-style coding tasks, and work across multiple programming languages.
During the press briefing on Wednesday, OpenAI post-training researcher Yann Dubois prompted GPT-5 to “create a beautiful, highly interactive web app for my partner, who speaks English, to learn French.” He asked the AI to include features such as daily progress tracking and various activities like flashcards and quizzes, and noted that he wanted the app wrapped in a “very engaging theme.” Within a minute or so, the AI-generated app appeared. While it was only one on-rails demo, the result was an elegant site that delivered exactly what Dubois asked for.
“It’s a great coding collaborator, and it also excels at agentic tasks,” says Michelle Pokrass, a post-training lead. “It executes long chains of tool calls [meaning it better understands when and how to use tools like web browsers or external APIs], follows detailed instructions, and provides explanations of its actions.”
OpenAI also says in its blog post that GPT-5 is “our best model yet for health-related questions.” Across three OpenAI-created health LLM benchmarks—HealthBench, HealthBench Hard, and HealthBench Consensus—the system card (a document that describes the model’s technical capabilities and other research findings) states that GPT-5 thinking exceeds previous models “by a significant margin.” GPT-5 thinking scored 25.5 percent on HealthBench Hard, versus o3’s 31.6 percent score. These scores were validated by two or more physicians, according to the system card.
The model is also supposed to hallucinate less, according to Pokrass—a common AI failure in which the model presents false information. OpenAI safety researcher Alex Beutel adds that the company has “significantly reduced the rates of deception in GPT-5.”
“We have taken steps to reduce GPT-5 thinking’s propensity to deceive, cheat, or hack problems, though our mitigations are not perfect and more research is needed,” the system card says. “In particular, we trained the model to fail gracefully when posed with tasks that it cannot solve.”
The company’s system card says that after testing GPT-5 models without access to browsing, researchers found the model’s hallucination rate (which they defined as the “percentage of factual claims that contain minor or major errors”) to be 26 percent lower than that of GPT-4o. GPT-5 thinking has a 65 percent lower hallucination rate compared with o3.
For prompts that could be dual-use (potentially harmful or benign), Beutel says GPT-5 uses “safe completions,” an approach that encourages the model to “give as helpful an answer as possible, but within the constraints of staying safe.” OpenAI has conducted over 5,000 hours of red teaming, according to Beutel, along with testing by external organizations to ensure the system is robust.
OpenAI says it now has nearly 700 million ChatGPT users, 5 million paying business users, and 4 million developers using its API.
“The vibes of this model are really good, and I think people will really feel that,” says ChatGPT head Nick Turley. “Especially average people who haven’t spent their time thinking about models.”