Research: Top AI coding tools make mistakes one in four times

https://uwaterloo.ca/news/media/top-ai-c...four-times

PRESS RELEASE: New research from the University of Waterloo shows that artificial intelligence (AI) still struggles with some basic software development tasks, raising questions about how reliably AI systems can assist developers. As Large Language Models (LLMs) are increasingly incorporated into software development, developers have struggled to ensure that AI-generated responses are accurate, consistent, and easy to integrate into larger development workflows.

Previously, LLMs responded to software development prompts with free-form natural language answers, which other programs cannot reliably parse. To address this problem, several AI companies, including OpenAI, Google and Anthropic, have introduced “structured outputs”. These outputs force LLM responses to follow predefined formats such as JSON, XML, or Markdown, making them easier for both humans and software systems to read and process.
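As an illustrative sketch of why structured outputs matter (this is not any vendor's actual API; the reply strings and expected keys are invented for the example), compare how easily a program consumes a free-form answer versus a JSON-formatted one using Python's standard library:

```python
import json

# A free-form answer: a human can read it, but a program cannot
# reliably extract the function name or its arguments from it.
free_form = "Sure! The function's name is parse_log and it takes one argument."

# A structured output follows a predeclared shape: here, an object
# with a "name" string and an "args" list of strings.
structured = '{"name": "parse_log", "args": ["path"]}'

def consume(reply: str) -> dict:
    """Parse a structured reply and verify it matches the expected shape."""
    data = json.loads(reply)  # raises immediately if the model broke the format
    assert isinstance(data["name"], str)
    assert isinstance(data["args"], list)
    return data

result = consume(structured)
print(result["name"])  # parse_log
```

The key point is that the structured reply either parses into the agreed shape or fails loudly, whereas the free-form reply fails silently somewhere downstream.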

A new benchmarking study from Waterloo, however, shows that the technology is not yet as reliable as many developers had hoped. Even the most advanced models achieved only about 75 per cent accuracy in the tests, while open-source models performed closer to 65 per cent. The study evaluated 11 LLMs across 18 structured output formats and 44 tasks designed to assess how reliably the systems followed structured rules.

“With this kind of study, we want to measure not only the syntax of the code – that is, whether it’s following the set rules – but also whether the outputs produced for various tasks were accurate,” said Dongfu Jiang, a PhD student in computer science and co-first author on the research. “We found that while they do okay with text-related tasks, they really struggle on tasks involving image, video, or website generation.”
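Jiang's distinction between checking syntax (does the output follow the format's rules?) and checking accuracy (is the content right?) can be sketched as a two-stage scorer. This is a minimal illustration, not the StructEval implementation, and the JSON task and expected values below are invented for the example:

```python
import json

def score(reply: str, expected: dict) -> tuple[bool, bool]:
    """Return (syntax_ok, content_ok): does the reply parse as valid JSON,
    and if so, does it carry the expected values?"""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return (False, False)  # broke the format: both checks fail
    content_ok = all(data.get(k) == v for k, v in expected.items())
    return (True, content_ok)

# Malformed output: fails the syntax check outright.
print(score('{"language": Python}', {"language": "Python"}))    # (False, False)
# Well-formed but wrong content: passes the syntax check only.
print(score('{"language": "Java"}', {"language": "Python"}))    # (True, False)
# Well-formed and correct: passes both checks.
print(score('{"language": "Python"}', {"language": "Python"}))  # (True, True)
```

Separating the two checks is what lets a benchmark report that a model writes syntactically valid JSON yet still gets the task wrong, which is the gap the Waterloo study measures.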

The study was a collaborative effort involving Waterloo’s Jialin Yang, an undergraduate student, and Dr. Wenhu Chen, an assistant professor of computer science, and incorporated annotations from 17 other researchers at Waterloo and around the world.

“There have been a lot of similar benchmarking projects happening in our labs recently,” Chen said. “At Waterloo, students often begin as annotators, then organize projects and create their own benchmarking studies. They’re not just using AI in their studies – they’re building, researching and evaluating it.”

While LLM-structured outputs are an exciting step for software development, the researchers say the systems are not yet reliable enough to operate without human oversight. “Developers might have these agents working for them, but they still need significant human supervision,” Jiang said.

The research, “StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs,” appears in Transactions on Machine Learning Research and will be presented at ICLR 2026.