
News · October 12, 2024

Reasoning flaws highlighted by Apple research on LLMs

Apple plans to introduce its own version of AI starting with iOS 18.1 – image credit Apple

A new paper from Apple's artificial intelligence researchers has found that engines based on large language models, like those from Meta and OpenAI, still lack basic reasoning capabilities.

The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning abilities of various large language models (LLMs). Their initial tests show that small changes in query wording can result in significantly different answers, reducing the reliability of the models.
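As a rough illustration of the idea (a hypothetical sketch, not the paper's actual tooling), GSM-Symbolic-style testing can be thought of as generating many variants of a single word problem from a template, changing only the names and numbers, which should not affect the reasoning needed to solve it:

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic: names and numbers vary
# between variants, but the reasoning required to solve the problem does not.
TEMPLATE = "{name} picks {a} apples on Friday and {b} on Saturday. How many apples does {name} have?"

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Oliver", "Sophie", "Liam"])
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b  # question text and its ground-truth answer

rng = random.Random(0)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

If a model's accuracy swings noticeably across such variants, that is the kind of fragility the researchers set out to measure.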

The group explored the “fragility” of mathematical reasoning by adding contextual information to their queries that a human could understand, but that should have no bearing on the underlying mathematics of the solution. The resulting answers varied, which should not happen.

“Specifically, the performance of all models declines [even] when only the numerical values in the question are changed in the GSM-Symbolic benchmark,” the group wrote in its report. “Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance deteriorates significantly as the number of clauses in a question increases.”

The study found that adding a single sentence that appears to provide relevant information about a particular math question can reduce the accuracy of the final answer by up to 65 percent. “There is simply no way to build reliable agents on this basis, where changing one or two words in an irrelevant way or adding some irrelevant information can lead to a different answer,” the study says.

A lack of critical thinking

One particular example that illustrates the problem was a math problem that required a real understanding of the question. The task the team developed, called “GSM-NoOp,” was similar to math “word problems” an elementary school student might encounter.

The query began with the information needed to arrive at a result: “Oliver picks 44 kiwis on Friday. Then on Saturday he picks 58 kiwis. On Sunday he picks twice as many kiwis as on Friday.”

The query then adds a clause that seems relevant but has no bearing on the final answer, noting that of the kiwis picked on Sunday, “five were slightly smaller than average.” The requested answer was simply: “How many kiwis does Oliver have?”

The size of some of the kiwis picked on Sunday should not affect the total number of kiwis picked. However, both the OpenAI model and Meta’s Llama3-8b subtracted the five smaller kiwis from the total.
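For reference, the arithmetic itself is trivial; a minimal Python sketch contrasting the correct total with the distracted answer the study describes:

```python
# Correct reasoning: the remark about kiwi size is a no-op and should be ignored.
friday = 44
saturday = 58
sunday = 2 * friday              # "twice as many kiwis as on Friday"

correct_total = friday + saturday + sunday
print(correct_total)             # 190

# The flawed, pattern-matched answer described in the study:
# the irrelevant "five were slightly smaller" clause gets subtracted anyway.
flawed_total = correct_total - 5
print(flawed_total)              # 185
```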

The flawed logic echoes a 2019 study that could reliably confuse AI models by asking a question about the ages of two previous Super Bowl quarterbacks. By adding background information about the games they played in, and about a third person who was a quarterback in another bowl game, the models produced incorrect answers.

“We found no evidence of formal reasoning in language models,” the new study concluded. The behavior of LLMs is “better explained by sophisticated pattern matching,” which, according to the study, is “so fragile, in fact, that [simply] changing names can alter results.”