Still no sign of reasoning.
- Apple has stuck the knife into the idea that large language models (LLMs) can reason, publishing a research paper that demonstrates LLMs making the sorts of mistakes that a small child or a pocket calculator would never make.
- The research paper (see here) and the X thread (see here), which summarises the findings in a more digestible format, raise serious questions and support many of the observations that I have made over the last 4 years or so.
- OpenAI has a reasoning test called GSM8K, upon which GPT-3 scored 35%, but these days even 8bn-parameter models can score 85%+, leading to the conclusion that models can now reason.
- Apple set out to test this by taking parts of the GSM8K data set and changing details such as the name of the person and the number of objects, changes that a pocket calculator can handle with 100% accuracy (a rough sketch of this sort of perturbation follows this list).
- The researchers observed a large variation in performance where most models scored worse than they did on the original GSM8K test.
- The researchers also found further decreases in accuracy when they made the questions more difficult by adding more operations or more conditions to a calculation.
- However, most telling was a test to see if the models really understood the mathematical concepts by inserting extra information into a question which was irrelevant to calculating the answer (GSM_NoOp).
- Once again, a small child and a pocket calculator would see straight through this trick, but the LLMs did not, with all of them registering a significant decline in performance.
- However, this test clearly shows that not all models are equal, with OpenAI’s o1-preview showing the smallest decline in accuracy at 17.5% and Phi-3-mini falling by 65.7% (fig. 8).
- There is also a general bias towards size, with the bigger models doing better than some of the smaller ones, but this was far from absolute, which also raises questions about the “bigger is better” approach to LLMs.
- The authors concluded that LLM “behaviour is better explained by sophisticated pattern matching – so fragile, in fact, that changing names can alter results by 10%”.
- This has been precisely RFM’s view since 2018, although RFM has never had such good data to back up this view.
- There is also plenty of other evidence to suggest that this is the case, such as OpenAI’s own data from its original GPT-3 paper in 2020 (see here) and other tests that show a very large decline in performance as the number of operations or the number of digits in the numbers being operated on increases.
- It has long been my contention that somewhere in these massive data sets the answers to these questions are to be found, but as the operations become more complex, the probability of those answers being present declines exponentially, yielding the results observed.
- These researchers conclude something similar although their explanation of what is happening is far more detailed and credible than mine.
- The net result is that I see more evidence that the machines are unable to reason and no real evidence that they can, but still, the creators of LLMs push their creations’ ability to reason.
- This is because the first step to creating superintelligent machines (much promised and very little delivered) will be the ability of these systems to reason, which is why this debate is so important.
- If the machines can’t reason, then the promises upon which multi-billion-dollar valuations are built will fall to pieces, setting the AI industry up for a painful correction.
- Much like the Internet before it, I think that AI has a long and bright future, but current expectations are way beyond what is possible now meaning that a painful reset is required as reality reasserts itself.
- For the moment, more excitement will lead to more investment, which means more demand for Nvidia, which remains the safest way to play the generative AI craze directly.
- However, I continue to prefer the adjacencies of inference at the edge (Qualcomm) and nuclear power as their valuations are far more reasonable and both will still perform well even if generative AI fails to live up to expectations.
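
To make the two perturbations described above concrete, here is a minimal sketch in Python of how a GSM8K-style question can be varied without changing the arithmetic. It is purely illustrative and assumes nothing about Apple’s actual GSM-Symbolic templates; the template, names, numbers and distractor clauses are hypothetical.

```python
import random

# Minimal, purely illustrative sketch of the two perturbations discussed above:
#   1) GSM-Symbolic style: change the name and the numbers in a GSM8K-style
#      question, which leaves the underlying arithmetic unchanged.
#   2) GSM_NoOp style: insert a clause that is irrelevant to the answer.
# The template, names, numbers and clauses below are hypothetical examples,
# not drawn from the actual benchmark.

NAMES = ["Sophie", "Liam", "Priya", "Diego"]
NOOP_CLAUSES = [
    "Five of the apples are slightly smaller than average.",
    "The orchard was planted twelve years ago.",
]


def make_variant(add_noop=False, seed=None):
    """Return (question, ground_truth) for one perturbed variant."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    statement = f"{name} picks {a} apples on Monday and {b} apples on Tuesday."
    if add_noop:
        # Irrelevant information: it changes the wording but not the answer.
        statement += " " + rng.choice(NOOP_CLAUSES)
    question = f"{statement} How many apples does {name} have in total?"
    return question, a + b  # the correct answer ignores the perturbations


if __name__ == "__main__":
    question, answer = make_variant(add_noop=True, seed=0)
    print(question)
    print("Expected answer:", answer)
```

The point of the sketch is that the ground-truth answer is computed identically regardless of the perturbation, which is precisely the invariance the paper reports LLMs failing to show.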
Blog Comments
Martin Jackson
October 15, 2024 at 11:54 am
I’m not necessarily surprised by this: LLMs are trained to associate words with one another. A different model/training approach would be required to make logical deductions, do mathematics, etc. This might be possible by using an LLM with RAG, which does seem like an effective combination.