Artificial Intelligence – Reasoning Debate pt. II

o1 runs before it can walk.

  • OpenAI has blown the reasoning argument wide open again with the launch of o1, but the model still makes fundamental mistakes, implying that it still lacks the basic building blocks of reasoning and leading me to wonder whether the rest is still just an illusion.
  • This “breakthrough” in LLM capability comes just at the moment when OpenAI is engaged in raising money at an even crazier valuation of $150bn pre-money.
  • OpenAI has launched a new model (the mythical Strawberry), now known as o1, which, looking at the data that OpenAI has provided (see here), gives a strong indication that this is a model that can properly reason to logical conclusions from first principles.
  • Most impressive is its performance in Competition Math and Competition Code compared to GPT-4o, where it scores 83.3 vs. 13.4 and 89.0 vs. 11.0 respectively.
  • It also significantly improves scores on physics, chemistry, calculus and legal exams, but because GPT-4o was already quite good at these, the improvement is much smaller.
  • o1 has also been built so that it can break down a problem into a series of steps and OpenAI claims that this gives it the ability to correct its own mistakes as it goes along and take a different approach when one is not working.
  • Hence at a high level, o1 appears to offer a substantial improvement in the ability to reason.
  • This is crucial because, as I have said many times, the biggest weakness of all systems based on deep learning is that they have no concept of causality, which is what prevents the machines from becoming superintelligent.
  • The first step along this road is the ability to reason, which is why, if o1 really is reasoning, then we are much closer to superintelligent machines than I or the other sceptics have opined.
  • However, there are some problems:
    • First, the basics: o1 demonstrates some of the usual weaknesses when it comes to the basics of reasoning.
    • In one example it is asked for the next move in a game of Tic-Tac-Toe but fails to spot that the position it has been given is impossible (see here).
    • This is a classic reasoning failure and when the unforgiving internet gets its hands on a working version, I am certain that there will be many others.
    • This is a sign that the reasoning being claimed is in fact a simulation based on a data set so massive that the answers to the queries are buried in the dataset and surfaced by the algorithm.
    • Second, this is not a scientific publication: it should only be taken as a press release issued by the PR department during a product launch.
    • OpenAI has not disclosed what the model looks like, how it was trained or any other details.
    • The results in the press release have not been peer-reviewed and testing by the scientific community is still not possible although I am sure this will come.
    • Hence, this should be viewed as a marketing press release until OpenAI makes enough information available for a rigorous peer review of its claims using scientific methods.
    • Third, domain limitations: o1 performs brilliantly in the domains it has been trained for, but in other domains older models still fare better.
    • This is a sign that o1 was designed to beat the tests rather than be a general intelligence, which again makes me wonder about the timing of the current fundraising.
  • The net result is that this represents a big improvement in areas which are known to be reasoning-heavy (like Math) but I do not see hard evidence of o1 being able to properly reason from first principles.
  • This may come as the Internet tests it to destruction, but I suspect that what the Internet will find is more flaws as opposed to concrete evidence of reasoning.
  • The ability to pass the simple reasoning test that if A = B then it follows that B = A, on made-up data, would be a big step towards this as, so far, all LLMs have catastrophically failed this most simple of tests.
  • Not until the basics are on solid ground can one have faith that the reasoning is real rather than an increasingly sophisticated simulation.
  • Either way, this will get a lot of chatter, and I suspect that this will help support yet another outlandish valuation of OpenAI now pegged at $150bn.
  • More excitement will lead to more investment meaning more demand for Nvidia which remains the safest way to play the generative AI craze directly.
  • However, I continue to prefer the adjacencies of inference at the edge (Qualcomm) and nuclear power as their valuations are far more reasonable and both will still perform well even if generative AI fails to live up to expectations.
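The impossible-position failure mentioned above is telling precisely because it is trivial to check mechanically. As a minimal sketch of my own (not OpenAI's test, and the exact board o1 was shown is not disclosed), a few lines of Python can verify whether a Tic-Tac-Toe position could ever arise from legal play:

```python
def wins(board: str, p: str) -> bool:
    """Return True if player p has three in a row on the 3x3 board.

    The board is a 9-character string, read row by row,
    using 'X', 'O' and '.' for empty squares.
    """
    lines = [board[0:3], board[3:6], board[6:9],     # rows
             board[0::3], board[1::3], board[2::3],  # columns
             board[0::4], board[2:7:2]]              # both diagonals
    return any(all(c == p for c in line) for line in lines)

def is_reachable(board: str) -> bool:
    """Check whether a position can occur in a legal game (X moves first)."""
    x, o = board.count("X"), board.count("O")
    if not (x == o or x == o + 1):   # move counts can differ by at most one
        return False
    xw, ow = wins(board, "X"), wins(board, "O")
    if xw and ow:                    # play stops at the first win, so
        return False                 # both sides cannot have a line
    if xw and x != o + 1:            # X's winning move must be the last move
        return False
    if ow and x != o:                # O's winning move must be the last move
        return False
    return True
```

For example, `is_reachable("XXX......")` is False, since X cannot have made three moves while O has made none. The point is that first-principles reasoning would derive these constraints from the rules of the game; a pattern-matcher that has merely seen many game transcripts can miss them.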

RICHARD WINDSOR

Richard is the founder and owner of the research company Radio Free Mobile. He has 16 years of experience in sell-side equity research. During his 11-year tenure at Nomura Securities, he focused on equity coverage of the global technology sector.