Google vs. Nvidia – Lightspeed

Google fails to dent Nvidia.

  • Google’s claims of superior performance for its AI chips are somewhat misleading: the performance difference appears to come from inter-chip data transport rather than processing speed, and Google benchmarks its product against one of Nvidia’s older chips.
  • Google has published a paper (see here) lauding the capabilities of its in-house-designed 4th-generation TPU (TPU v4) when 4,096 units are combined into a single supercomputer.
  • The key difference between Nvidia and Google here is how the chips move data between one another, a critical function when 4,096 chips are used for a single task such as LLM training (a sketch of why this matters follows this list).
  • Google is using optical switching, whereas Nvidia’s product is based on the InfiniBand standard, with hardware made by Mellanox, which Nvidia acquired in 2020.
  • Google is claiming a 2.7x improvement in performance per watt over TPU v3 and a 1.5x speed advantage over the Nvidia A100 while consuming 1.6x less power.
  • Unfortunately, while Google was measuring its latest creation, Nvidia released the H100, which it claims is 9x faster than the A100 for model training and 30x faster for inference (processing requests on a pre-trained model).
  • Using Google’s figures, it is easy to see that its latest product is still quite far behind Nvidia’s H100 (the rough arithmetic follows this list), which probably explains why very few people use Google’s servers and TPUs to train their models.
  • Instead, the trend has been to build one’s own server from Nvidia chips or to rent Nvidia capacity from Microsoft Azure or AWS.
  • This is also why I suspect that Google Cloud has bitten the bullet and done a deal with Nvidia to run the H100 in its servers.
  • This does not mean that Google is giving up, but it does mean that Google Cloud clearly thinks that it needs to make Nvidia resources available in order to win new customers.
  • Google Cloud remains a distant third behind AWS and Azure, which is what I suspect lies behind its move to support Nvidia in its data centres.
  • It also looks like Google’s improvement over the A100 comes from using optical networking between its chips rather than the InfiniBand standard that Nvidia uses.
  • InfiniBand can run over optical links, but Google says its proprietary twisted 3D torus topology is much faster and consumes less power (a sketch of the basic torus idea follows this list).
  • This is often the advantage of a proprietary technology: one can specify it exactly to one’s own requirements, but it also means that it won’t work with anyone else’s technology.
  • As a chip supplier rather than a service provider, Nvidia does not really have this luxury, and so it needs to go with a standard that everyone recognizes and can work with.
  • Google’s proprietary technology is unlikely to ever make it out of its data centre.
  • Hence, I don’t think that this new technology from Google means very much for Nvidia, and I don’t see everyone rushing to Google Cloud to make use of its TPUs.
  • In fact, quite the opposite seems to be happening as Google Cloud will now support Nvidia in order to win new clients.
  • Nvidia remains in pole position to benefit from the current feverish excitement surrounding generative AI, which is almost certain to see excessive spending on training and inference capacity.
  • This will be a direct and immediate benefit to Nvidia.
  • This will come to an end when the money runs out and the results of the spending do not live up to the extravagant promises that have been made.
  • It is at that time that Nvidia’s very high valuation may begin to weigh on its share price but for now, the only way looks to be up.
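
The sketch promised above: a minimal data-parallel training step in JAX. It is purely illustrative and assumes nothing about Google’s or Nvidia’s actual training stacks; the model, names, and sizes are invented. What it shows is that every optimizer step ends with an all-reduce in which the full gradient set crosses the chip-to-chip interconnect, which is why the fabric (optical switching for TPU v4, InfiniBand for Nvidia clusters) sets the wall-clock floor on a 4,096-chip job.

```python
# Minimal data-parallel training step: a sketch only, not any vendor's
# production stack. Model, shapes, and the learning rate are illustrative.
import functools

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model standing in for an LLM forward pass.
    preds = x @ params
    return jnp.mean((preds - y) ** 2)

@functools.partial(jax.pmap, axis_name="chips")  # one program per chip
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # The collective: with N chips and P parameters, a ring all-reduce
    # moves on the order of 2P values per chip per step across the
    # inter-chip fabric, so link bandwidth bounds step time.
    grads = jax.lax.pmean(grads, axis_name="chips")
    return params - 0.01 * grads

# Usage: replicate parameters across the local devices and feed each chip
# its own shard of the batch (leading axis = device count).
n = jax.local_device_count()
params = jnp.broadcast_to(jnp.zeros((8, 1)), (n, 8, 1))
x, y = jnp.ones((n, 4, 8)), jnp.ones((n, 4, 1))
params = train_step(params, x, y)
```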
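And the rough arithmetic behind the H100 comparison. Both companies benchmark against the A100, so it can serve as a common yardstick; bear in mind these are the vendors’ own figures on different workloads, not a head-to-head test, so the result is only indicative.

```latex
% Google's claim:  TPU v4 is ~1.5x the A100 on training
% Nvidia's claim:  H100 is ~9x the A100 on training
\frac{\text{H100}}{\text{TPU v4}} \approx \frac{9 \times \text{A100}}{1.5 \times \text{A100}} = 6
% i.e. on the vendors' own numbers, the H100 is roughly 6x TPU v4 for training.
```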
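Finally, the torus point. Google’s twisted variant is proprietary, so the code below is a hypothetical illustration of plain (untwisted) 3D torus addressing, not Google’s design. It shows the basic appeal of the shape: 4,096 chips fill a 16x16x16 grid, each chip links directly to six neighbours, and every axis wraps around, which keeps worst-case hop counts low.

```python
# Neighbor addressing in a plain (untwisted) 3D torus; illustrative only.
def torus_neighbors(coord, dims=(16, 16, 16)):
    x, y, z = coord
    nx, ny, nz = dims
    # Six direct links per chip; the modulo gives the wrap-around edges.
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]

# Wrap-around halves the worst case along each axis: at most 16 // 2 = 8
# hops per axis, so the diameter is 8 + 8 + 8 = 24 hops even though the
# machine spans 4,096 chips.
```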

RICHARD WINDSOR

Richard is the founder and owner of the research company Radio Free Mobile. He has 16 years of experience in sell-side equity research. During his 11-year tenure at Nomura Securities, he focused on equity coverage of the Global Technology sector.