Disclaimer: I worked at SambaNova Systems between 2019 and 2021, and so I am going to be somewhat careful about discussing sensitive information. Despite the tongue-in-cheek subtitle, I don’t want to get in legal trouble. Please don’t sue me!
In late April, SambaNova Systems, one of the most well-funded AI chip startups out there, made a significant pivot away from their original goal. Like many other AI chip startups, SambaNova wanted to offer a unified architecture for both training and inference. But, as of this year, they've given up on their training ambitions, laid off 15% of their workforce, and are focusing entirely on AI inference. And they're not the first company to make this pivot.
In 2017, Groq was bragging about their training performance, but by 2022, they were entirely focused on inference benchmarks. The Cerebras CS-1 was originally sold primarily for training workloads, but with the CS-2 and later systems, the focus shifted to inference. SambaNova seemed to be the last holdout from that first generation of AI chip startups still seriously focused on training, but that's finally changed. So, why are all of these startups pivoting from training to inference? Luckily, as somebody who worked at SambaNova, I have a bit of an insider's perspective.
Training was always part of the plan
SambaNova was very serious about training models on their hardware. They put out articles about how to train on their hardware, bragged about their training performance, and addressed training in their official documentation. A lot of analysts and outside observers, including me, saw the ability to tackle both the inference and training markets with one chip as a unique edge that SambaNova had over competitors like Groq, which was one of the earliest startups to pivot to inference.
SambaNova also invested significant time and effort into enabling efficient training. When I was at the company between 2019 and 2021, I spent a considerable amount of time implementing a kernel for the NAdam optimizer, a momentum-based optimizer commonly used to train large neural networks. We had hardware and software features designed and optimized for training, and both internal and external messaging indicated that support for training was a key part of our value proposition.
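For the curious, the math at the heart of that kind of kernel is pretty compact. Here's a simplified NumPy sketch of a NAdam update step; it leaves out the momentum-decay schedule that full implementations like PyTorch's use, and it is obviously an illustration, not SambaNova's actual kernel code.

```python
import numpy as np

def nadam_step(theta, grad, m, v, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified NAdam update (Adam with Nesterov momentum).

    theta: parameters, grad: gradient at theta,
    m/v: running first/second moment estimates, t: 1-indexed step count.
    """
    # Update biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Nesterov-style bias-corrected first moment: look one momentum step ahead.
    m_hat = (beta1 * m) / (1 - beta1 ** (t + 1)) + ((1 - beta1) * grad) / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize the loss ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 301):
    grad = 2 * theta
    theta, m, v = nadam_step(theta, grad, m, v, t)
print(theta)  # should end up near the minimum at zero
```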
Now, all of a sudden, SambaNova is essentially scrapping most of that work to focus entirely on inference. I think they're doing this for three main reasons: inference is an easier problem to tackle, it may represent a larger market than training, and Nvidia totally dominates the world of AI training chips.
Inference is an easier, larger market.
Many analysts believe that the market size for AI inference could be ten times bigger than the market for AI training. Intuitively, this makes sense. Normally, you only train a model once, and then perform inference using that model many, many, many times. Each time you run inference, it costs far, far less than the entire training process for a model — but if you run inference using the same model enough times, it becomes the dominant cost when serving that model. If the future of AI is a small number of large models, each with significant inference volume, the inference market will dwarf the training market. But if many organizations end up training their own bespoke models, this future may not come to pass.
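To make that concrete with a toy calculation (every number below is invented purely for illustration, not a real market figure):

```python
# Back-of-envelope sketch: hypothetical numbers, chosen only to illustrate
# how cumulative inference spend can dwarf a one-time training cost.
training_cost = 50_000_000          # one-time cost to train the model ($)
cost_per_million_queries = 500      # serving cost per million inference requests ($)
queries_per_day = 200_000_000       # hypothetical traffic for a popular model

daily_inference_cost = queries_per_day / 1_000_000 * cost_per_million_queries
breakeven_days = training_cost / daily_inference_cost

print(f"daily inference spend: ${daily_inference_cost:,.0f}")
print(f"days until inference spend exceeds training cost: {breakeven_days:,.0f}")
```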
But even if inference doesn’t pan out to be a much larger market than training, there are technical reasons why inference is easier for AI chip startups to tackle. When training a model, you need to run a bunch of training data through that model, collect gradient information during the model’s operation, and use those gradients to update the model’s weights. This process is what allows the model to learn. It’s also extremely memory-intensive, as you need to cache all of those gradients as well as other values, like the model’s activations.
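Here's a minimal PyTorch sketch of that difference: a training step makes autograd record the graph and materialize a gradient tensor for every parameter, while an inference pass under torch.no_grad() keeps none of that bookkeeping.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
)
x = torch.randn(32, 1024)

# Training-style forward/backward: autograd records the graph so it can
# produce a gradient tensor for every parameter, roughly doubling the memory
# tied up in model state before you even count optimizer state.
loss = model(x).sum()
loss.backward()
print(all(p.grad is not None for p in model.parameters()))  # True

# Inference-style forward: no graph is recorded and no gradients are kept,
# so activations can be freed as soon as the next layer has consumed them.
with torch.no_grad():
    y = model(x)
print(y.requires_grad)  # False
```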
So, to efficiently perform training, you need a complex memory hierarchy with on-die SRAM, in-package HBM, and off-chip DDR. It’s hard for AI startups to get their hands on HBM, and hard to integrate HBM into a high-performance system, so many chips from startups like Groq and d-Matrix just don’t have the HBM or DDR capacity and bandwidth needed to efficiently train large models. Inference doesn’t have this problem. During inference, gradients don’t need to be stored, and activations can be discarded after they’re used. This vastly reduces the memory footprint of inference as a workload, and reduces the complexity of the memory hierarchy that inference-only chips need.
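To put rough numbers on it, here's a common rule-of-thumb calculation. The constants are assumptions that depend on the training recipe; the 16-bytes-per-parameter figure for mixed-precision Adam-style training is along the lines of what the ZeRO paper uses.

```python
# Rough memory-per-parameter arithmetic (rule-of-thumb constants; real recipes vary).
params = 70e9  # hypothetical 70B-parameter model

train_bytes_per_param = 2 + 2 + 4 + 4 + 4   # fp16 weights + fp16 grads + fp32 master
                                            # weights + Adam first/second moments
infer_bytes_per_param = 2                   # fp16 weights only (ignoring KV cache)

print(f"training state: ~{params * train_bytes_per_param / 1e12:.1f} TB")
print(f"inference weights: ~{params * infer_bytes_per_param / 1e12:.2f} TB")
# Training state alone is ~8x the inference weights, before counting activations,
# which is why training demands so much more HBM/DDR capacity and bandwidth.
```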
Another challenge is inter-chip networking. All of those gradients generated during training need to be synchronized across every chip used in the training process. That means you need a large, complex, all-to-all network to efficiently run training. Inference, on the other hand, is a feed-forward operation, with each chip only talking to the next chip in the inference pipeline.1 Many startups’ AI chips have limited networking capabilities, which makes them poorly suited for the all-to-all connectivity that training requires, but sufficient for inference workloads. Nvidia, meanwhile, has addressed both the memory and networking challenges of AI training extremely well.
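In PyTorch terms, the contrast looks roughly like this. This is only a sketch: it assumes a torch.distributed process group has already been initialized (e.g. via torchrun), and it glosses over real pipeline-parallel schedules.

```python
import torch
import torch.distributed as dist

def sync_gradients(model):
    """Data-parallel training step: every rank contributes to, and receives,
    every averaged gradient (all_reduce), which is what drives the
    all-to-all-style bandwidth requirement."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= dist.get_world_size()

def pipeline_forward(stage, x, rank, world_size):
    """Pipelined inference: each rank only receives activations from the
    previous stage and sends its output to the next one."""
    if rank > 0:
        dist.recv(x, src=rank - 1)           # activations from the previous chip
    with torch.no_grad():
        y = stage(x)
    if rank < world_size - 1:
        dist.send(y, dst=rank + 1)           # hand off to the next chip
    return y
```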
Nvidia is extremely good at training.
Nvidia has been the hardware of choice for both inference and training since the AlexNet days in 2012. Because of the versatility CUDA grants GPUs, they’re capable of performing all of the necessary operations for both training and inference. And over the past decade, Nvidia hasn’t just been building hyper-optimized chips for machine learning workloads; they’ve also been optimizing their entire memory and networking stack for large-scale training and inference.
With access to significant amounts of HBM in every package, Nvidia hardware can easily and efficiently cache all of the gradient updates generated by each training step. And with scale-up technologies like NVLink and scale-out technologies like InfiniBand, Nvidia hardware can handle the all-to-all networking required to update all of the weights of a large neural network after each training step completes. Inference-only competitors like Groq and d-Matrix simply lack the memory and networking capabilities to compete with Nvidia on training.
But SambaNova chips do have HBM, and they have a peer-to-peer network at both the server level and the rack level. So why can’t they tackle training the way Nvidia can?
Well, it turns out that Nvidia has more than just HBM and networking to give them a leg up on training performance. They’ve put significant effort into low-precision training, and top AI labs, in turn, have put just as much effort into tuning algorithm hyperparameters to work well with the specific intricacies of Nvidia’s low-precision training hardware. Shifting from Nvidia to SambaNova chips for training means porting extremely sensitive training code to entirely new hardware with an entirely new set of pitfalls. The cost and risk of doing that for a large, GPT-4-scale model is immense.
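To give a flavor of how entangled training code is with the hardware, here's a generic PyTorch mixed-precision loop with a placeholder model and data. The loss-scaling and dtype behavior in exactly this kind of loop is what labs tune against Nvidia's tensor cores, and what would need re-validation on new silicon.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# GradScaler performs dynamic loss scaling to keep fp16 gradients from
# underflowing; it becomes a pass-through when CUDA isn't available.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(10):
    x = torch.randn(64, 512, device=device)           # placeholder batch
    with torch.autocast(device_type=device, dtype=torch.float16,
                        enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()                  # placeholder loss
    scaler.scale(loss).backward()   # scale the loss before backward
    scaler.step(opt)                # unscale grads, skip the step if they overflowed
    scaler.update()                 # adjust the scale factor for the next step
    opt.zero_grad(set_to_none=True)
```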
SambaNova’s pivot to inference is proof that, even if an AI chip startup manages to offer memory and networking capabilities competitive with Nvidia’s, that’s not enough to take on the green giant in the training market. If a startup wants to challenge Nvidia on training, they need to offer training performance so impressive that it overcomes Nvidia’s inertia. And so far, nobody’s been able to pull that off.
1. This is an oversimplification, but it serves to illustrate the different networking requirements for training and inference.