It seems like all of a sudden, everybody is talking about Groq, the AI chip startup founded by the original creator of Google’s TPU. Why?
Well, first, they’ve been very good at publicity lately. More specifically, though, that publicity push is tied to a big new product release: they recently started offering an inference API that serves up open models at record speeds. There’s no question that what they’ve built is impressive, but there is one big question: is it enough to actually take a run at Nvidia?
Groq isn’t a newcomer to the AI chip space; they were founded all the way back in 2016 by Jonathan Ross, the original creator of Google’s TPU. But Groq isn’t just a better TPU. Ross and his team developed a new architecture, originally called the Tensor Streaming Processor (TSP) but since rebranded as the Language Processing Unit (LPU) -- good marketing strikes again! The TSP/LPU is one giant, chip-sized core that offers a ton of FLOPS very efficiently, using a novel strategy: deterministic compute.
What exactly is deterministic compute?
Groq’s main architectural advantage is that they’ve developed an entirely deterministic architecture, completely controlled by software. What does that actually mean, and why is it more efficient? The key contrast is that most applications are non-deterministic. Think of a word processor, for example. The word processor doesn’t know the next letter you’re going to type, so it has to be ready to respond to any input it could get. It turns out that a lot of the complexity of modern chips comes from being able to handle that kind of non-deterministic computing.
The insight that Groq (and some similar startups) had is that, if you remove support for those non-deterministic operations, you can get a significant performance boost and a reduction in power consumption. They eliminate features like branch prediction, cache prefetching, and out-of-order execution, and instead devote more of the silicon to matrix multiplication. And it turns out, this tradeoff works really well for AI applications like large language models! If you think about it, there are no “if statements” inside an LLM. It’s just matrix math from beginning to end.[1]
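To make that concrete, here’s a minimal NumPy sketch (my own toy example, not Groq’s code) of a transformer-style feed-forward block. Every input follows exactly the same sequence of matrix multiplies and element-wise ops -- there’s no data-dependent branching for the hardware to predict.

```python
import numpy as np

def ffn_block(x, w1, b1, w2, b2):
    """One transformer-style feed-forward block.

    Note the control flow: every input takes exactly the same
    sequence of matrix multiplies and element-wise ops. There is
    no data-dependent branching anywhere in the computation.
    """
    h = x @ w1 + b1          # matrix multiply + bias
    h = np.maximum(h, 0.0)   # ReLU: element-wise, still branch-free
    return h @ w2 + b2       # matrix multiply + bias

# Toy shapes: batch of 4 tokens, hidden size 8, FFN size 32.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1, b1 = rng.standard_normal((8, 32)), np.zeros(32)
w2, b2 = rng.standard_normal((32, 8)), np.zeros(8)
print(ffn_block(x, w1, b1, w2, b2).shape)  # (4, 8)
```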
How do you schedule your dataflow without any if statements? Groq turns to software. In their stack, it’s the compiler’s job to explicitly and deterministically lay out the entire dataflow of the program from beginning to end. And it seems to work: the performance numbers Groq is putting up are really impressive, so their core insight about deterministic computing appears to be paying off.
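As a loose illustration (this is not Groq’s actual toolchain, just a toy of mine), you can think of the compiler as assigning every operation a fixed issue cycle ahead of time, so the hardware never makes a scheduling decision at runtime:

```python
from dataclasses import dataclass

# Toy static scheduler: because every op's latency is known and fixed,
# the "compiler" can pin each op to an exact cycle before execution starts.

@dataclass
class ScheduledOp:
    cycle: int      # exact cycle this op issues on
    name: str       # e.g. "matmul", "add_bias", "relu"
    latency: int    # known, fixed latency in cycles

def compile_static_schedule(ops):
    """Lay out a dependent chain of ops back-to-back, deterministically."""
    schedule, cycle = [], 0
    for name, latency in ops:
        schedule.append(ScheduledOp(cycle, name, latency))
        cycle += latency  # next op starts exactly when this one finishes
    return schedule

program = [("load_weights", 4), ("matmul", 16), ("add_bias", 1), ("relu", 1)]
for op in compile_static_schedule(program):
    print(f"cycle {op.cycle:3d}: {op.name}")
```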
However, I’m unsure how well Groq’s deterministic chips will be able to handle sparsity. When you’re working with a sparse neural network, a lot of your weight and activation values will be zero, and you can potentially save time and power by skipping those zeroes. The problem is that deterministic execution doesn’t play nicely with dynamic, data-dependent sparsity: before you start running a computation, you don’t know which activations will be zero, so a statically scheduled chip can’t gain any performance by skipping them.
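Here’s a toy NumPy example (mine, not Groq’s) of why that’s awkward: a zero-skipping matrix-vector product does an amount of work that depends on values only known at runtime, which is exactly what a fully static schedule can’t exploit.

```python
import numpy as np

# Dense vs. zero-skipping mat-vec: the results match, but the zero-skipping
# version's workload is only known once the activations arrive.

def dense_matvec(w, x):
    """Fixed work: every weight participates, schedulable ahead of time."""
    return w @ x

def zero_skipping_matvec(w, x):
    """Data-dependent work: only columns where x is nonzero are touched."""
    nz = np.flatnonzero(x)
    return w[:, nz] @ x[nz]

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
x[rng.random(16) < 0.75] = 0.0  # roughly 75% of activations are zero

assert np.allclose(dense_matvec(w, x), zero_skipping_matvec(w, x))
print(f"nonzero activations this run: {np.count_nonzero(x)} of {x.size}")
```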
Current models are fairly dense, so this isn’t an issue, but if sparsity becomes a big part of building more efficient networks in the future, Groq’s performance advantage may decrease. As a reformed neuromorphics guy, I believe that sparsity is the future of efficient AI, which makes me personally a bit wary about Groq’s solution, but ultimately time will tell whether or not I’m right about that. Also, shout-out to Sam, Alex, and Scott at Femtosense, who are building the best sparse AI hardware on the planet!
Where the hell is all the memory?
One thing you might notice when you read Groq’s tech specs is the total lack of HBM or DRAM. When I first saw that, I thought it was a mistake, or that the datasheet I was looking at just didn’t list it because it wasn’t important or impressive enough. But with a bit more digging I found that no, Groq cards literally have no off-chip memory. You have 220 MB of SRAM on each die, and that’s it. That means that if you want to run a large model, you need a lot of chips. To run LLaMA-70B, Groq used 576 chips. That’s 8 entire racks for a 70B parameter model!
What happens if you look at something like GPT-4, which reportedly has around 1.76 trillion parameters? To run it on a Groq system, you’d need over 10,000 chips. Each rack contains 72 chips, so you’d need 100+ racks per instance of a GPT-4-class system. Plus, at that scale, networking all of those chips starts adding bottlenecks -- currently, Groq’s optical interconnect only scales up to 264 chips, not 10,000.
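The back-of-the-envelope math is easy to check. The sketch below uses my own assumptions (8-bit weights, ignoring activations, KV cache, and any replication, all of which push the real chip count higher, as the 576-chip LLaMA-70B deployment shows) to illustrate why weights alone force you into hundreds or thousands of chips:

```python
# Rough capacity math: how many 220 MB chips do the weights alone need?
# Assumptions (mine): 8-bit weights, no activations/KV cache/replication.

SRAM_PER_CHIP_MB = 220
CHIPS_PER_RACK = 72

def chips_for_weights(params_billions, bytes_per_param=1):
    weight_mb = params_billions * 1e9 * bytes_per_param / 1e6
    return weight_mb / SRAM_PER_CHIP_MB

for name, params_b in [("LLaMA-70B", 70), ("GPT-4 (rumored size)", 1760)]:
    chips = chips_for_weights(params_b)
    print(f"{name}: ~{chips:,.0f} chips (~{chips / CHIPS_PER_RACK:.0f} racks) "
          "just to hold the weights")
```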
The thing that still perplexes me is why Groq is committing to an all-SRAM approach. From a 10,000-foot view, Groq is basically a dataflow architecture. Other dataflow architectures, like SambaNova’s, have upwards of a terabyte of DRAM, plus HBM.[2] This makes SambaNova’s chips more appealing to different classes of customers: enterprises who want chips for AI inference, but aren’t looking to spend many millions of dollars upfront on racks on racks of chips.
It doesn’t seem like there’s any technical bottleneck stopping Groq from using DRAM and HBM on their chips just like SambaNova does. Instead, I’d guess one of two things is happening. On the one hand, this could be a conscious choice to focus exclusively on the largest enterprises, like AI API providers, who are willing to pay a significant upfront cost to serve the lowest-latency tokens. But it could also be that Groq just got stuck here. The chips they’re currently running were manufactured in 2020, before the explosion of big models. As recently as 2022, their research output was focused on models that could be stored on a single chip (220 MB, or about 100 million BF16 parameters). I think there’s a very real chance that Groq ditched HBM for SRAM because they assumed a cluster of eight 220 MB cards would be able to store most models. Now that state-of-the-art models are so much larger, they have to double down on 8+ rack deployments, because the chips they already have don’t support HBM or DRAM.
For what it’s worth, there is one advantage of eschewing HBM and DRAM and sticking entirely with SRAM: DRAM and HBM are really expensive, and the chip-on-wafer-on-substrate packaging used to integrate HBM has limited manufacturing capacity. As such, a Groq card probably costs a fair bit less than its Nvidia counterpart, so if you’re already planning on devoting many racks to a single model, Groq might actually offer a lower upfront cost.
The Verdict
It’s impossible to deny that Groq’s design, with local weights and deeply pipelined operation, offers incredible performance when running inference on a single large model. The biggest downside is that their design is very opinionated about how you should be running things. If you’re not devoting hundreds of chips to a single model, all of those amazing performance numbers likely evaporate. This puts Groq in a weird position: they have the best publicly available inference hardware out there right now,[3] but whether their chips are the future of AI hardware depends largely on how the AI market shakes out.
For simplicity, I’ll paint two futures for AI inference. In the “many models” future, many different organizations, big and small, will have their own custom, fine-tuned models for different tasks. AI inference will be splintered, and devoting hundreds of chips and multiple racks to each model won’t make sense. In this world, Groq hardware isn’t economically viable for most AI inference use cases, no matter how impressive their performance numbers are.
The second future is the “monolithic models” future. In this reality, most models are large, generic models provided to developers via an API. Infrastructure companies can afford to sink millions and millions of dollars into hundreds of racks of Groq chips and put a single massive model on them, because that single model will have enough demand to justify the large upfront cost of maximizing performance.
Ultimately, I think Groq is the best hardware solution on the market if you’re doing enough inference volume with a single model that the upfront cost doesn’t matter. As such, it’s a great solution for companies providing APIs, and not a great solution for a lot of other use cases.
Some Closing Thoughts
But even if the “monolithic models” future comes to pass, there’s one more roadblock for Groq: companies building their own hardware. Google already has their TPUs, and Meta is working on inference accelerators like MTIA. If you’re a massive, consolidated infrastructure provider, you might want to build your own chips rather than buying Groq’s. If you’re not a consolidated infrastructure provider and want to run models at smaller scales, you’re not the target customer for a multi-rack system like Groq offers.
That’s why I think Groq offering an API of their own makes total sense. They have the best hardware infrastructure for serving APIs, and the huge API players might try to step onto Groq’s turf and make hardware, so Groq is vertically integrating and providing an API themselves. It’s a smart business decision that’s ultimately driven by the technical requirements of their product: the fastest LLM inference on the planet, if you’re willing to install 8 racks of dedicated hardware for it.
[1] Technically, at the very end of the LLM’s forward pass, you need to sample a token from the distribution the model outputs. But that’s one small step at the end of the computation.
[2] Full disclosure: I interned and worked full-time at SambaNova in 2019, 2020, and 2021. All of the information about their usage of DRAM and HBM mentioned here is public. I have no stock, options, or financial stake in SambaNova — they’re just an example of a dataflow architecture with DRAM & HBM.
[3] Personally, I think d-Matrix has a shot at winning in terms of overall performance-per-dollar and -per-watt, but we’ll see.