Recently, the Thiel Foundation announced their 2024 Thiel Fellows, a class of bright youngsters who will be paid $100,000 to drop out of college and start businesses. Past Thiel fellows have had some major successes, like Luminar, but whether or not this program actually accomplishes its goals of disrupting higher education is still under debate. Regardless, what surprises me the most is seeing the founders of Etched, an AI chip startup, on the list.
Etched has a simple pitch. Their chips are specialized for running transformers, so they run faster than more generalized GPUs. And Etched isn’t the only company taking this straightforward approach to challenging Nvidia. Etched, SambaNova, and Graphcore were all founded on the same simple idea: if you build a chip that’s optimized for a specific workload like AI, it’ll offer higher performance than a generalized chip. As an example, if you only care about computing SHA hashes, a specialized bitcoin miner is far faster than a GPU, which is in turn faster than a CPU.
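To put rough numbers on that hierarchy, here’s a tiny illustrative sketch. The hash rates are order-of-magnitude figures I’m assuming for the sake of the comparison, not benchmarks of any particular device:

```python
# Rough, illustrative SHA-256 hash rates (orders of magnitude only, not benchmarks).
# The point is how much each additional level of specialization buys you.
HASH_RATES = {
    "CPU (fully general purpose)":   20e6,    # ~tens of MH/s
    "GPU (parallel, still general)": 1e9,     # ~GH/s
    "Bitcoin ASIC (SHA-256 only)":   100e12,  # ~100 TH/s
}

baseline = HASH_RATES["CPU (fully general purpose)"]
for name, rate in HASH_RATES.items():
    print(f"{name:32s} ~{rate:.0e} hashes/s  ({rate / baseline:,.0f}x a CPU)")
```

The pitch, in other words, is that a similar multiplier should be available for transformers.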
Unfortunately, simple and elegant pitches don’t always tell the full story.
Specialization isn’t enough
When it comes to AI, Nvidia isn’t a sitting duck. They clearly realize that AI workloads are incredibly important, and are working hard to add specialized hardware features to their existing GPUs to accelerate LLMs and other models. The Tensor Core was introduced with the Volta architecture in 2017, and each generation since has improved Tensor Core performance and added more advanced mixed-precision training features. Nvidia is well-capitalized and has some of the best silicon engineers in the world, and they’ve proven over the last few years that they’re highly capable of continuing to innovate on AI hardware.
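For a sense of what that specialized hardware looks like from the software side, here’s a minimal PyTorch sketch of the mixed-precision training pattern Tensor Cores are built to accelerate. It assumes a CUDA GPU, and the model and sizes are arbitrary placeholders:

```python
import torch

# Minimal mixed-precision sketch: matmuls inside autocast run in FP16,
# which is exactly the work Tensor Cores accelerate on Volta-and-later GPUs.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda")

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients so FP16 doesn't underflow

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()   # forward pass in reduced precision

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

None of this is exotic: the point is that Nvidia’s “general purpose” GPUs already have a deep, well-supported specialization path for exactly the workloads these startups are targeting.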
On the other hand, Nvidia’s startup competitors haven’t been doing so well. Sequoia wrote off its stake in Graphcore after the chip startup lost a major deal with Microsoft. Now, Graphcore is reportedly up for sale. Ultimately, these struggles stem from subpar hardware performance. Graphcore’s latest chip, the Bow IPU, underperforms Nvidia’s 4-year-old A100.
Even though Graphcore’s chips are ostensibly more tailored to AI workloads than Nvidia’s are, Nvidia simply makes more optimized chips overall. Nvidia has in-house teams to design every key part of their silicon. They design their own SRAM arrays and high-speed die-to-die transceivers, both of which are better than foundry-provided designs. Even if Graphcore manages to get a 50% performance boost via specialization, Nvidia gets their own performance boost from optimization. And it’s much easier for Nvidia to specialize than for Graphcore to optimize; hiring the diverse set of engineers needed to design datacenter-class chips with Nvidia’s level of optimization is incredibly expensive for a startup.
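As a back-of-the-envelope illustration of that tradeoff, with made-up numbers chosen only to show the shape of the argument:

```python
# Hypothetical, illustrative numbers -- not measurements of any real chip.
generic_baseline = 1.0

specialization_gain = 1.5    # hypothetical boost from an AI-specific architecture
startup_chip = generic_baseline * specialization_gain

custom_circuit_gain = 1.25   # hypothetical gains from in-house SRAM arrays, transceivers, ...
physical_design_gain = 1.3   # ...plus better physical design, clocking, and power delivery
nvidia_chip = generic_baseline * custom_circuit_gain * physical_design_gain

print(f"startup: {startup_chip:.2f}x baseline, incumbent: {nvidia_chip:.2f}x baseline")
# startup: 1.50x baseline, incumbent: 1.62x baseline -> the specialized chip still loses
```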
How about Etched? Well, they have even less money than Graphcore, which makes executing on custom silicon even harder. They don’t have chips in production yet, so while their performance estimates seem good, they can’t provide any hard numbers to back them up. They’re also planning on outsourcing the physical design of their chips, which significantly limits their ability to optimize their chips’ performance and power consumption. They may very well have some secret sauce other than specialization that makes them confident in their ability to compete with Nvidia, but from their public-facing materials, they seem to be banking on specialization as their main advantage. And as we’ve seen with Graphcore, specialization isn’t enough.
Ultimately, I think that startups that are trying to beat Nvidia head-to-head are going to meet the same fate as Graphcore. Trying to beat a team of 25,000 incredibly talented chip engineers as a scrappy startup is nearly impossible. If you want to beat Nvidia, you need to counterposition against them.
Counterpositioning
Counterpositioning is one of Hamilton Helmer’s 7 powers, a concept I learned about through the Acquired podcast. Each of the powers represents a way for a business to earn more profits than its closest competitor. Helmer defines counterpositioning as follows:
A newcomer adopts a new, superior business model which the incumbent does not mimic due to anticipated damage to their existing business.
A classic example is early Netflix, back when they were still mailing DVDs. By eliminating physical locations and rental fees, Netflix was able to counterposition relative to Blockbuster. And Blockbuster, with their existing network of franchised retail locations, was unable to replicate Netflix’s strategy.
Nvidia’s existing business is selling large GPUs optimized for AI to datacenter customers. If a startup, like Graphcore, tries to do the same thing, they’re going head-to-head with a much stronger competitor. If instead that startup leverages a technology or a business model that Nvidia is unable or unwilling to fight them on, they’ll have a much clearer path to success.
Three companies that are having success using this strategy are Groq, Cerebras, and Lightmatter.
Groq: Specializing for Certain Customers
I’ve written about Groq before, but I’ll distill their strategy here. Groq offers incredible inference performance on large models, but only if you buy hundreds of their chips and network them together. If you buy just a few Groq chips, you can either get good performance on small models, or poor performance on large models.
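A rough back-of-the-envelope calculation shows why. Groq’s chips keep model weights entirely in on-chip SRAM (reportedly around 230 MB per chip) rather than in external HBM, so a model has to be sharded across enough chips to hold it. Everything in this sketch other than the SRAM figure is an illustrative assumption:

```python
# Why Groq deployments need hundreds of chips for big models: the weights
# live in on-chip SRAM (~230 MB per chip, no external HBM), so the model
# must be sharded across enough chips to fit.
SRAM_PER_CHIP_GB = 0.23

def chips_needed(params_billion, bytes_per_param=1, overhead=1.5):
    """overhead is a rough allowance for activations, KV cache, and routing slack."""
    weight_gb = params_billion * bytes_per_param
    return int(weight_gb * overhead / SRAM_PER_CHIP_GB) + 1

for params in (7, 70):
    print(f"{params}B params at 8-bit: ~{chips_needed(params)} chips")
# 7B  -> ~46 chips
# 70B -> ~457 chips: hundreds of chips before you can serve a single user
```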
This strategy limits Groq’s customer base: they are missing out on customers who want to have small deployments but still get acceptable performance on large models. Luckily, Groq is a small enough company that they can lose those customers as long as they offer a superior product for other customers.
Nvidia can’t execute this strategy. They’re too large of a company with too many existing customers who fall into the “large models, small deployments” quadrant. If they tried to build chips like Groq, they would lose existing customers, which would impact their bottom line and market cap. By optimizing for a specific customer base, Groq is counterpositioning against Nvidia so that Nvidia can't compete with them head-to-head.
Groq is making a bet here, though. Only time will tell if the “large models, large deployments” quadrant turns out to be the largest market for AI chips. If it’s not, Groq may remain a niche player, with good solutions for a select few customers but without achieving market dominance.
Lightmatter: Building Something Different
Lightmatter describes itself as a “photonic supercomputer company”, and that description isn’t inaccurate. Not only do they make photonic AI chips, they also make photonic interconnects. To quote their CEO, Dr. Nicholas Harris:
The two problems we are solving are ‘How do chips talk?’ and ‘How do you do these [AI] calculations?’ With our first two products, Envise and Passage, we’re addressing both of those questions.
By using light instead of electricity for both computing and communication, Lightmatter can drastically decrease the power consumption of AI workloads, while increasing data transfer speeds in AI datacenters. This new technical paradigm requires building an entirely new technology stack. Lightmatter is developing new fabrication technologies to replace wires with optical waveguides, new architectures to compute using light instead of bits, and new software to map neural networks to photonic chips.
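To make the computing half of that concrete, here’s a toy numerical sketch, emphatically not Lightmatter’s actual architecture: the core operation a photonic compute chip performs is an analog matrix-vector product, with limited precision as the price of doing it in light.

```python
import numpy as np

# Toy model of analog, in-flight matrix-vector multiplication. The weights are
# "programmed" into the optical mesh, the activations are encoded onto light,
# and the product emerges in a single pass -- with some analog noise.
rng = np.random.default_rng(0)

W = rng.normal(size=(64, 64)).astype(np.float32)   # layer weights, held by the mesh
x = rng.normal(size=64).astype(np.float32)         # activations, encoded optically

def photonic_matvec(W, x, noise_std=0.01):
    y = W @ x                                      # performed optically on-chip
    return y + rng.normal(scale=noise_std * np.abs(y).max(), size=y.shape)  # analog noise

exact = W @ x
approx = photonic_matvec(W, x)
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

The appeal is that the multiply itself costs very little energy; the challenge is keeping that analog error small enough for real models.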
Because Lightmatter’s technology is so different from the technology of conventional chips, they’re counterpositioning themselves relative to Nvidia. Generation by generation, Nvidia has delivered more specialized AI chips. But it would be nearly impossible for Nvidia to slowly move into photonic chips in the same way -- they’d have to make that entire jump at once, and throw away most of their existing chip designs to do it. And as long as GPUs keep selling, there isn’t a good reason for them to do it. If photonics takes off, Nvidia will definitely make an attempt to compete, but they’ll be starting from behind. And when you’re a startup like Lightmatter competing with a giant, talented incumbent, that’s exactly what you want.
Cerebras: Different Tech for New Customers
Cerebras is famous for making really big chips. And I mean really, really big. Their chip, known as the “wafer scale engine” or WSE, is the size of a dinner plate, while the largest GPUs are only 2 or 3 square inches.
At a very high level, you can think of the WSE as a bunch of GPUs networked together on one giant chip. And Cerebras has had to invent many new techniques to make this work. When you manufacture chips, a couple of chips on each wafer will usually have manufacturing defects that render them inoperable, and you have to throw them out. But if your chip is the size of the entire wafer, every chip will have defects! So the WSE has techniques to automatically detect defective cores and route data around them, enabling the chip to work even with multiple defects in different locations.
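As a toy illustration of the idea, here’s a generic detour-routing sketch, not Cerebras’ actual redundancy scheme: defective cores simply become holes that the on-chip fabric routes around.

```python
from collections import deque

# Toy 2D mesh of cores with a few manufacturing defects. A BFS finds the
# shortest path between two cores that avoids the dead ones -- a stand-in
# for the kind of defect-avoiding routing a wafer-scale fabric needs.
N = 8                                  # tiny N x N mesh
defective = {(2, 3), (3, 3), (4, 3)}   # hypothetical cores that failed at manufacturing

def route(src, dst):
    frontier, seen = deque([(src, [src])]), {src}
    while frontier:
        (r, c), path = frontier.popleft()
        if (r, c) == dst:
            return path
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nxt[0] < N and 0 <= nxt[1] < N and nxt not in seen and nxt not in defective:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

print(route((3, 0), (3, 7)))   # detours around the column of dead cores
```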
Also, delivering power to a chip and cooling a chip that’s bigger than your face is incredibly hard. Each WSE fits into a large, highly specialized system to ensure that it gets sufficient power and to prevent it from overheating. The WSE also generates so much heat that thermal expansion of the chip itself is a major concern, so the chip is mounted in a package that can flex and bend as the chip expands and contracts due to heating and cooling.
Cerebras is also counterpositioning relative to Nvidia by only targeting high-end customers. WSEs are incredibly powerful, but they’re also incredibly expensive. A small research lab which owns a handful of Nvidia 4090s to train models locally is never going to consider spending $2M or more on a single Cerebras system. If Nvidia tried to replace their GPU lineup with WSE-like systems, they would lose a large number of existing, loyal customers. But, just like Groq, Cerebras can focus on a smaller customer group--in this case, high-end customers. And just like Groq, Cerebras is betting that this customer group is going to be the most important in the long term.
If you come at the king…
Ultimately, Nvidia has shown that they’re not scared of doubling down on building chips optimized for AI. If startups want to compete with Nvidia, they need to focus on technologies or customer segments that Nvidia can’t or won’t focus on. For some startups, like Lightmatter, new technologies make it hard for Nvidia to follow in their footsteps. For others, like Groq, focusing on a smaller but growing customer base enables them to specialize in ways that a large public company like Nvidia can’t. And some companies like Cerebras are leveraging both strategies at once.
Admittedly, these counterpositioning strategies bring additional risk to the startups executing on them. New technologies can fail; specific customer bases may not pan out. However, these risks come with large potential rewards. If new technologies do work out, and new customer bases continue growing, these startups could have a significant leg up on Nvidia. At the same time, startups like Etched that are trying to take on Nvidia head-to-head face a starker risk: that Nvidia simply beats them on performance. And as Graphcore has shown, that risk is very real.