Taalas is what Etched should have been.
I’m generating 17,000 tokens per second, and they’re all wrong!
I’ve been an outspoken critic of Etched, the chip company that claims to be making AI chips that beat Nvidia by aggressively focusing on accelerating transformers and reducing programmability. This is a bad idea, because Nvidia is already focused on accelerating transformers, and they have more money and talent than Etched.
Recently, though, a new chip company called Taalas announced their first silicon. They’re also focusing on accelerating transformers by reducing programmability, but in a much, much more aggressive way. Etched wanted to build a chip that could only accelerate transformers; Taalas built a chip that could only accelerate a specific transformer model, namely Llama 3.1 8B. That specific model is hardwired into read-only memory (ROM) inside the chip, and can’t be changed.
This is a huge cost to pay. If you want to run a new model, you have to throw away the old chip and get a new one. But the benefits are also enormous. Taalas’ HC1 proof-of-concept chip generates 17,000 tokens per second, over two orders of magnitude more than their competitors. It’s truly absurdly fast, so naturally, people started going nuts about the tech demo. Today, let’s look at Taalas, separate the hype from the truth, and determine whether baking your model into ROM is actually a good idea.
Things people are getting wrong about Taalas
In the excitement about the Taalas announcement, a lot of people have gotten a lot of things wrong. Before we jump into where I think Taalas could carve out a potential market niche, I want to say what they definitely aren’t going to be able to do. First off: Taalas is never going to be a chip for local inference on consumer devices.
While the Taalas chip is smaller and lower-power than an all-SRAM chip like the ones Groq, d-Matrix, or Cerebras make, it’s still not well-suited for edge devices. These devices care about memory density more than compute. For example, the HC1 stores Llama 3.1 8B, an 8 billion parameter model, with 3 bits per parameter, on an 800mm^2, reticle-sized logic die. That’s about 3 gigabytes of storage, and requires a large PCIe card’s worth of packaging, power delivery, and cooling. My lower-end iPhone, on the other hand, has 256GB of storage, so it can easily store and run Llama 3.1 8B alongside all of the pictures I’ve taken of my cats. It’ll just be much slower, because weights will be swapped around between the SSD, DRAM, and on-chip SRAM, but it’s much more practical than strapping a PCIe card to the back of your smartphone or laptop.
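The 3-gigabyte figure falls straight out of the numbers above; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the HC1's weight storage requirement.
params = 8e9          # Llama 3.1 8B parameter count
bits_per_param = 3    # quantization level reported for the HC1

total_bits = params * bits_per_param
total_gb = total_bits / 8 / 1e9   # bits -> bytes -> gigabytes
print(f"{total_gb:.1f} GB of weight storage")  # -> 3.0 GB
```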
I also don’t think this is as big of a game-changer for coding agents as some people think it is. At least in my experience, developers significantly prefer using the newest, smartest models for coding tasks, even if they are more expensive. Giving an old model a ton of reasoning time is more likely to ruin your codebase than suddenly fix all your bugs. And given that most users of coding agents are enterprise customers with large budgets, I’d expect most of them will care about quality and running the most up-to-date models for their coding tasks. That makes the Taalas solution a poor fit.
On the other hand, a lot of the more “banal” consumer use-cases for AI aren’t nearly as sensitive to the underlying model quality. If you’re using an LLM to write high-performance Rust code, help diagnose rare diseases, or analyze the structural integrity of a bridge, you want the smartest model available. If you want an AI boyfriend to vent to or an alt girl AI gf to get off to, you’ll be fine with last year’s version of DeepSeek. I think those consumer use-cases, where price and speed matter more than quality, are going to be Taalas’s bread and butter.
There’s another, more technical reason why I think Taalas’s chips are going to be better suited to ultra-low-cost consumer use-cases, rather than enterprise use-cases: how large context lengths affect the size of a model’s KV cache.
KV cache and context length
If you want to use an LLM for coding, you usually need to feed it a ton of input – usually, a significant portion of your codebase. It also usually generates fairly long reasoning traces, where the model talks itself through a complex problem before presenting a solution. These both pose challenges for Taalas’s architecture.
Transformer models consist of two main compute-heavy operations, each containing a ton of matrix multiplications: feed-forward layers, and multi-head attention. In the feed-forward layers, input data is multiplied by weights; because weights are fixed after training, Taalas can store them in ROM and access them extremely quickly and efficiently. In the multi-head attention layers, however, each token is transformed into query (Q), key (K), and value (V) vectors. The keys and values for every previous token have to be kept around – this is the KV cache – so that each newly generated token can attend to them, and the cache grows with the length of the input sequence. Because the KV cache changes with every input, it has to be stored in on-chip SRAM, rather than being pre-stored in ROM like the model weights.
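To make the asymmetry concrete, here’s a minimal toy sketch of single-head attention during decoding. The dimensions and weights are made up for illustration; the point is that the query is computed fresh for each token and then discarded, while the K and V rows accumulate:

```python
import numpy as np

# Toy single-head attention decode loop, to show why the KV cache grows.
# Dimensions are tiny and weights random; real models have many heads/layers.
d = 4                                   # head dimension (toy value)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []               # this is what must live in SRAM

def decode_step(x):
    """Attend the new token x against every cached K/V row."""
    q = x @ Wq                          # query: computed fresh, then discarded
    k_cache.append(x @ Wk)              # keys/values: kept for all later steps
    v_cache.append(x @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)         # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over the whole history
    return weights @ V                  # attention output for this token

for step in range(5):
    out = decode_step(rng.standard_normal(d))

print(len(k_cache))                     # one cached K row per token: 5
```

The cache lists grow by one row per generated token, which is exactly the state that can’t be baked into ROM.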
For extremely long context lengths, Taalas’ chips will start to be bottlenecked by the KV cache operations in SRAM, even if their feed-forward layers run extremely quickly. These long context lengths show up all the time in enterprise use-cases (legal briefs, medical reports, and especially code), but not in consumer use-cases. This is just further evidence that Taalas’s niche may be delivering models for consumers to talk to, rather than for enterprises to develop software with.
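The scale of the problem is easy to estimate. Using the published Llama 3.1 8B architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and assuming an fp16 cache, the KV cache grows like this:

```python
# Rough KV cache size for Llama 3.1 8B at various context lengths.
# Architecture numbers are from the published model config;
# the fp16 (2-byte) cache is an assumption.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2

def kv_cache_bytes(context_len):
    # 2x accounts for storing both keys and values
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (1_000, 32_000, 128_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"{ctx:>7} tokens -> {gb:5.1f} GB of KV cache")
```

At the model’s full 128K context, the cache alone runs to roughly 17 GB under these assumptions – several times larger than the ~3 GB of weights baked into the chip.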
However, it’s unclear to me whether consumer AI is going to be nearly as large of a market as enterprise AI. And Taalas is paying a massive cost to deliver chips that are so fast.
The logical conclusion of “speed at any cost”
Taalas chips may be much more power efficient than their competitors, but the total cost of ownership (TCO) of a Taalas system is truly staggering. The Taalas HC1 may just be one chip, but a larger model’s weights won’t fit on a single chip. Instead, you need to manufacture multiple unique chips for different sections of the model. Taalas’s own predictions assume that a model like DeepSeek R1 would require 30 unique chip tape-outs.
Now, if you stay up-to-date with the industry or read SemiAnalysis, you’re probably gasping at the astronomical cost of 30 tapeouts. A cutting-edge tapeout normally costs tens of millions of dollars – with those numbers, a Taalas system would cost hundreds of millions in mask sets alone, before you even manufacture a single chip. But they’re doing something clever. They can change the weights, matrix dimensions, and other various control parameters with only two mask layers. The other mask layers in the chip, of which there are often dozens, are unchanged. That means that most of the cost of doing many different tapeouts is actually shared, and a Taalas system might cost “only” $100M or less in mask sets to deploy.
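A quick sketch of the mask-sharing arithmetic, with purely illustrative numbers (the mask-set cost and layer counts here are my assumptions, not Taalas’s figures):

```python
# Hypothetical mask-set cost arithmetic. All dollar figures and layer
# counts are illustrative assumptions, not Taalas's actual numbers.
full_mask_set = 20e6        # assumed cost of one full advanced-node mask set
unique_fraction = 2 / 15    # assume only 2 of ~15 layers differ per variant
n_tapeouts = 30             # Taalas's estimate for a DeepSeek R1-class model

naive_cost = n_tapeouts * full_mask_set
shared_cost = full_mask_set + (n_tapeouts - 1) * unique_fraction * full_mask_set

print(f"naive (30 full mask sets): ${naive_cost / 1e6:.0f}M")   # -> $600M
print(f"shared base layers:        ${shared_cost / 1e6:.0f}M")
```

Under these toy assumptions, sharing the base layers cuts the mask bill by roughly 6x, which is how “hundreds of millions” shrinks to the neighborhood of $100M.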
Taalas says that their chips are so efficient that it makes sense to pay this massive upfront cost. If a datacenter leverages Taalas chips, they’re using less power, and therefore spending less money, on each token, and can also use fewer chips to achieve the same throughput. So the longer they use the Taalas system, the bigger its cost advantages are. But because the Taalas systems are hardwired to only use one model, it’s also the case that the longer they have the Taalas system, the more out-of-date it becomes. Taalas says the economics work out if you assume a 1-year lifetime of each Taalas datacenter.
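The shape of that trade-off is easy to model. Here’s a purely illustrative amortization sketch – every number below is made up to show why the assumed lifetime matters so much, not to reflect Taalas’s actual economics:

```python
# Illustrative TCO model: amortized capex + energy cost per million tokens.
# All inputs are invented for illustration.
def cost_per_million_tokens(capex_usd, lifetime_years, watts,
                            tokens_per_sec, usd_per_kwh=0.08):
    hours = lifetime_years * 365 * 24
    total_tokens = tokens_per_sec * hours * 3600
    energy_usd = watts / 1000 * hours * usd_per_kwh
    return (capex_usd + energy_usd) / total_tokens * 1e6

# Same hypothetical system, different assumed useful lifetimes:
for years in (0.5, 1, 3):
    c = cost_per_million_tokens(capex_usd=5e4, lifetime_years=years,
                                watts=500, tokens_per_sec=17_000)
    print(f"{years:>3} yr lifetime -> ${c:.3f} per million tokens")
```

Because the upfront cost dominates, halving the useful lifetime nearly doubles the cost per token – which is why the whole pitch hinges on that 1-year assumption being long enough before the hardwired model goes stale.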
Ultimately, from a technology perspective, Taalas has a new, cool idea. Unlike Etched, there’s no hand-wavey magic here – it makes complete sense how wiring weights into ROM can deliver massive performance increases. But if nobody is willing to use out-of-date LLMs, all those performance gains are for naught. Given that the biggest LLM application right now is coding, which should have a strong preference for the most up-to-date models, I’m not sure if Taalas will be a massive success. But if consumer AI use-cases continue to grow, Taalas hardware could be powering that wave of consumer AI apps that leverage last year’s models running lightning fast.