<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[zach's tech blog]]></title><description><![CDATA[building chips and writing words]]></description><link>https://www.zach.be</link><image><url>https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png</url><title>zach&apos;s tech blog</title><link>https://www.zach.be</link></image><generator>Substack</generator><lastBuildDate>Tue, 28 Apr 2026 13:54:02 GMT</lastBuildDate><atom:link href="https://www.zach.be/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[zach]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[zachbe@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[zachbe@substack.com]]></itunes:email><itunes:name><![CDATA[zach]]></itunes:name></itunes:owner><itunes:author><![CDATA[zach]]></itunes:author><googleplay:owner><![CDATA[zachbe@substack.com]]></googleplay:owner><googleplay:email><![CDATA[zachbe@substack.com]]></googleplay:email><googleplay:author><![CDATA[zach]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Crossbar’s new security chip isn’t actually secure.]]></title><description><![CDATA[Designing tamper-proof chips is hard.]]></description><link>https://www.zach.be/p/crossbars-new-security-chip-isnt</link><guid isPermaLink="false">https://www.zach.be/p/crossbars-new-security-chip-isnt</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 16 Mar 2026 17:38:09 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://www.crossbar-inc.com/">Crossbar, Inc</a> is a company that was founded to commercialize <a href="https://en.wikipedia.org/wiki/Resistive_random-access_memory">resistive random access memory (RRAM)</a>, an emerging nonvolatile memory technology that&#8217;s one of the candidates to replace <a href="https://en.wikipedia.org/wiki/Flash_memory">eFlash</a> in modern process nodes. In the past few years, they&#8217;ve begun focusing on RRAM&#8217;s unique security properties. Compared to flash memory, it&#8217;s more robust to <a href="https://en.wikipedia.org/wiki/Laser_voltage_prober">laser voltage probing attacks</a>, and it can be used as a <a href="https://en.wikipedia.org/wiki/Physical_unclonable_function">physically unclonable function</a> to generate stable, secure random keys inside the chip.</p><p>Licensing secure memory and PUF IP isn&#8217;t super lucrative, though, so Crossbar decided to vertically integrate and start designing their own security chips and <a href="https://www.crossbar-inc.com/phsm.html">hardware products that integrate those chips</a>. But Crossbar didn&#8217;t want to design their chip in the way that most security chips are designed: closed-source and highly secretive. 
Instead, they worked with the <a href="https://www.bunniestudios.com/">legendary hardware hacker Bunnie Huang</a> to partially open-source their chip design.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> This is really cool, and it&#8217;s exciting to see more companies embracing open-source silicon; <a href="https://github.com/baochip/baochip-1x">you can go look at all the RTL right here</a>.</p><p>But there&#8217;s one issue with making your silicon open-source: everybody can see your mistakes. That includes people like me, who have been designing highly secure hardware for years. And Crossbar&#8217;s implementation of masked AES has a significant flaw that makes it fundamentally insecure against side-channel attacks. Today, we&#8217;re going to be looking at what went wrong, and how they could have done better.</p><h1>What makes a chip side-channel secure?</h1><p>Whenever the value on a wire in a chip switches from a 0 to a 1 or vice-versa, it consumes a small amount of power.  If you&#8217;re trying to design a secure chip that&#8217;s processing sensitive cryptographic keys, this can be a problem. 
For example, let&#8217;s say you want to send a key from one part of a chip to another along a <a href="https://en.wikipedia.org/wiki/Bus_(computing)">bus</a>. The amount of power consumed by that transaction is going to be related to the value of that cryptographic key. If an attacker measures the power consumption of this chip over a lot of transactions, they can reconstruct the value of the secret key. This is called a power analysis side-channel attack.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JTDs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JTDs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 424w, https://substackcdn.com/image/fetch/$s_!JTDs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 848w, https://substackcdn.com/image/fetch/$s_!JTDs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 1272w, https://substackcdn.com/image/fetch/$s_!JTDs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!JTDs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png" width="848" height="267" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/257911ff-3f18-497c-9f3a-bb373a082058_848x267.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:267,&quot;width&quot;:848,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JTDs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 424w, https://substackcdn.com/image/fetch/$s_!JTDs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 848w, https://substackcdn.com/image/fetch/$s_!JTDs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 1272w, https://substackcdn.com/image/fetch/$s_!JTDs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F257911ff-3f18-497c-9f3a-bb373a082058_848x267.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Luckily, there are ways to prevent these side-channels. 
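</p><p>To make the threat concrete before getting to the countermeasures: the bus attack described above can be simulated in a few lines of Python. This is a deliberately simplified toy model (leakage equals the Hamming weight of the byte on the bus, plus noise), not a model of Crossbar&#8217;s chip, but it shows how correlating key guesses against power measurements recovers the key.</p>

```python
import random

random.seed(1234)

def hamming_weight(x):
    return bin(x).count("1")

# Toy leakage model: driving a byte onto the bus consumes power roughly
# proportional to the byte's Hamming weight, plus measurement noise.
def measure_power(byte):
    return hamming_weight(byte) + random.gauss(0, 1.0)

SECRET_KEY = 0x3C

# The attacker feeds in known plaintext bytes and records the power
# consumed when plaintext XOR key crosses the bus.
plaintexts = [random.randrange(256) for _ in range(5000)]
traces = [measure_power(p ^ SECRET_KEY) for p in plaintexts]

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# For each of the 256 key guesses, predict the leakage of every transfer
# and correlate the predictions with the measured traces; the correct
# guess correlates best.
recovered = max(
    range(256),
    key=lambda g: correlation([hamming_weight(p ^ g) for p in plaintexts], traces),
)
assert recovered == SECRET_KEY
```

<p>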
By XOR-ing the key data with random data, the power consumption of sending data along wires becomes randomized.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s-SO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s-SO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 424w, https://substackcdn.com/image/fetch/$s_!s-SO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 848w, https://substackcdn.com/image/fetch/$s_!s-SO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 1272w, https://substackcdn.com/image/fetch/$s_!s-SO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s-SO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png" width="848" height="267" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:267,&quot;width&quot;:848,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s-SO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 424w, https://substackcdn.com/image/fetch/$s_!s-SO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 848w, https://substackcdn.com/image/fetch/$s_!s-SO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 1272w, https://substackcdn.com/image/fetch/$s_!s-SO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f923d5e-9e69-4d23-9383-298fb5b6010f_848x267.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Unfortunately, it&#8217;s harder to mask actual computations this way. If we want to implement a masked AND gate, we can&#8217;t just XOR each input with a mask and then XOR the output by the same mask to recover the desired output value. We have to be more clever.</p><p>In 2003, Elena Trichina developed <a href="https://eprint.iacr.org/2003/236.pdf">an implementation of an AND gate that was supposed to be secure against side channel attacks</a>. Specifically, the Trichina AND gate was &#8220;1-probe secure&#8221;: the value of the secret key never appeared directly on a wire in the chip, so a single probe connected to any wire on the chip could not recover it. 
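</p><p>The Trichina construction is easy to write down. The sketch below (my own Python rendering of the idea, not Crossbar&#8217;s RTL) computes (a AND b) XOR mq from masked shares; evaluated strictly left to right, no intermediate value ever equals the secret a AND b.</p>

```python
from itertools import product

def trichina_and(ax, ma, bx, mb, mq):
    """Trichina-style masked AND.

    ax = a ^ ma and bx = b ^ mb are the masked inputs; ma, mb, mq are
    uniformly random mask bits. Returns (a & b) ^ mq without any
    intermediate term equaling the unmasked a & b.
    """
    t = mq              # start from the fresh output mask
    t ^= ma & mb
    t ^= ma & bx
    t ^= ax & mb
    t ^= ax & bx        # the cross terms cancel, leaving (a & b) ^ mq
    return t

# Exhaustive correctness check over all input and mask combinations.
for a, b, ma, mb, mq in product((0, 1), repeat=5):
    assert trichina_and(a ^ ma, ma, b ^ mb, mb, mq) ^ mq == a & b
```

<p>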
That (roughly) meant that the power consumption of the chip wouldn&#8217;t be proportional to the secret key.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> This notion is important: <strong>for a chip to be secure against power analysis, the real value of the secret key should never appear directly on a wire.</strong></p><p>Crossbar&#8217;s implementation is similar to Elena Trichina&#8217;s design, with some minor improvements. Unfortunately, this kind of design was shown to be insecure as early as 2005.</p><h1>Why isn&#8217;t Crossbar&#8217;s implementation secure?</h1><p>The physical logic gates and wires that make up a digital circuit all have different amounts of propagation delay through them; data doesn&#8217;t flow through these circuits instantly. This can result in behavior called &#8220;<a href="https://en.wikipedia.org/wiki/Glitch_removal">glitches</a>&#8221;, where intermediate values in a digital circuit transition from 0 to 1 and back multiple times during a single cycle as values propagate through the logic. 
In practice, glitches account for a significant portion of a chip&#8217;s power consumption&#8230; so they should also contribute to side-channel leakage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kakc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kakc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 424w, https://substackcdn.com/image/fetch/$s_!kakc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 848w, https://substackcdn.com/image/fetch/$s_!kakc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 1272w, https://substackcdn.com/image/fetch/$s_!kakc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kakc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png" width="342" height="395.352" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:289,&quot;width&quot;:250,&quot;resizeWidth&quot;:342,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kakc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 424w, https://substackcdn.com/image/fetch/$s_!kakc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 848w, https://substackcdn.com/image/fetch/$s_!kakc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 1272w, https://substackcdn.com/image/fetch/$s_!kakc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F292ca72c-52a6-4187-944b-5d9cfd82b57c_250x289.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of a glitch in a logic circuit. <em>(Wikipedia)</em></figcaption></figure></div><p>In early 2005, <a href="https://link.springer.com/chapter/10.1007/978-3-540-30574-3_24">a paper</a> confirmed that these sorts of glitches actually do cause side-channel leakage. By late 2005, glitching power consumption had been used to <a href="http://link.springer.com/content/pdf/10.1007/11545262_12.pdf">compromise masked AES implementations in practice</a>.</p><p>These glitching attacks work because, even when the value of the secret key is never exposed as the final state of a wire, it can become exposed as the intermediate state of a wire during a glitching event. This has the effect of compromising 1-probe security, which means that the power consumption of the chip starts to be correlated to the value of the secret key. 
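</p><p>Here&#8217;s a toy illustration of why (a simplified model, not Crossbar&#8217;s actual netlist). In a Trichina-style AND gate, suppose propagation delays let the three product terms involving the masked inputs combine before the fresh output mask arrives. Algebraically, that transient equals (a AND b) XOR (ma AND mb), and since ma AND mb is 0 three times out of four, the transient matches the secret about 75% of the time instead of 50%:</p>

```python
import random

random.seed(0)

def glitch_transient(a, b, ma, mb):
    # Masked shares, as in a Trichina-style AND gate.
    ax, bx = a ^ ma, b ^ mb
    # Suppose delays let these three product terms XOR together before the
    # fresh mask mq (and the ma & mb term) arrive. Algebraically, the
    # transient equals (a & b) ^ (ma & mb).
    return (ma & bx) ^ (ax & mb) ^ (ax & bx)

# Measure how often the transient matches the secret a & b over random masks.
trials = 100_000
for a, b in [(0, 0), (1, 1)]:
    secret = a & b
    hits = sum(
        glitch_transient(a, b, random.getrandbits(1), random.getrandbits(1)) == secret
        for _ in range(trials)
    )
    # ma & mb is 0 with probability 3/4, so the transient is biased toward
    # the secret: ~75% agreement instead of the 50% a secure gate would show.
    assert 0.73 < hits / trials < 0.77
```

<p>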
This means that Crossbar&#8217;s chip isn&#8217;t actually secure against power analysis attacks.</p><p>There are <a href="https://www.zach.be/p/the-basics-of-side-channel-analysis?utm_source=publication-search">various modern techniques</a> that hardware designers use to implement side-channel secure algorithms even in the presence of glitches. The most common technique I&#8217;ve seen in industry is <a href="https://eprint.iacr.org/2016/486.pdf">domain-oriented masking (DOM)</a>, which inserts register stages into a design to force signals to arrive in a certain order, preventing the glitches that would leak data. Applying DOM to a circuit increases its latency, but overall, it&#8217;s a fairly resource-efficient way to implement a masked algorithm. Unfortunately, Crossbar chose not to use DOM or a similar glitch-resistant masking scheme in their design.</p><p>I have to give kudos to Crossbar for open-sourcing their RTL for this chip. So much of the silicon we rely on every day is closed source -- especially security chips. But open-source security silicon poses some challenges. This isn&#8217;t like software, where bugs can be fixed with a simple update. Crossbar is putting these chips, which have clear security flaws, <a href="https://www.crossbar-inc.com/phsm.html">into real products</a>, and expecting people to use them to store their passkeys and custody their cryptocurrency. I respect Crossbar for trying to make open-source silicon, but they&#8217;re missing the collaborative spirit that makes open-source development work so effectively.</p><h1>On open-source silicon.</h1><p>The right way to build open-source silicon isn&#8217;t just to take your design and publish the RTL online. It&#8217;s important to collaborate with the broader open-source community, which is aware of these sorts of attacks and how to prevent them. 
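</p><p>To see what the register stages buy you, here is a first-order DOM AND gate sketched in Python (my own simplification of the construction in the DOM paper, not production RTL). Each input is split into two shares belonging to separate &#8220;domains&#8221;; the cross-domain products are remasked with a fresh random bit and captured in a register before the final XOR, so a glitch can never combine both shares of a secret inside one combinational cone.</p>

```python
from itertools import product

def dom_and(a0, a1, b0, b1, z):
    """First-order DOM AND: inputs shared as a = a0 ^ a1, b = b0 ^ b1.

    z is a fresh random bit. Returns output shares (c0, c1) with
    c0 ^ c1 == a & b.
    """
    # Cross-domain products are remasked with z; in hardware these values
    # are registered here, forming the glitch barrier.
    reg0 = (a0 & b1) ^ z
    reg1 = (a1 & b0) ^ z
    # After the register stage, each domain combines only its own terms.
    c0 = (a0 & b0) ^ reg0
    c1 = (a1 & b1) ^ reg1
    return c0, c1

# Exhaustive correctness check over all sharings and fresh masks.
for a, b, a1, b1, z in product((0, 1), repeat=5):
    c0, c1 = dom_and(a ^ a1, a1, b ^ b1, b1, z)
    assert c0 ^ c1 == a & b
```

<p>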
OpenTitan has <a href="https://opentitan.org/book/hw/ip/aes/doc/theory_of_operation.html#side-channel-analysis">a side-channel secure AES implementation</a> that leverages <a href="https://opentitan.org/book/hw/ip/aes/doc/theory_of_operation.html#side-channel-analysis">domain-oriented masking</a>. On the topic of a fully combinational masked S-box, like the one Crossbar is using, they say <a href="https://opentitan.org/book/hw/ip/aes/doc/theory_of_operation.html#side-channel-analysis">this</a> (emphasis mine):</p><blockquote><p>Alternatively, the two original versions of the masked Canright S-Box can be chosen as proposed by<a href="https://eprint.iacr.org/2009/011.pdf"> Canright and Batina: &#8220;A very compact &#8220;perfectly masked&#8221; S-Box for AES (corrected)&#8221;.</a> These are fully combinational (one S-Box evaluation every cycle) and have lower area footprint, but they are significantly less resistant to SCA. They are mainly included for reference but <strong>their usage is discouraged due to potential vulnerabilities</strong> to the correlation-enhanced collision attack as described by<a href="https://eprint.iacr.org/2010/297.pdf"> Moradi et al.: &#8220;Correlation-Enhanced Power Analysis Collision Attack&#8221;.</a></p></blockquote><p>If Crossbar had leveraged OpenTitan&#8217;s DOM AES implementation, they would not only have a more secure design; they would also have provided additional silicon proof of an implementation that&#8217;s gathering support among many different open-source silicon developers. Instead, Crossbar developed their own implementation, with flaws, and released it. 
Again, I respect them for making their silicon open source, but I hope open-source silicon developers in the future work more closely with the open-source community to ensure their designs are secure.</p><p>And one last note before I go: I would discourage any readers from buying any Crossbar PHSM devices due to this security flaw.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p> <a href="https://www.bunniestudios.com/blog/2026/baochip-1x-a-mostly-open-22nm-soc-for-high-assurance-applications/#comments">Bunnie didn&#8217;t work on the cryptographic hardware accelerators</a>, and this post is not a critique of him or any of his work on the Baochip-1x project. 
It is, however, a critique of Crossbar, because a company selling security chips should not be making this mistake.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The d-probing model and how it relates to power side-channels is way beyond the scope of this article. Another time!</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Taalas is what Etched should have been.]]></title><description><![CDATA[I&#8217;m generating 17,000 tokens per second, and they&#8217;re all wrong!]]></description><link>https://www.zach.be/p/taalas-is-what-etched-should-have</link><guid isPermaLink="false">https://www.zach.be/p/taalas-is-what-etched-should-have</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 09 Mar 2026 12:55:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been an outspoken critic of Etched, the chip company that claims to be making AI chips that beat Nvidia by aggressively focusing on accelerating transformers and reducing programmability. This is a bad idea, because Nvidia is already focused on accelerating transformers, and they have more money and talent than Etched.</p><p>Recently, though, a new chip company called <a href="https://taalas.com/">Taalas</a> announced <a href="https://taalas.com/products/">their first silicon</a>. They&#8217;re also focusing on accelerating transformers by reducing programmability, but in a much, much more aggressive way. 
Etched wanted to build a chip that could only accelerate transformers; Taalas built a chip that could only <a href="https://taalas.com/the-path-to-ubiquitous-ai/">accelerate a specific transformer model, namely Llama 3.1 8B</a>. That specific model is hardwired into read-only memory (ROM) inside the chip, and can&#8217;t be changed.</p><p>This is a huge cost to pay. If you want to run a new model, you have to throw away the old chip and get a new one. But the benefits are also enormous. Taalas&#8217; HC1 proof-of-concept chip generates <a href="https://taalas.com/the-path-to-ubiquitous-ai/">17,000 tokens per second</a>, over two orders of magnitude more than their competitors. It&#8217;s truly absurdly fast, so naturally, people started going nuts about the tech demo. Today, let&#8217;s look at Taalas, separate the hype from the truth, and determine whether baking your model into ROM is actually a good idea.</p><h1>Things people are getting wrong about Taalas</h1><p>In the excitement about the Taalas announcement, a lot of people have gotten a lot of things wrong. 
Before we jump into where I think Taalas could carve out a potential market niche, I want to say what they definitely aren&#8217;t going to be able to do. First off: Taalas is never going to be a chip for local inference on consumer devices.</p><p>While the Taalas chip is smaller and lower-power than an all-SRAM chip like the ones Groq, d-Matrix, or Cerebras make, it&#8217;s still not well-suited for edge devices. These devices care about memory density more than compute. For example, the HC1 stores Llama 3.1 8B, an 8 billion parameter model, with 3 bits per parameter, on an 800mm^2, reticle-size logic die. That&#8217;s about 3 gigabytes of storage, and requires a large PCIe card&#8217;s worth of packaging, power delivery, and cooling. My lower-end iPhone, on the other hand, has 256GB of storage, so it can easily store and run Llama 3.1 8B alongside all of the pictures I&#8217;ve taken of my cats. It&#8217;ll just be much slower, because weights will be swapped around between the SSD, DRAM, and on-chip SRAM, but it&#8217;s much more practical than strapping a PCIe card to the back of your smartphone or laptop.</p><p>I also don&#8217;t think this is as big of a game-changer for coding agents as some people think it is. At least in my experience, developers significantly prefer using the newest, smartest models for coding tasks, even if they are more expensive. Giving an old model a ton of reasoning time is more likely to ruin your codebase than suddenly fix all your bugs. And given that most users of coding agents are enterprise customers with large budgets, I&#8217;d expect most of them will care about quality and running the most up-to-date models for their coding tasks. That makes the Taalas solution a poor fit.</p><p>On the other hand, a lot of the more &#8220;banal&#8221; consumer use-cases for AI aren&#8217;t nearly as sensitive to the underlying model quality. 
If you&#8217;re using an LLM to write high-performance Rust code, help diagnose rare diseases, or analyze the structural integrity of a bridge, you want the smartest model available. If you want an AI boyfriend to vent to or an alt girl AI gf to get off to, you&#8217;ll be fine with last year&#8217;s version of DeepSeek. I think those consumer use-cases, where price and speed matter more than quality, are going to be Taalas&#8217;s bread and butter.</p><p>There&#8217;s another, more technical reason why I think Taalas&#8217;s chips are going to be better suited to ultra-low-cost consumer use-cases, rather than enterprise use-cases: how large context lengths affect the size of a model&#8217;s KV cache.</p><h1>KV cache and context length</h1><p>If you want to use an LLM for coding, you usually need to feed it a ton of input &#8211; often a significant portion of your codebase. It also usually generates fairly long <a href="https://openai.com/index/learning-to-reason-with-llms/">reasoning traces</a>, where the model talks itself through a complex problem before presenting a solution. These both pose challenges for Taalas&#8217;s architecture.</p><p>Transformer models consist of two main compute-heavy operations, each containing a ton of matrix multiplications: feed-forward layers, and multi-head attention. In the feed-forward layers, input data is multiplied by weights; because weights are fixed after training, Taalas can store them in ROM and access them extremely quickly and efficiently. In the multi-head attention layers, however, the input sequence is transformed to produce key (K), query (Q), and value (V) matrices that get longer as the input sequence grows. 
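To put rough numbers on that growth, here is a back-of-the-envelope sketch. The attention shape (32 layers, 8 grouped-query KV heads, head dimension 128) is Llama 3.1 8B's published configuration; the fp16 cache format is my assumption, since Taalas hasn't disclosed how it stores the KV cache.

```python
# Back-of-the-envelope KV-cache sizing for Llama 3.1 8B.
# Layer/head counts are Meta's published config; the fp16 (2-byte)
# cache format is an assumption, not a Taalas-confirmed detail.

def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of K and V that must sit in fast memory for `tokens` of context."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens  # 2 = K and V

print(kv_cache_bytes(1))                    # 131072 bytes, i.e. 128 KiB per token
print(kv_cache_bytes(128 * 1024) / 2**30)   # 16.0 GiB at a 128K-token context
```

At long contexts, the cache alone dwarfs the roughly 3 GB of weights baked into ROM, which is exactly the pressure on on-chip SRAM discussed here.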
Because these K, Q and V matrices change with the input sequence, they have to be stored in on-chip SRAM, rather than being pre-stored in ROM like the model weights.</p><p>For extremely long context lengths, Taalas&#8217; chips will start to be bottlenecked by the KV cache operations in SRAM, even if their feed-forward layers run extremely quickly. These long context lengths show up all the time in enterprise use-cases (legal briefs, medical reports, and especially code), but not in consumer use-cases. This is just further evidence that Taalas&#8217;s niche may be delivering models for consumers to talk to, rather than for enterprises to develop software with.</p><p>However, it&#8217;s unclear to me whether consumer AI is going to be nearly as large of a market as enterprise AI. And Taalas is paying a massive cost to deliver chips that are so fast.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/taalas-is-what-etched-should-have?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/taalas-is-what-etched-should-have?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/taalas-is-what-etched-should-have?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>The logical conclusion of &#8220;speed at any cost&#8221;</h1><p>Taalas chips may be much more power efficient than their competitors, but the total cost of ownership (TCO) of a Taalas system is truly staggering. 
The Taalas HC1 may just be one chip, but if you want to implement a larger model using Taalas&#8217;s technology, its weights won&#8217;t all fit on a single chip. Instead, you need to manufacture multiple unique chips for different sections of the model. Taalas&#8217;s own predictions assume that a model like DeepSeek R1 would require <a href="https://www.eetimes.com/taalas-specializes-to-extremes-for-extraordinary-token-speed/">30 unique chip tapeouts.</a></p><p>Now, if you stay up-to-date with the industry or <a href="https://semianalysis.com/2022/07/24/the-dark-side-of-the-semiconductor/?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web">read SemiAnalysis</a>, you&#8217;re probably gasping at the astronomical cost of 30 tapeouts. A cutting-edge tapeout normally costs tens of millions of dollars &#8211; with those numbers, a Taalas system would cost hundreds of millions in <a href="https://en.wikipedia.org/wiki/Photomask">mask sets</a> alone, before you even manufacture a single chip. But they&#8217;re doing something clever. They can change the weights, matrix dimensions, and various other control parameters with only two mask layers. The other mask layers in the chip, of which there are often over 10, are unchanged. That means that most of the cost of doing many different tapeouts is actually shared, and a Taalas system might cost &#8220;only&#8221; $100M or less in mask sets to deploy.</p><p>Taalas says that their chips are so efficient that it makes sense to pay this massive upfront cost. If a datacenter leverages Taalas chips, they&#8217;re spending less power, and therefore less money, on each token, and can also use fewer chips to achieve the same throughput. So the longer they use the Taalas system, the bigger its cost advantages are. But because the Taalas systems are hardwired to only use one model, it&#8217;s also the case that the longer they have the Taalas system, the more out-of-date it becomes. 
Taalas says the economics work out <a href="https://www.eetimes.com/taalas-specializes-to-extremes-for-extraordinary-token-speed/">if you assume a 1-year lifetime of each Taalas datacenter.</a></p><p>Ultimately, from a technology perspective, Taalas has a new, cool idea. Unlike Etched, there&#8217;s no hand-wavey magic here &#8211; it makes complete sense how wiring weights into ROM can deliver massive performance increases. But if nobody is willing to use out-of-date LLMs, all those performance gains are for naught. Given that the biggest LLM application right now is coding, which should have a strong preference for the most up-to-date models, I&#8217;m not sure if Taalas will be a massive success. But if consumer AI use-cases continue to grow, Taalas hardware could be powering that wave of consumer AI apps that leverage last year&#8217;s models running lightning fast.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why is OpenAI partnering with Cerebras?]]></title><description><![CDATA[TSMC's manufacturing capacity rules everything around me.]]></description><link>https://www.zach.be/p/why-is-openai-partnering-with-cerebras</link><guid isPermaLink="false">https://www.zach.be/p/why-is-openai-partnering-with-cerebras</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 02 Feb 2026 16:09:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few weeks ago, OpenAI announced a massive new partnership with Cerebras, the wafer-scale AI chip company. This came as a big surprise to those of us who have been watching Cerebras for years -- the AI chip startup had, until recently, <a href="https://www.zach.be/p/why-is-cerebras-trying-to-go-public">no major customers other than the UAE-based G42, who was also an investor in the company</a>. That lack of customers primarily stemmed from the total cost of ownership (TCO) of Cerebras systems.</p><p>Like other AI accelerators that lack high-bandwidth memory (HBM), Cerebras needed to network together <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective">many chips to host a single large model</a>. 
The relatively small Llama 3.1 70B LLM takes 4 racks of Cerebras wafer-scale systems, <a href="http://www.zach.be/p/why-is-cerebras-trying-to-go-public">each of which costs between two and three million dollars</a>. An Nvidia DGX H100 can run the same workload at a tenth of the cost. And the TCO difference only grows for larger models, which require more and more Cerebras chips to run without p<a href="http://www.zach.be/p/most-ai-chips-arent-cost-effective">erformance plummeting due to model weights being moved off-chip</a>.</p><p>Cerebras would claim that this cost overhead is worth it for the ultra-low latency that they provide. They&#8217;re not wrong that Cerebras chips, by storing all of their data in SRAM, offer significantly lower inference latency than equivalent Nvidia solutions. The same is true for other SRAM-based inference chips, like <a href="http://www.zach.be/p/most-ai-chips-arent-cost-effective">d-Matrix</a> and the <a href="http://www.zach.be/p/why-did-nvidia-acqui-hire-groq">recently-acquired Groq</a>. But players like <a href="http://www.zach.be/p/why-is-sambanova-giving-up-on-ai">SambaNova</a> and <a href="https://www.zach.be/p/ai-chip-startups-dont-need-to-be">Furiosa</a> also offer relatively high-speed inference without the same cost tradeoff, due to their inclusion of HBM. 
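To see where numbers like &#8220;4 racks&#8221; come from, here is a rough capacity sketch. The ~44 GB of on-chip SRAM per WSE-3 wafer is Cerebras&#8217;s headline figure; fp16 weights and one wafer-scale system per rack are my simplifications, and a real deployment also needs room for activations and KV cache, so this is a lower bound.

```python
import math

# Rough sketch of why SRAM-only inference needs so many systems.
# Assumes fp16 (2-byte) weights and ~44 GB of on-chip SRAM per
# Cerebras WSE-3 wafer (their published figure). Activations and
# KV cache are ignored, so real deployments need at least this many.

def wafers_needed(params_billions: float, bytes_per_param: int = 2,
                  sram_gb_per_wafer: float = 44.0) -> int:
    weights_gb = params_billions * bytes_per_param
    return math.ceil(weights_gb / sram_gb_per_wafer)

print(wafers_needed(70))   # 4 systems just to hold Llama 3.1 70B's weights
print(wafers_needed(405))  # 19 -> the cost gap explodes for larger models
```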
So if Cerebras&#8217; chips are so expensive, and their low-latency inference isn&#8217;t that unique, why is OpenAI partnering with them?</p><p>In my opinion, what actually makes Cerebras unique is that, after the Groq acquisition, they are the only major AI chip startup that has already deployed chips at scale that isn&#8217;t limited by TSMC&#8217;s chip-on-wafer-on-substrate (CoWoS) manufacturing capacity.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>CoWoS, HBM, and SRAM</h1><p>As I discussed in <a href="https://www.zach.be/p/why-did-nvidia-acqui-hire-groq">my article on Nvidia&#8217;s Groq acquisition</a>, Nvidia&#8217;s chip manufacturing volume is limited by TSMC&#8217;s CoWoS manufacturing capacity. CoWoS packaging is expensive, difficult, and has limited production volume. Nvidia isn&#8217;t the only company fighting for that limited volume; <a href="https://www.tomshardware.com/news/amd-and-nvidia-gpus-consume-lions-share-of-tsmc-cowos-capacity">AMD uses CoWoS packaging for their AI chips as well</a>. Plus, startups are trying to leverage CoWoS, with <a href="https://arxiv.org/abs/2405.07518v1">SambaNova also using CoWoS packaging</a>, and other startups like Furiosa, Etched, and MatX probably will as well. 
CoWoS capacity shortages are the norm, and <a href="https://www.tomshardware.com/news/amd-and-nvidia-gpus-consume-lions-share-of-tsmc-cowos-capacity">are predicted to persist for years</a>.</p><p>CoWoS packaging is key to the HBM-based architectures that Nvidia, AMD, and startups like SambaNova and Furiosa are selling. With massive, low-latency banks of HBM, these chips can store LLM weights, biases, and KV cache values off-chip while still accessing them relatively quickly. This high memory density enables a datacenter to serve small models like Llama 3.1 70B with less than one rack of servers, offering a cost-efficient way to deliver inference.</p><p>On the other hand, SRAM-based architectures, like the ones developed by <a href="https://www.zach.be/p/why-is-cerebras-trying-to-go-public?utm_source=publication-search">Cerebras</a>, <a href="https://www.zach.be/p/why-did-nvidia-acqui-hire-groq?utm_source=publication-search">Groq</a>, and <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective?utm_source=publication-search">d-Matrix</a>, have to store all that data on-chip. On-chip SRAM is far, far less dense and far, far more expensive than off-chip DRAM like HBM. That&#8217;s why it takes over ten million dollars to build out a Cerebras system for even a small model. At the same time, SRAM is much faster than HBM, which lets these chips run inference workloads faster than their HBM counterparts.</p><p>For years, analysts and the market have been looking at this tradeoff and deciding it wasn&#8217;t worth it. Nobody wanted to pay orders of magnitude more for a system that was slightly faster. That&#8217;s why <a href="https://www.zach.be/p/why-groq-might-struggle-as-an-api?utm_source=publication-search">Groq had to pivot from selling chips, which nobody would buy, to running their own inference clusters</a>. 
It&#8217;s also why <a href="https://www.zach.be/p/why-is-cerebras-trying-to-go-public?utm_source=publication-search">Cerebras&#8217; only major customer was G42</a>, the Emirati company that was also one of their major investors.</p><p>But the limited supply of CoWoS packaging changes the equation. If every bit of TSMC capacity is used for Nvidia and AMD chips that are already flying off the shelves, what do you do when you want even more compute? You might need to start buying chips with a worse cost-to-performance tradeoff.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-is-openai-partnering-with-cerebras?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-is-openai-partnering-with-cerebras?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/why-is-openai-partnering-with-cerebras?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>The AI bubble doesn&#8217;t care about cost</h1><p><a href="https://www.nytimes.com/2026/01/29/technology/openai-in-talks-to-raise-as-much-as-100-billion.html">OpenAI is in talks to raise as much as $100 billion dollars</a>. <a href="https://techcrunch.com/2026/01/27/anthropic-reportedly-upped-its-latest-raise-to-20b/">Anthropic is raising $20 billion</a>. And existing tech giants like Meta and Google are pouring countless billions into their own AI efforts. Put simply, the AI giants have more money than they know what to do with. 
And many of them see the AI race as a winner-take-all effort. Any small advantage now could be worth trillions of dollars in the future if it gives a company a substantial competitive edge in developing and deploying the next models sooner.</p><p>What Cerebras is offering is just that. It&#8217;s not actually a better chip than what Nvidia has -- in fact, it offers a worse cost-to-performance tradeoff. But the demand for AI chips is so high that &#8220;almost as good as Nvidia but still worse&#8221; has become a multi-billion dollar value proposition. Every single one of Nvidia&#8217;s chips is already being deployed in AI datacenters, and they can&#8217;t make more unless TSMC adds additional CoWoS manufacturing capacity. Cerebras is able to fill the excess demand for compute that exists past that limited TSMC manufacturing capacity.</p><p>As a brief aside &#8212; Cerebras is not the only player offering all-SRAM inference chips. <a href="https://www.zach.be/p/why-did-nvidia-acqui-hire-groq">But Groq was just acquired by Nvidia</a>, and the only other player in this space, d-Matrix, doesn&#8217;t have chips deployed at scale in the way Cerebras does. If OpenAI wants to source additional compute, even at a relatively high cost, it makes sense to pick the player with the lowest risk. In this case, that&#8217;s Cerebras.</p><p>Ultimately, as the demand for compute grows, the cost-efficiency and power-efficiency of different chips start to matter less and less, and the availability of those chips starts to matter more. This explains <a href="http://www.zach.be/p/ai-chip-startups-dont-need-to-be">Nvidia&#8217;s acquisition of Groq</a>, and it explains OpenAI&#8217;s Cerebras deal. Sure, these chips might be slightly worse than Nvidia&#8217;s H100s and B200s at serving large models at scale, but Nvidia&#8217;s chips are a limited resource. 
When the demand for AI chips is so high that Nvidia is running out of stock, &#8220;almost as good as Nvidia&#8221; becomes a viable strategy.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why did AMD’s random number generator start returning 0x0000?]]></title><description><![CDATA[And why cryptographic randomness matters.]]></description><link>https://www.zach.be/p/why-did-amds-random-number-generator</link><guid isPermaLink="false">https://www.zach.be/p/why-did-amds-random-number-generator</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 05 Jan 2026 18:20:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eu3h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If there&#8217;s one thing we love here at Zach&#8217;s Tech Blog, it&#8217;s <a href="https://www.zach.be/p/how-did-amds-secure-enclave-get-hacked">weird</a> security <a href="https://www.zach.be/p/the-most-secure-chip-in-the-world">vulnerabilities</a> in <a href="https://www.zach.be/p/apples-m-series-cpus-just-got-hacked">processors</a>. 
So when I saw that there was a <a href="https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7055.html">security flaw in AMD&#8217;s Zen 5 processors</a> that made its random seed instruction, RDSEED, start returning far too many zeroes to be truly random, I knew I had to write about it. Plus, it&#8217;s a really great excuse to talk about one of my favorite topics in cryptography: why elliptic curve signatures are really hard to implement securely.</p><p>Generating true random numbers is really important for cryptography. Some algorithms don&#8217;t require particularly special random numbers, but others, like the elliptic curve digital signature algorithm (ECDSA), are incredibly sensitive to the &#8220;quality&#8221; of their randomness. If you&#8217;re using ECDSA, even slight biases in your random numbers can allow attackers to extract your private key, given enough signatures.</p><p>What makes a random number generator good enough for cryptography, then? Well, its output should be computationally indistinguishable from truly random numbers. That&#8217;s usually measured by <a href="https://en.wikipedia.org/wiki/Randomness_test">standard statistical tests</a>. Usually, the best random number generators are so-called true random number generators, or TRNGs, which use randomness extracted from the environment. But the specific flaw AMD disclosed lay within their implementation of the RDSEED instruction, which is supposed to generate random output using their TRNG. This should be the best source of randomness on the chip, but it suddenly started returning far more zeroes than expected. 
So what went wrong?</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>What was wrong with AMD&#8217;s TRNG?</h1><p>Most TRNGs deployed in modern CPUs use some form of analog noise on the chip to generate truly random numbers. While AMD hasn&#8217;t publicly confirmed their TRNG&#8217;s randomness source, it&#8217;s most likely based on <a href="https://en.wikipedia.org/wiki/Hardware_random_number_generator#Free-running_oscillators-based_RNG">free-running ring oscillators</a>. The frequency of free-running ring oscillators changes randomly based on temperature, jitter, power supply noise, and all other sorts of random environmental factors. 
If you sample the state of that oscillator with your much-more-stable system clock, you can generate random bits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eu3h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eu3h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 424w, https://substackcdn.com/image/fetch/$s_!eu3h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 848w, https://substackcdn.com/image/fetch/$s_!eu3h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 1272w, https://substackcdn.com/image/fetch/$s_!eu3h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eu3h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png" width="600" height="379" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:379,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eu3h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 424w, https://substackcdn.com/image/fetch/$s_!eu3h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 848w, https://substackcdn.com/image/fetch/$s_!eu3h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 1272w, https://substackcdn.com/image/fetch/$s_!eu3h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c9b6715-30d9-4b3a-b645-195be6e0b761_600x379.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><a href="https://ieeexplore.ieee.org/document/4731825">Ring Oscillator TRNG from Wold et. al.</a></em></figcaption></figure></div><p>Unfortunately, the bitstream you generate from your ring oscillator is often biased in some way -- a specific chip may randomly sample more ones than zeroes, simply due to the process variation on that piece of silicon. So it&#8217;s common to use some form of <a href="https://en.wikipedia.org/wiki/Randomness_extractor">randomness extraction</a>, conditioning, and <a href="https://en.wikipedia.org/wiki/Whitening_transformation">whitening</a> algorithms to post-process your raw random bits into a set of uncorrelated random bits suitable for use in a cryptographic algorithm. 
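As a toy illustration of that pipeline &#8212; emphatically not AMD&#8217;s actual circuit &#8212; the sketch below samples a simulated jittery oscillator with a stable clock to get raw (possibly biased) bits, then applies one classic conditioning step, a Von Neumann extractor. Every constant here is made up for readability.

```python
import random

def ring_oscillator_bits(n: int, half_period: float = 0.5,
                         jitter: float = 0.02, sample_period: float = 103.3,
                         seed: int = 1234) -> list[int]:
    """Toy model: latch the state of a jittery free-running oscillator
    on each edge of a stable sampling clock."""
    rng = random.Random(seed)
    t_edge, state, bits = 0.0, 0, []
    t_sample = sample_period
    for _ in range(n):
        while t_edge < t_sample:                       # oscillator keeps toggling
            t_edge += max(1e-9, rng.gauss(half_period, jitter))
            state ^= 1
        bits.append(state)                             # sample on the stable clock
        t_sample += sample_period
    return bits

def von_neumann(bits: list[int]) -> list[int]:
    """Debias: pair up bits, discard 00/11, map 01 -> 0 and 10 -> 1."""
    return [a for a, b in zip(bits[0::2], bits[1::2]) if a != b]

raw = ring_oscillator_bits(10_000)
conditioned = von_neumann(raw)
print(sum(raw) / len(raw))                  # raw ones-fraction, may be skewed
print(sum(conditioned) / len(conditioned))  # closer to 0.5 after conditioning
```

Note the cost: the extractor throws away at least half the raw bits, and it only removes bias if the raw bits are independent, which is why real designs add hash-based conditioning on top.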
Modern cryptographic TRNGs usually use a combination of <a href="https://en.wikipedia.org/wiki/Randomness_extractor#Von_Neumann_extractor">Von Neumann conditioning</a> and <a href="https://en.wikipedia.org/wiki/Randomness_extractor#Cryptographic_hash_function">hash based conditioning</a> to post-process their random source.</p><p>The Von Neumann conditioning algorithm is particularly simple and clever -- it groups output bits two-by-two, discards any pair where the bits are the same, and converts 01 into 0 and 10 into 1. This removes simple biases from the raw bitstream:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zlnp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zlnp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 424w, https://substackcdn.com/image/fetch/$s_!zlnp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 848w, https://substackcdn.com/image/fetch/$s_!zlnp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 1272w, https://substackcdn.com/image/fetch/$s_!zlnp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!zlnp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png" width="576" height="162" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:162,&quot;width&quot;:576,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zlnp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 424w, https://substackcdn.com/image/fetch/$s_!zlnp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 848w, https://substackcdn.com/image/fetch/$s_!zlnp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 1272w, https://substackcdn.com/image/fetch/$s_!zlnp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79280ea3-463c-425d-8fb6-4f7ebd72e20a_576x162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em><a 
href="https://pit-claudel.fr/clement/blog/generating-uniformly-random-data-from-skewed-input-biased-coins-loaded-dice-skew-correction-and-the-von-neumann-extractor/">Von Neumann Extractor, from Clement Pit-Claudel</a></em></figcaption></figure></div><p>One big problem with this TRNG architecture is that it&#8217;s just a bit slow. Multiple ring oscillators are combined to produce one raw random bit per cycle. The Von Neumann conditioning step then discards more than half of those bits to remove bias, and the surviving bits still need to pass through a cryptographic hash function for additional post-processing before they can actually be used.</p><p>Because of this, modern CPUs don&#8217;t generate randomness on demand whenever a user invokes the RDSEED instruction; that would be too slow. Instead, CPUs constantly generate random numbers and store them in a buffer, and RDSEED fetches its results from that buffer. But this means that, if a user invokes RDSEED too often, the buffer can run out of random numbers. In that case, the instruction is supposed to return with its carry flag, CF, set to 0, which indicates to software that the randomness buffer was not ready and that it should try again.</p><p>The issue with AMD&#8217;s Zen5 processors is that, for the 16-bit and 32-bit versions of the RDSEED instruction, CF gets set to 1 even if the randomness buffer is empty. And when the randomness buffer is empty, the instruction returns zero rather than a real random value.</p><p>But wait -- is it really that bad if a TRNG returns zero slightly more often than it should? Well, it turns out that it can actually be extremely bad. 
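</p><p>To make the conditioning step concrete, here&#8217;s a toy Python sketch of the Von Neumann extractor described above. It&#8217;s only an illustration (real TRNGs do this in hardware), but it shows how a heavily biased raw source becomes unbiased output, at the cost of throwing most bits away:</p>

```python
import random

def von_neumann(bits):
    """Von Neumann extractor: group bits in pairs, drop 00 and 11,
    map 01 -> 0 and 10 -> 1 (i.e., keep the first bit of unequal pairs)."""
    return [a for a, b in zip(bits[::2], bits[1::2]) if a != b]

random.seed(0)
# A heavily biased raw source: 80% ones, like a badly skewed ring oscillator.
raw = [1 if random.random() < 0.8 else 0 for _ in range(100_000)]

out = von_neumann(raw)
print(f"raw bias:    {sum(raw) / len(raw):.3f}")   # ~0.800
print(f"output bias: {sum(out) / len(out):.3f}")   # ~0.500
print(f"bits kept:   {len(out) / len(raw):.3f}")   # ~0.16 -- most bits are discarded
```

<p>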
In fact, even slight biases in random numbers can be catastrophic.</p><h1>ECDSA and the PlayStation 3</h1><p><a href="https://en.wikipedia.org/wiki/Elliptic-curve_cryptography">Elliptic curve cryptography</a> (ECC) began to displace the <a href="https://en.wikipedia.org/wiki/RSA_cryptosystem">RSA cryptosystem</a> in the mid-2000s because it provides similar security levels with much smaller keys: <a href="https://www.ssl.com/article/comparing-ecdsa-vs-rsa-a-simple-guide/">a 256-bit ECDSA key has comparable security to a 3072-bit RSA key</a>. However, RSA-based signatures have one major advantage over the elliptic curve digital signature algorithm (ECDSA): performing an ECDSA signature requires <a href="https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm#Signature_generation_algorithm">generating an ephemeral random number</a> called a <a href="https://en.wikipedia.org/wiki/Cryptographic_nonce">nonce</a>. 
If that nonce is reused, attackers can easily determine the private signing key, <a href="https://fahrplan.events.ccc.de/congress/2010/Fahrplan/attachments/1780_27c3_console_hacking_2010.pdf">as they famously did to extract the signing key for the PlayStation 3</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UaYn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UaYn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 424w, https://substackcdn.com/image/fetch/$s_!UaYn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 848w, https://substackcdn.com/image/fetch/$s_!UaYn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 1272w, https://substackcdn.com/image/fetch/$s_!UaYn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UaYn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png" width="891" height="565" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14a49939-d3ba-408b-9489-a372f920e053_891x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:565,&quot;width&quot;:891,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UaYn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 424w, https://substackcdn.com/image/fetch/$s_!UaYn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 848w, https://substackcdn.com/image/fetch/$s_!UaYn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 1272w, https://substackcdn.com/image/fetch/$s_!UaYn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14a49939-d3ba-408b-9489-a372f920e053_891x565.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Don&#8217;t be like Sony <a href="https://fahrplan.events.ccc.de/congress/2010/Fahrplan/attachments/1780_27c3_console_hacking_2010.pdf">(fail0verflow, 2010)</a></em></figcaption></figure></div><p>Luckily, it&#8217;s pretty easy to avoid making this specific implementation mistake. Unfortunately, even if you change the nonce with every signature, slight biases in the RNG used to generate the nonce will still leak the private key! Each time an ECDSA key is used to generate a signature, if the nonce is biased, it leaks some information about that key; with enough signatures, an attacker can reconstruct the key. 
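</p><p>The reused-nonce case is simple enough to check with toy numbers in Python. In the sketch below, n is the real P-256 group order, but d, k, z1, and z2 are made-up values, and r is a stand-in for the x-coordinate of k&#183;G that real ECDSA would compute; the key-recovery algebra only involves arithmetic mod n, so the shortcut doesn&#8217;t change the result:</p>

```python
# ECDSA signatures: s = k^-1 * (z + r*d) mod n, where d is the private key,
# k the per-signature nonce, z the message hash, and r derived from k*G.
# n is the genuine P-256 group order; everything else is made up for illustration.
n = 0xFFFFFFFF00000000FFFFFFFFFFFFFFFFBCE6FAADA7179E84F3B9CAC2FC632551
d = 0x1234567890ABCDEF    # the "secret" signing key
k = 0xDEADBEEF            # the nonce -- fatally reused for two signatures
r = pow(7, k, n)          # stand-in for (k*G).x; the attacker sees it repeat

def sign(z):
    return (z + r * d) * pow(k, -1, n) % n

z1, z2 = 1111, 2222                  # hashes of two different messages
s1, s2 = sign(z1), sign(z2)

# The attacker only knows (r, s1, z1) and (r, s2, z2). Since
# s1 - s2 = k^-1 * (z1 - z2), the nonce and then the key fall right out:
k_recovered = (z1 - z2) * pow(s1 - s2, -1, n) % n
d_recovered = (s1 * k_recovered - z1) * pow(r, -1, n) % n
print(hex(d_recovered))
```

<p>Running this prints a d_recovered equal to the original private key d, using nothing but the two public signatures.</p><p>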
This happened in the real world when <a href="https://lists.debian.org/debian-security-announce/2008/msg00152.html">maintainers of the Debian-specific fork of the OpenSSL package modified their PRNG in a way that (hopefully accidentally) introduced significant bias</a>. If a software developer were using AMD&#8217;s RDSEED instruction to generate ECDSA nonces, they could end up signing multiple messages with a nonce of zero, or with a nonce biased toward zero bits, compromising their private keys.</p><p>Thankfully, there are ways to work around the TRNG bug in AMD&#8217;s chips. The 64-bit version of RDSEED isn&#8217;t affected, so software developers can just use that instruction without concern. But it&#8217;s an interesting lesson in how TRNGs work, and in how very minor bugs can cause very major security flaws when they&#8217;re integrated into a full system.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why did Nvidia acqui-hire Groq?]]></title><description><![CDATA[And why not SambaNova, Cerebras, or d-Matrix?]]></description><link>https://www.zach.be/p/why-did-nvidia-acqui-hire-groq</link><guid isPermaLink="false">https://www.zach.be/p/why-did-nvidia-acqui-hire-groq</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Fri, 26 Dec 2025 22:19:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On December 24th, the semiconductor industry got an early Christmas present with the announcement that <a href="https://www.cnbc.com/2025/12/24/nvidia-buying-ai-chip-startup-groq-for-about-20-billion-biggest-deal.html">Nvidia was acquiring Groq</a>. 
The CNBC article breaking the story originally read:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nKve!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nKve!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nKve!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nKve!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nKve!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nKve!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg" width="680" height="284" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:284,&quot;width&quot;:680,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54852,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!nKve!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nKve!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nKve!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nKve!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06bb159b-e024-4cce-8475-bc63478d3947_680x284.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Within the hour, Groq made their own announcement:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bqL1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bqL1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 424w, https://substackcdn.com/image/fetch/$s_!bqL1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 848w, 
https://substackcdn.com/image/fetch/$s_!bqL1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 1272w, https://substackcdn.com/image/fetch/$s_!bqL1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bqL1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png" width="953" height="202" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e74a602e-9743-4787-9828-909c3b6b5c73_953x202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:202,&quot;width&quot;:953,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53788,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.zach.be/i/182655242?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bqL1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 424w, https://substackcdn.com/image/fetch/$s_!bqL1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 848w, 
https://substackcdn.com/image/fetch/$s_!bqL1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 1272w, https://substackcdn.com/image/fetch/$s_!bqL1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe74a602e-9743-4787-9828-909c3b6b5c73_953x202.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>Instead of talking about an acquisition, Groq described a non-exclusive licensing agreement in which key Groq executives and engineers joined Nvidia, but Groq continued as an independent company. Shortly afterwards, the CNBC story was updated to reflect this reality.</p><p>Groq also didn&#8217;t mention the $20B number anywhere in their announcement, and Nvidia has yet to make any public comment. However, I&#8217;ve since gotten independent confirmation from an anonymous source with knowledge of the deal that it was, in fact, worth $20B.</p><p>It&#8217;s not a good look that startups are getting gutted in <a href="https://techcrunch.com/2025/07/14/cognition-maker-of-the-ai-coding-agent-devin-acquires-windsurf/">weird reverse-acqui-hire deals</a> that strip key talent and technology without necessarily bringing all of the employees along for the ride. But I&#8217;m not an expert on M&amp;A regulations, and this is Zach&#8217;s Tech Blog, not Zach&#8217;s Regulatory Compliance Blog, so I&#8217;m not going to discuss the specific deal structure here.</p><p>The important thing is that Nvidia did genuinely pay twenty billion dollars for Groq&#8217;s technology and key talent. From the outside, Groq is not worth $20B. 
Their <a href="https://www.investing.com/news/company-news/groq-slashes-2025-revenue-projections-to-500-million--the-information-93CH-4158309">projected revenue for 2025 was $500M</a>, so the price-to-sales ratio for this deal was an eye-watering 40 &#8212; 4x higher than the ratio of companies like Apple and Google. The first version of the LPU chip was taped out in 2020, making it woefully outdated compared to other companies&#8217; more recent chips. The TCO of a Groq cluster is unreasonably high, <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq?utm_source=publication-search">requiring hundreds of chips worth millions of dollars in total to run relatively small open-source models like Llama70B</a>. And contrary to what some analysts are claiming, Jonathan Ross and the other Groq executives don&#8217;t have a deep knowledge of Google&#8217;s TPU architecture; they left Google to found Groq after the first generation of the TPU, and Google is on the 8th generation by now. So&#8230; why do I think Nvidia spent $20B on Groq? Here are a few possibilities.</p><p><em><strong>Fair warning: this is all wild speculation. But speculation is fun!</strong></em></p><h1>The LPUv2 is something special.</h1><p>Dylan Patel of <a href="https://semianalysis.com/">SemiAnalysis</a> was quoted in <a href="https://www.theinformation.com/articles/nvidia-struck-20-billion-megadeal-groq">the Information</a> with this theory:</p><blockquote><p>&#8220;Groq&#8217;s first-generation chips were not competitive [with Nvidia&#8217;s chips], but there are two [more] generations coming back-to-back soon,&#8221; said Dylan Patel, chief analyst at chip consultancy SemiAnalysis. &#8220;Nvidia likely saw something they were scared of in those.&#8221;</p></blockquote><p>He&#8217;s right about the <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq?utm_source=publication-search">LPUv1 being an uncompetitive chip</a>. As for the LPUv2, Groq has done a very good job keeping architectural details of their new chips from leaking. We do know that it&#8217;s being manufactured in <a href="https://www.kedglobal.com/korean-chipmakers/newsView/ked202308160014">Samsung&#8217;s 4nm process</a>, which may be important &#8212; more on that later.</p><p>But unless Groq has cooked up some <a href="https://sites.utexas.edu/CRL/files/2021/04/CICC_2021_3D_split_SRAM_slides.pdf">3D stacked SRAM architecture</a> that blows Nvidia out of the water, I&#8217;m personally doubtful that there&#8217;s something in the LPUv2 that makes it significantly better than Nvidia&#8217;s chips. 
Groq had <a href="https://www.teamblind.com/post/thinking-about-joining-groq-ncga4mvv">some major layoffs</a> and <a href="https://irrationalanalysis.substack.com/p/very-long-incoherent-writeup?utm_source=publication-search">lost a ton of great talent to attrition, including their old Chief Architect</a>. That&#8217;s not a great recipe for a company that&#8217;s suddenly able to out-perform Nvidia at scale.</p><h1>Groq has some unique partnership Nvidia wants.</h1><p>Maybe Groq has some <a href="https://x.com/IAmAdiFuchs/status/2004050560769839276">sales partnership or other business deal that Nvidia really wants access to</a>. Given the &#8220;non-exclusive licensing deal&#8221; structure, I&#8217;m not sure how that would work, but it&#8217;s certainly a possibility. For all of its technical weaknesses, <a href="https://www.zach.be/p/why-groq-might-struggle-as-an-api">Groq&#8217;s business development team has proven to be particularly shrewd</a>, turning mediocre and outdated silicon into multiple huge funding rounds. They were the first AI chip startup to pivot to selling inference tokens as their primary product, rather than trying to sell chips to hyperscalers &#8212; and most of their competitors have been following suit.</p><p>I&#8217;m not sure what partnership Groq has that Nvidia might want, though. The only major AI lab that Groq has a partnership with is Meta, <a href="https://groq.com/newsroom/meta-and-groq-collaborate-to-deliver-fast-inference-for-the-official-llama-api">which offers Groq&#8217;s cloud as an optional inference provider in the Llama4 API</a>. But that partnership isn&#8217;t unique &#8212; <a href="https://www.cerebras.ai/press-release/meta-collaborates-with-cerebras-to-drive-fast-inference-for-developers-in-new-llama-api">Cerebras is offered as another API inference provider on equal footing to Groq</a>.</p><p>A lot of Groq&#8217;s other big partnerships are with Middle Eastern countries and governments. 
They have partnerships with the <a href="https://groq.com/newsroom/saudi-arabia-announces-1-5-billion-expansion-to-fuel-ai-powered-economy-with-ai-tech-leader-groq?utm_source=chatgpt.com">Saudi Arabian government</a> as well as <a href="https://techafricanews.com/2025/12/10/zain-ksa-and-groq-introduce-high-performance-ai-platform-developed-in-saudi-arabia/?utm_source=chatgpt.com">various</a> <a href="https://www.humain.com/en/news/humain-and-groq-deliver?utm_source=chatgpt.com">Saudi</a> <a href="https://groq.com/newsroom/aramco-digital-and-groq-announce-progress-in-building-the-worlds-largest-inferencing-data-center-in-saudi-arabia-following-leap-mou-signing?utm_source=chatgpt.com">companies</a> to deploy Groq systems at scale. While these partnerships are valuable, they&#8217;re certainly not unique. <a href="https://www.zach.be/p/why-is-cerebras-trying-to-go-public?utm_source=publication-search">Cerebras also has a large number of Middle Eastern partnerships</a> &#8212; in their case, with the UAE.</p><p>So Groq doesn&#8217;t have any publicly announced partnerships that are so unique that they could justify a $20B price tag. There could be something that isn&#8217;t public yet, of course, but going purely based on public info, I think there&#8217;s only one thing that could make Groq this valuable: the fact that Groq&#8217;s chips could help Nvidia make their supply chain more resilient and less reliant on TSMC.</p><h1>Supply chain resilience, and reducing reliance on TSMC.</h1><p>This is my personal pet theory, so I saved it for last. Of all of the different AI chip startups, Groq is unique in that its product just consists of a straightforward logic die manufactured in a Samsung process node. <a href="https://arxiv.org/abs/2405.07518v1">SambaNova relies on TSMC&#8217;s Chip-on-Wafer-on-Substrate (CoWoS) packaging</a> to leverage high-bandwidth memory (HBM), just like Nvidia&#8217;s chips. Furiosa, Etched, and MatX all use, or plan to use, HBM as well, so they also need a 2.5D packaging solution to package the HBM next to the logic die. And each 2.5D packaging solution is tied to a specific foundry and packaging provider &#8212; probably TSMC.</p><p>Other architectures without HBM also have foundry-specific features. <a href="https://d-matrix.ai/pdf/d-Matrix-WhitePaper-Technical-FINAL.pdf">d-Matrix uses TSMC-specific, custom SRAM bit-cells for their processing-in-memory architecture</a>, as well as an <a href="https://www.eetimes.com/d-matrix-targets-fast-llm-inference-for-real-world-scenarios/">organic interposer for their chiplet-based packaging solution</a>. 
And Cerebras&#8217;s wafer scale engine requires a ton of <a href="https://www.cerebras.ai/press-release/cerebras-systems-unveils-the-industrys-first-trillion-transistor-chip">specialized processing that was developed in collaboration with TSMC</a>.</p><p>Nvidia&#8217;s Blackwell B200 and upcoming Rubin architectures are already incredibly reliant on TSMC&#8217;s CoWoS packaging, which has limited production volume and is incredibly expensive. If Nvidia wants to make their supply chain more resilient and make themselves less dependent on TSMC, there are very few acquisitions they could make that would help them out. SambaNova, Furiosa, Etched, MatX, Cerebras, and d-Matrix all rely on some foundry-specific feature that would make their design extremely hard to port from one process node to another. That means that if TSMC suddenly runs out of production capacity, raises CoWoS prices, or even worse, gets invaded by mainland China, Nvidia would still be out of luck.</p><p>Groq&#8217;s chips, on the other hand, <a href="https://www.kedglobal.com/korean-chipmakers/newsView/ked202308160014">are already manufactured in a Samsung process</a>, and could easily be ported to an Intel process. By relying on a simple logic chip that can be manufactured by multiple foundries, the supply chain for Groq&#8217;s chips is extremely resilient, even in the case of surging demand for AI chips, HBM, and advanced packaging processes. This idea isn&#8217;t new; in the past, <a href="https://en.wikipedia.org/wiki/Apple_A9">Apple has manufactured the same chip on different process nodes</a> to ensure they had sufficient capacity for the immense demand of new iPhone launches.</p><p>By acquiring Groq&#8217;s assets, Nvidia can now sell more AI chips than TSMC has CoWoS production capacity to make for them. They can sell more AI chips even if they&#8217;ve gone through their entire HBM allocation. 
Groq&#8217;s chips are worse than Nvidia&#8217;s own Blackwell B200s, but Nvidia is already going to be able to sell every B200 they can possibly produce. The LPUs mean that Nvidia can keep selling chips even after the limited B200 stock runs dry.</p><div class="pullquote"><p><strong>By acquiring Groq&#8217;s assets, Nvidia can now sell more AI chips than TSMC has CoWoS production capacity to make for them.</strong></p></div>]]></content:encoded></item><item><title><![CDATA[The 2025 Zach's Tech Blog Holiday Roundup]]></title><description><![CDATA[We made it through another year!]]></description><link>https://www.zach.be/p/the-2025-zachs-tech-blog-holiday</link><guid isPermaLink="false">https://www.zach.be/p/the-2025-zachs-tech-blog-holiday</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Wed, 24 Dec 2025 20:46:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0QHW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Merry Christmas, Happy New Year, Happy Hanukkah, and enjoy all of your other holidays as well! 
In between <a href="https://x.com/blip_tm/status/2002187991914000499">my quest to find the best hot buttered rum in New York City</a> and traveling back to my hometown of <a href="https://www.zach.be/p/my-favorite-cocktails-in-los-angeles">Los Angeles</a>, I&#8217;ve decided to revive the Zach&#8217;s Tech Blog holiday tradition of doing <a href="https://www.zach.be/p/the-zachs-tech-blog-holiday-roundup?utm_source=publication-search">a little roundup of all of my favorite things I&#8217;ve written this year!</a></p><p>I started 2025 with about 700 subscribers. This year, we&#8217;ve more than doubled that, to over 1,600! We&#8217;ve had some hits as well, getting over 8,000 views on <a href="https://www.zach.be/p/why-is-it-so-hard-for-startups-to">my most popular article this year</a>. I&#8217;ve gotten coffee and drinks with multiple people I&#8217;ve met through this blog, I&#8217;ve been <a href="https://www.zach.be/p/making-unconventional-computing-practical">invited to speak at conferences</a>, and overall, I&#8217;ve had an amazing year writing all these articles. And I wouldn&#8217;t be able to do it if there weren&#8217;t people reading!</p><p><strong>So&#8230; thank you so much for reading for another year!</strong></p><p>I really appreciate everybody who reads these articles, likes them, comments, or reaches out to tell me they enjoyed something I wrote. I read it all (really!) and seeing smart, cool, interesting people appreciate the work I do warms my heart. 
If you do feel inspired to message me on <a href="https://x.com/blip_tm">Twitter</a> or <a href="https://www.linkedin.com/in/zachary-belateche-68b298a5/">LinkedIn</a>, feel free to say hi!</p><p>Now on to my favorite articles of the year!</p><h1>My Hottest Take</h1><p>Most people think that large language models (LLMs) are pretty bad at writing Verilog, the special language we use to design digital chips. Unlike high-level programming languages like Python or JavaScript, the performance of a piece of Verilog code is highly dependent on the quality of the Verilog code itself, and LLMs generally write mediocre Verilog at best. There&#8217;s a reason why most startups using LLMs for chip design are <a href="https://www.zach.be/p/why-did-cadence-acquire-chipstack?utm_source=publication-search">building products for verifying chips</a>, rather than writing Verilog.</p><p><a href="https://www.zach.be/p/yc-is-wrong-about-llms-for-chip-design">I definitely agree that LLMs are a very, very long way away from writing the kind of high-quality Verilog</a> that you&#8217;d find in the core mathematics engine of something like an Nvidia GPU.
But most chips have a lot of subsystems that are acquired from third-party vendors, called &#8220;<a href="https://en.wikipedia.org/wiki/Semiconductor_intellectual_property_core">IP cores</a>&#8221;. A large, modern chip will have a substantial part of its development budget allocated to IP cores. And a lot of these IP cores can fulfill their functions even if their implementation is mediocre. So I made a prediction:</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;2f04f9a0-4624-4b77-b26c-57eb01ee80fd&quot;,&quot;caption&quot;:&quot;I&#8217;ve been notably skeptical of LLMs&#8217; ability to design chips. Verilog, the programming language most commonly used for chip design, is unforgiving, with minor differences in a block of code resulting in significantly worse performance. It&#8217;s difficult to get your hands on enough training data to fine-tune an LLM to generate Verilog, and even if you do, t&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;LLMs will power next-gen chip IP companies.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:80506581,&quot;name&quot;:&quot;zach&quot;,&quot;bio&quot;:&quot;building chips, writing words&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f07352f5-7f47-4538-8448-1ecef96b606f_712x712.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-04T14:58:00.397Z&quot;,&quot;cover_image&quot;:null,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.zach.be/p/llms-will-power-next-gen-chip-ip&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:158317083,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:15,&quot;comment_count&quot;:6,&quot;publication_id&quot;:775919,&quot;publication_name&quot;:&quot;zach's 
tech blog&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Qv3s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><p>We haven&#8217;t actually seen an LLM-powered IP company be founded yet, but <a href="https://www.spheresemi.com/">Sphere Semi</a> is basically doing this exact thing for analog chips and analog IP. They&#8217;re developing <a href="https://www.zach.be/p/rl-isnt-the-silver-bullet-for-ai?utm_source=publication-search">reinforcement-learning powered AI tools</a> for analog chip design, but instead of selling those tools, they&#8217;re selling chips and IP cores that they develop in-house using those tools.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/the-2025-zachs-tech-blog-holiday?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/the-2025-zachs-tech-blog-holiday?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/the-2025-zachs-tech-blog-holiday?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>My Most Popular Article</h1><p>Every chip designer hates the oligopoly that <a href="https://www.cadence.com/en_US/home.html">Cadence</a>, <a href="https://www.synopsys.com/">Synopsys</a>, and <a href="https://eda.sw.siemens.com/en-US/">Siemens</a> have on the specialized software we use to design chips. What&#8217;s surprising, though, is that nobody&#8217;s successfully disrupted this oligopoly; after all, Silicon Valley startups are really good at making software capable of disrupting legacy incumbents. Those in the know, however, are aware of the inter-company politics that keep startups from successfully selling chip design software.</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6271af36-7452-4ffe-94c5-d72a97991faf&quot;,&quot;caption&quot;:&quot;If you talk to any chip designer, they&#8217;ll complain about Cadence&#8217;s chip design software. They&#8217;ll probably complain about Synopsys too, and maybe Mentor Graphics (which is now owned by Siemens, but everybody still calls them Mentor Graphics). 
These three companies have an oligopoly on chip design software, despite their software being slow, difficult to &#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Why is it so hard for startups to compete with Cadence?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:80506581,&quot;name&quot;:&quot;zach&quot;,&quot;bio&quot;:&quot;building chips, writing words&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f07352f5-7f47-4538-8448-1ecef96b606f_712x712.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-07-01T14:40:24.104Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!ueCk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.zach.be/p/why-is-it-so-hard-for-startups-to&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:167271180,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:25,&quot;comment_count&quot;:9,&quot;publication_id&quot;:775919,&quot;publication_name&quot;:&quot;zach's tech blog&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Qv3s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><p>I think that a lot of non-chip-designers probably appreciated getting a peek into the weird world of politics behind the chip design software industry.
I also tried to end on a hopeful note: there are some startups that might be able to disrupt the industry, and I&#8217;m rooting for them!</p><p>This article also got me a bit of flak because I suggested a political marriage of an EDA startup founder to Morris Chang&#8217;s daughter as a solution to the Cadence / Synopsys / Siemens oligopoly. Though when everybody is talking about how Taiwan is just like Arrakis from Dune, I don&#8217;t think a political marriage should be totally out of the question.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0QHW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0QHW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0QHW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0QHW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0QHW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0QHW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg" width="487" height="482.6711111111111" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1115,&quot;width&quot;:1125,&quot;resizeWidth&quot;:487,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The spice must flow &#128640;&#128200; : r/wallstreetbets&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The spice must flow &#128640;&#128200; : r/wallstreetbets" title="The spice must flow &#128640;&#128200; : r/wallstreetbets" srcset="https://substackcdn.com/image/fetch/$s_!0QHW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0QHW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0QHW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!0QHW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42dee31b-4b09-433c-ad7d-a9ec968386e0_1125x1115.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>a16z American Dynamism is holding auditions for Paul Atreides</em></figcaption></figure></div><h1>My Favorite Article</h1><p><a href="https://www.zach.be/p/the-zachs-tech-blog-holiday-roundup?utm_source=publication-search">Just like last year</a>, my favorite article is about <a
href="https://www.zach.be/p/approximate-computing-for-secure">side-channel attacks on AI chips</a>. Secure AI hardware is a really interesting mix of the two biggest focuses I&#8217;ve had in my career and in my writing: <a href="https://www.normalcomputing.com/">AI accelerators</a>, and <a href="https://www.btq.com/">secure hardware design</a>. Unfortunately, a lot of people see secure and tamper-proof AI hardware as unnecessary. Those people are wrong:</p><div><hr></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4b6211f0-3dd0-48a4-adbb-71a3a5092ab5&quot;,&quot;caption&quot;:&quot;I&#8217;ve talked a lot about side-channel attacks on cryptographic systems. These sorts of attacks usually focus on compromising low-cost, commodity cryptographic hardware, like the chips you&#8217;d find in a credit card or a YubiKey. But more recently, industry leaders in the AI world, including OpenAI, have been calling for more focus on&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;md&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Side channel attacks on AI chips are very real.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:80506581,&quot;name&quot;:&quot;zach&quot;,&quot;bio&quot;:&quot;building chips, writing 
words&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f07352f5-7f47-4538-8448-1ecef96b606f_712x712.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-01-21T15:17:05.110Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!_MV4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85303e92-d455-4f52-9233-a9ba9279a5b9_611x555.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.zach.be/p/side-channel-attacks-on-ai-chips&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:155290482,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:13,&quot;comment_count&quot;:4,&quot;publication_id&quot;:775919,&quot;publication_name&quot;:&quot;zach's tech blog&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Qv3s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div><hr></div><p>The BarraCUDA attack is an extremely important proof-of-concept for side-channel attacks on real-world AI systems. If you want to deploy an AI model on an edge device like a security camera, a drone, or a self-driving car, you have to assume that anybody with physical access to the device can steal your model weights. And if they have your model weights, they can find adversarial examples that make your model go haywire. 
For example, they could <a href="https://arxiv.org/pdf/1707.08945">install special graffiti on stop signs that makes your self-driving model think it&#8217;s actually a different kind of sign</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sewm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sewm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 424w, https://substackcdn.com/image/fetch/$s_!sewm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 848w, https://substackcdn.com/image/fetch/$s_!sewm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 1272w, https://substackcdn.com/image/fetch/$s_!sewm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sewm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png" width="424" height="345.25714285714287" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:630,&quot;resizeWidth&quot;:424,&quot;bytes&quot;:134603,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.zach.be/i/182530227?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sewm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 424w, https://substackcdn.com/image/fetch/$s_!sewm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 848w, https://substackcdn.com/image/fetch/$s_!sewm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 1272w, https://substackcdn.com/image/fetch/$s_!sewm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48414112-d1db-4ada-809a-f4d3343c1d1b_630x513.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 
20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><a href="https://arxiv.org/pdf/1707.08945">Eykholt et al, 2018</a></em></figcaption></figure></div><p>On that slightly terrifying note, that&#8217;s all for this year! Happy holidays, thanks for reading, and I&#8217;ll catch you in the new year with some new articles!</p><p><em>- Zach from Zach&#8217;s Tech Blog</em></p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Why did Irreducible fail?]]></title><description><![CDATA[The ZKP chip market and the AI chip market aren&#8217;t the same.]]></description><link>https://www.zach.be/p/why-did-irreducible-fail</link><guid isPermaLink="false">https://www.zach.be/p/why-did-irreducible-fail</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 01 Dec 2025 15:54:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last month, Irreducible, the company founded to use field programmable gate array chips (FPGAs) to accelerate zero-knowledge proofs (ZKPs), <a href="https://www.irreducible.com/posts/irreducible-shutting-down">shut down</a>. This came only a couple months after they <a href="https://www.irreducible.com/posts/reinventing-irreducible">announced a pivot away from FPGAs and into high-performance software implementations of ZKPs</a>.</p><p>This surprised a lot of people in the industry. <a href="https://www.paradigm.xyz/2022/04/zk-hardware">The conventional wisdom was that zero-knowledge proofs deserve custom hardware</a>, just like AI workloads do. Both AI and ZKPs require computers to perform a lot of complex heavy-duty computations, but the result of all those computations should be valuable enough to make custom hardware worthwhile.
In the case of AI, that result is a computer you can talk to. In the case of zero-knowledge proofs, <a href="https://en.wikipedia.org/wiki/Zero-knowledge_proof">it&#8217;s a mathematical way to guarantee you did a computation, without actually revealing the underlying data</a>.</p><p>This has a ton of valuable use cases. You can, for example, <a href="https://arxiv.org/abs/2506.20915">use a ZKP to prove that an AI model was trained with a specific approved dataset</a>, without any adversarial examples secretly baked in. But by far, the largest use case for ZKPs is in cryptocurrency. <a href="https://en.wikipedia.org/wiki/Zcash">Zcash</a> uses ZKPs to create a fully anonymous blockchain. <a href="https://www.coinbase.com/learn/crypto-glossary/what-are-zero-knowledge-zk-rollups">Zero-knowledge rollups</a> can improve the efficiency of blockchains by doing some computation off-chain, and only storing the proofs on-chain.</p><p>However, in the past year or so, we&#8217;ve been seeing the companies working on custom ZKP hardware struggle. <a href="https://www.ingonyama.com/">Ingonyama</a> started focusing <a href="https://www.ingonyama.com/post/icicle-v3-going-multi-platform">purely on software, rather than FPGA hardware</a>, and their core libraries started to <a href="https://www.ingonyama.com/post/icicle-v4-lattice-based-cryptography-and-better-devex">hedge into high-performance implementations of post-quantum cryptography algorithms</a>, rather than focusing purely on ZKPs. <a href="https://www.fabriccryptography.com/">Fabric Cryptography&#8217;s</a> ZKP chip still hasn&#8217;t hit the market after multiple years of promises. Irreducible shut down.</p><p>So clearly, AI workloads and ZKPs aren&#8217;t the same. AI chip companies are succeeding, while ZKP hardware companies are struggling.
What are the key differences?</p><h1>ZKPs are much more diverse</h1><p>Most AI workloads are fairly similar. They consist of a lot of floating point matrix multiplications, and then a smaller number of <a href="https://en.wikipedia.org/wiki/Activation_function">nonlinear activation functions</a>. Some workloads might need 32-bit floating point numbers, while some might only need 8-bit floating point numbers, and some workloads may require <a href="https://en.wikipedia.org/wiki/Softmax_function">specific nonlinearities that are difficult to implement efficiently</a> &#8212; but for the most part, AI workloads don&#8217;t look wildly different from one another.</p><p>Different ZKP algorithms, on the other hand, can be extremely different from one another. Some ZKPs, like Binius, <a href="https://vitalik.eth.limo/general/2024/04/29/binius.html">operate on binary tower fields using bitwise operations</a>. <a href="https://polygon.technology/blog/plonky2-a-deep-dive">Plonky2 requires number-theoretic transforms</a>, which require significant memory bandwidth. Groth16 requires less memory bandwidth, <a href="https://alinush.github.io/groth16">but operates on 381-bit integers</a>.
And other ZKP schemes like Halo2 and Bulletproofs have their own computational bottlenecks.</p><p>If you&#8217;re building hardware, this is a huge problem. For example, <a href="https://www.fabriccryptography.com/">Fabric Cryptography&#8217;s VPU features 384-bit ALUs</a>, which will be totally underutilized when running the bit-sliced operations required by Binius. And importantly, ZKPs aren&#8217;t standardized the way that <a href="https://csrc.nist.gov/projects/post-quantum-cryptography">conventional cryptography algorithms are</a>. This put Irreducible in a tough spot. They could spend their time developing the best hardware in the world for one specific ZKP, but if a new ZKP suddenly became popular, they&#8217;d have to go back to the drawing board.</p><p>This is part of why Irreducible <a href="https://www.irreducible.com/posts/announcing-binius64">tried to advocate for Binius as a proving scheme</a>. If they encouraged everybody to stop trying out new ZKPs and pick one algorithm that was well-suited to Irreducible&#8217;s hardware, that could solve the problem of needing to constantly support new algorithms. But so many different L2 rollups and privacy-focused blockchains already have non-Binius ZKPs baked into their protocols. By the time Irreducible shut down, there were no publicly deployed, production L2 rollups or privacy chains that used Binius.</p><p>Notably, conventional cryptographic hardware accelerators don&#8217;t have this problem, because <a href="https://csrc.nist.gov/projects/post-quantum-cryptography">conventional cryptography is standardized</a>. Most cryptographic accelerators only accelerate a single algorithm; <a href="https://www.rambus.com/security/crypto-accelerator-cores/">Rambus sells a </a><em><a href="https://www.rambus.com/security/crypto-accelerator-cores/">separate</a></em><a href="https://www.rambus.com/security/crypto-accelerator-cores/"> SHA-2, SHA-3, and AES engine</a>!
In the world of conventional cryptography, flexibility and <a href="https://en.wikipedia.org/wiki/Cryptographic_agility">cryptographic agility</a> are a value-add; only the best solutions on the market, like <a href="https://www.btq.com/blog/processing-in-memory-and-the-future-of-hardware-security">BTQ&#8217;s QCIM</a>, offer support for multiple algorithms, and inferior, inflexible solutions still maintain substantial market share. In the world of ZKPs, though, support for multiple wildly different algorithms is a baseline requirement that companies like Irreducible and Fabric failed to meet.</p><p>But that&#8217;s still only half the answer. It&#8217;s extremely difficult, but it&#8217;s definitely possible to build flexible cryptographic hardware. It&#8217;s possible to partner with specific L2s to get them to use your specific ZKP algorithm. And yet Irreducible still failed. Why?</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-did-irreducible-fail?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-did-irreducible-fail?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/why-did-irreducible-fail?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>The ZKP market is much smaller than the AI market</h1><p>It&#8217;s obvious that the generative AI market is huge, even if a significant portion of that market is cheating on high school essays and generating slop. 
The same simply isn&#8217;t true of zero-knowledge proofs.</p><p>Zcash may have a <a href="https://coinmarketcap.com/currencies/zcash/">$7B market cap</a>, but the <a href="https://www.theblock.co/post/378232/zcash-shielded-pool-climbs-23-supply-network-usage-surges">vast majority of Zcash transactions don&#8217;t even use the private zero-knowledge proof transaction features</a>. Scroll, one of the largest ZK-rollups, has a <a href="https://www.coindesk.com/tech/2025/04/23/scrolls-euclid-upgrade-pushes-it-into-stage-1-decentralization-era">total value locked that fluctuates between $100M and $1B</a>. That&#8217;s the same size as a large seed round for an AI startup.</p><p>Ultimately, <em>most cryptocurrency users don&#8217;t care much about the privacy features ZKPs offer.</em> As the cryptocurrency industry has evolved, it&#8217;s become less focused on cryptocurrency as a means of exchange, and more on cryptocurrency as an investment. And as large institutions invest in cryptocurrency, this only becomes more true.</p><p>Now, this isn&#8217;t necessarily a bad thing! More institutional investment activity could reduce volatility and help make cryptocurrency into a reliable part of a complete investment portfolio. But zero-knowledge proofs aren&#8217;t a key technology to enable that. The key features that ZKPs enable are private transactions and more efficient rollups with significantly lower transaction fees. These are valuable in the use-cases where cryptocurrency is transacted as a means of exchange, not used as an investment.</p><p>Again, this isn&#8217;t something that&#8217;s true of conventional cryptography hardware. 
There are over <a href="https://globalplatform.org/wp-content/uploads/2018/05/Introduction-to-Secure-Element-15May2018.pdf">50 billion secure elements in the world</a>, and that doesn&#8217;t include the cryptographic hardware accelerators inside of nearly every <a href="https://support.microsoft.com/en-us/windows/enable-tpm-2-0-on-your-pc-1fd5a332-360d-4f46-a1e7-ae6b0c90645c">laptop, desktop</a>, and <a href="https://support.apple.com/guide/security/secure-enclave-sec59b0b31ff/web">smartphone SoC</a> on the planet. Cryptography is necessary to guarantee basic privacy and security, so secure cryptography hardware is a requirement to comply with nearly every privacy-related regulation from any government in the world, creating massive demand.</p><p>Ironically, it&#8217;s the decentralized nature of cryptocurrency that reduces the demand for ZKP hardware. Investors care much more about cryptocurrency as an investment than about its ability to enable efficient or private transactions. And because there&#8217;s no central authority to enforce privacy regulations, ZKPs fall by the wayside. It&#8217;s unfortunate, because Irreducible&#8217;s technology was great&#8230; the crypto community just didn&#8217;t really want it.</p>]]></content:encoded></item><item><title><![CDATA[Why did Cadence acquire ChipStack?]]></title><description><![CDATA[And what does it mean for other AI EDA startups?]]></description><link>https://www.zach.be/p/why-did-cadence-acquire-chipstack</link><guid isPermaLink="false">https://www.zach.be/p/why-did-cadence-acquire-chipstack</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Tue, 11 Nov 2025 17:00:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week, Cadence Design Systems, one of the big 3 electronic design automation (EDA) companies that dominate the market for the software that engineers use to design microchips, <a href="https://www.chipstack.ai/blog/chipstack-is-joining-cadence">acquired a small startup called ChipStack</a>. ChipStack uses AI to speed up silicon verification, which is the process of testing chip designs in a simulation before sending them off to be manufactured. Cadence didn&#8217;t announce how much they spent buying ChipStack, which usually means it was a somewhat small acquisition &#8212; in the range of tens to hundreds of millions of dollars. But the implications for other startups developing AI-powered EDA software are massive.</p><p>For those of us who know the EDA industry well, this isn&#8217;t a surprise. 
As a matter of fact, <a href="https://www.zach.be/p/why-is-it-so-hard-for-startups-to">this is the standard modus operandi of the EDA industry</a>. Startups innovate, get acquired for $50-500M by the big players, and then get folded into their product portfolio. It&#8217;s much easier to sell EDA software when it&#8217;s bundled with existing Cadence products rather <a href="https://www.zach.be/p/llms-will-power-next-gen-chip-ip">than packaged as a standalone product</a>. That just leaves us to wonder: why did Cadence acquire ChipStack specifically? And what does this mean for all the other startups in the space?</p><h1>Why ChipStack?</h1><p>From the outside, ChipStack doesn&#8217;t look all that different from their competitors, like <a href="https://chipagents.ai/">ChipAgents</a> and <a href="https://www.bronco.ai/">Bronco</a>. However, if you dig in a bit deeper, the differences get clearer. 
<a href="https://www.businesswire.com/news/home/20251021677325/en/ChipAgents-Raises-Oversubscribed-%2421M-Series-A-to-Redefine-AI-for-Chip-Design">ChipAgents just raised a $21M Series A</a>, while <a href="https://www.geekwire.com/2025/cadence-to-acquire-seattle-startup-chipstack-to-boost-chip-design-automation/">ChipStack has only raised about $7M to date</a> &#8212; that means that ChipStack&#8217;s valuation is lower, so they could be acquired for a much lower price. Bronco, on the other hand, <a href="https://www.ycombinator.com/companies/bronco-ai">is run by college dropouts</a>, and probably not a great acquisition target for a large, corporate company like Cadence.</p><p>A lot of the other AI-powered verification startups have similar stories; <a href="https://www.businesswire.com/news/home/20250515702098/en/Cognichip-Launches-out-of-Stealth-with-%2433M-in-Seed-Funding-to-Deliver-Artificial-Chip-Intelligence-ACI">Cognichip raised $33M</a>, so they probably would have been a lot more expensive to acquire than ChipStack. Normal Computing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> offers an <a href="https://www.normalcomputing.com/solutions">AI-powered verification product</a>, but also develops <a href="https://www.zach.be/p/scaling-thermodynamic-computing">thermodynamic computing chips</a>, which are totally outside of Cadence&#8217;s wheelhouse. Ultimately, ChipStack was probably the most reasonably priced acquisition target that would fit well within Cadence as a parent company.</p><h1>What about other AI chip design companies?</h1><p>For AI-powered chip verification companies building direct competitors to ChipStack, like <a href="https://chipagents.ai/">ChipAgents</a> and <a href="https://www.bronco.ai/">Bronco</a>, this acquisition is both a blessing and a curse. With Cadence clearly planting its flag in AI-powered verification software, the pressure is on for Cadence&#8217;s two biggest competitors, Synopsys and Siemens EDA, to make a move. This could mean that ChipStack&#8217;s competitors acquire their own AI-powered verification startups over the next few months.</p><p>However, there are a bunch of different AI-powered verification startups out there, and <a href="https://www.zach.be/p/why-is-it-so-hard-for-startups-to">the EDA industry is an oligopoly</a>. After Cadence, Synopsys, and Siemens all acquire AI-powered verification capabilities, there&#8217;s no other large player who could acquire any remaining competitors to ChipStack. That means that all those other startups either need to make it to IPO or get <a href="https://en.wikipedia.org/wiki/Acqui-hiring">acqui-hired</a> for far less than they&#8217;re worth &#8212; or they&#8217;ll go out of business.</p><p>Not all of the AI-powered chip design startups are competitors to ChipStack, though. <a href="https://www.silimate.com/">Silimate</a>, for example, is building an AI-powered tool to predict the performance of chip designs earlier on in the design process. 
<a href="https://silogy.io/">Silogy</a> is building an AI agent to debug test failures, rather than just writing tests using LLMs like ChipStack. These companies are in a much better position following this acquisition, because it signals that potential exit opportunities are opening up, without closing the door to a potential acquisition by Cadence.</p><p>Ultimately, the ChipStack acquisition is likely the first in a series of M&amp;A announcements we&#8217;ll hear over the next couple years, as the EDA oligopoly gobbles up the best deals on AI-powered EDA startups. And unfortunately, the companies that don&#8217;t get picked by the end of it may end up dying.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><em>Disclosure: I used to work at Normal Computing as their silicon engineering lead. 
I am still an advisor there.</em></p></div></div>]]></content:encoded></item><item><title><![CDATA[So, I have to talk about Extropic.]]></title><description><![CDATA[It&#8217;s probabilistically nothing, right?]]></description><link>https://www.zach.be/p/so-i-have-to-talk-about-extropic</link><guid isPermaLink="false">https://www.zach.be/p/so-i-have-to-talk-about-extropic</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Tue, 04 Nov 2025 20:39:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Q-9w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Disclosure before I jump into the article: I used to lead the silicon team at Normal Computing, a thermodynamic computing company that is considered a competitor to Extropic. I don&#8217;t work there anymore, but I&#8217;m still an advisor there.</em></p><div><hr></div><p>After years of doubt and claims that <a href="https://extropic.ai/">Extropic</a> was a total grift, Guillaume Verdon and his crew finally announced their first product, <a href="https://extropic.ai/writing/inside-x0-and-xtr-0">their XTR-0 devkit</a>. A lot of people are claiming that this vindicates all of Extropic&#8217;s claims and &#8220;proves the haters wrong&#8221;. But that&#8217;s not entirely true.</p><p>Yes, the folks who claimed that Extropic was a total scam that would die without ever shipping a piece of hardware were wrong. Extropic did design a chip that is capable of running certain small ML workloads with impressive efficiency. That&#8217;s cool, <a href="https://x.com/tenderizzation/status/1983898136373559709">but also most PhDs in chip design do at least one or two small ML chips as part of their graduate research</a>. 
Building an incredibly efficient chip that only works on MNIST is a bit of a punchline at this point.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q-9w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q-9w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 424w, https://substackcdn.com/image/fetch/$s_!Q-9w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 848w, https://substackcdn.com/image/fetch/$s_!Q-9w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-9w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q-9w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png" width="226" height="274.86486486486484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:740,&quot;resizeWidth&quot;:226,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q-9w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 424w, https://substackcdn.com/image/fetch/$s_!Q-9w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 848w, https://substackcdn.com/image/fetch/$s_!Q-9w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-9w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9dbcc62-6d0f-4d24-b826-16c0be063b9c_740x900.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Also, everything Extropic is doing is building off a large existing body of work on p-bits and probabilistic computing. I&#8217;ve been working in the unconventional computing space most of my career, and I&#8217;ve been plugged into what&#8217;s actually been going on in the world of p-bits, thermodynamic computers, and Ising machines. So today, we&#8217;re going to be talking about what Extropic is doing, the current state of the p-bit research landscape, and their novel contributions.</p><h1>The Current p-bit State of the Art</h1><p>Extropic makes one claim early on in their paper that I think is extremely unfair:</p><blockquote><p>Additionally, existing devices have relied on exotic components such as magnetic tunnel junctions as sources of intense thermal noise for random-number generation (RNG). These exotic components have not yet been tightly integrated with transistors in commercial CMOS processes and do not currently constitute a scalable solution.</p></blockquote><p>Put simply, this is false. <a href="https://arxiv.org/abs/2303.10728">Most state-of-the-art work on p-bits is implemented using digital logic gates</a>, which is highly scalable on FPGAs and commercial CMOS processes. These architectures use digital pseudorandom number generators (PRNGs), which offer an efficient way to generate numbers with sufficient randomness without relying on exotic materials, with the added bonus of reproducibility due to their pseudorandom nature.</p><p>There are also a large number of analog architectures out there that are scalable, manufacturable, and don&#8217;t use exotic technologies. Not only have researchers demonstrated analog p-bits <a href="https://www.nature.com/articles/s41598-025-93218-8">multiple</a> <a href="https://www.arxiv.org/pdf/2008.10578v3">times</a> over, but there&#8217;s <a href="https://arxiv.org/pdf/2508.15234">an equivalence</a> between <a href="https://www.nature.com/articles/s41467-022-33441-3">noise-injected Ising machines</a> and <a href="https://www.zach.be/p/boltzmann-sampling-on-coupled-oscillator">p-bit samplers</a>. 
That means that all of the analog Ising machine designs, from <a href="https://www.nature.com/articles/s41928-025-01393-3.epdf">ring oscillators</a> to <a href="https://ieeexplore.ieee.org/document/10558647">cross-coupled latches</a>, can also act as p-bits &#8212; and they&#8217;re all manufactured in commercial CMOS processes.</p><p>Hell, if we&#8217;re less strict and include non-p-bit sampling architectures, <a href="https://link.springer.com/chapter/10.1007/978-1-4899-1331-9_4">analog</a> and <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=242756">digital CMOS implementations</a> of Boltzmann machines have been around since the 1990s.</p><p>Extropic also proposes other probabilistic bit structures that mirror existing work in the literature. Extropic&#8217;s <a href="https://extropic.ai/hardware">p-dits and p-modes</a> are similar to current state-of-the-art work on <a href="https://arxiv.org/abs/2506.00269">p-dits</a> and <a href="https://arxiv.org/pdf/2312.04836">programmable Gaussian distributions</a>. To their credit, though, existing work on p-ints and p-dits is digital; if Extropic&#8217;s design is all-analog, that would represent a novel design for that specific probabilistic primitive.</p><p>Overall, there are a large number of competing designs for p-bits, all with different tradeoffs. The digital designs are often slightly larger and more power-hungry, but are much easier to design, more scalable, and don&#8217;t suffer from issues due to process variability that plague analog designs. 
Analog designs may be smaller and more efficient, but often struggle to implement high-precision energy functions and may make computational errors due to process variations and parasitic effects introducing spurious biases and noise terms.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TmQu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TmQu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 424w, https://substackcdn.com/image/fetch/$s_!TmQu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 848w, https://substackcdn.com/image/fetch/$s_!TmQu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 1272w, https://substackcdn.com/image/fetch/$s_!TmQu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TmQu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png" width="550" height="317.86570743405275" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a2db71ca-3a61-458d-8903-3158c9835718_834x482.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:482,&quot;width&quot;:834,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TmQu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 424w, https://substackcdn.com/image/fetch/$s_!TmQu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 848w, https://substackcdn.com/image/fetch/$s_!TmQu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 1272w, https://substackcdn.com/image/fetch/$s_!TmQu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2db71ca-3a61-458d-8903-3158c9835718_834x482.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><a href="https://www.nature.com/articles/s41598-025-93218-8">Process variability can affect p-bit sampling</a></em></figcaption></figure></div><p>It&#8217;s unclear how much of Extropic&#8217;s architecture is analog and how much is digital. It&#8217;s also hard to tell how, if at all, they&#8217;re addressing issues of process variations in their design. Ultimately, these are the practical issues that plague most analog implementations of Boltzmann machines, and represent the reason why digital implementations are currently dominant in the literature. We&#8217;ll see if Extropic has a novel solution to this problem or not.</p><h1>Comparing to GPUs</h1><p>Extropic also makes the bold claim that their chips are 10,000x more efficient than a GPU. Unfortunately, these headline numbers are fairly rough estimates for both their hardware and for GPUs. Because Extropic&#8217;s target task is so simple, there isn&#8217;t an optimized GPU state-of-the-art model to compare it to. Instead, their GPU estimates measure the power consumption of very small models on large, powerful hardware; this likely results in extremely poor utilization of the GPU and a lot of wasted energy.</p><p>At the same time, their analysis of their hardware is fairly optimistic. They only consider the energy consumed by their RNG, resistor network, clock tree, and local wire capacitance. The Extropic team does admit that this is a very rough estimate, but this also raises one big question: if this chip actually implements a full Boltzmann machine, why not just measure its power consumption and report it? It could be the case that Extropic&#8217;s test chip has p-bit test structures, but lacks the resistor network between them to implement an entire Boltzmann machine on-chip.</p><p>One other major concern of mine is that <a href="https://arxiv.org/pdf/2510.23972">Extropic&#8217;s hardware implements bipartite graphs</a>. 
Not only does this limit your hardware to natively accelerating sparse models, but it also makes your hardware a lot less competitive against GPUs.</p><p>The biggest reason why Gibbs sampling is really slow on GPUs is its serial nature. If you try to update multiple states at the same time, you run into weird issues where <a href="https://arxiv.org/abs/2110.02481">&#8220;frustrated&#8221; connections start to cause oscillations</a> in the system state and errors in the result. <a href="https://www.gatsby.ucl.ac.uk/~gretton/papers/GonLowGreGue11.pdf">This is bad for GPUs</a>, which are natively parallel processing units. But when you start working with bipartite graphs or other sparse graphs, this problem gets a lot easier, and <a href="https://www.gatsby.ucl.ac.uk/~gretton/papers/GonLowGreGue11.pdf">you can update all the nodes in each half of the graph in parallel</a>. <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann Machines</a>, a kind of EBM that relies on sparse graphs, <a href="https://robotics.stanford.edu/~ang/papers/icml09-LargeScaleUnsupervisedDeepLearningGPU.pdf">were partially popularized because they can run so much more quickly on GPUs!</a> If you&#8217;re only focusing on bipartite graphs, GPUs are actually not all that bad at sampling.</p><h1>So, what&#8217;s new here?</h1><p>Even though Extropic&#8217;s paper makes some questionable comments about the state-of-the-art, they do make two novel contributions to p-bit research. Specifically, they propose a new kind of energy-based model, the Denoising Thermodynamic Model, or DTM, with potentially greater stability and efficiency than conventional EBMs. They also propose a more efficient RNG for their p-bits.</p><p>Conventional monolithic EBMs draw samples from a single distribution, often parameterized by a neural network. Extropic correctly points out that this often results in distributions that are extremely hard to sample from. 
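</p><p>As a reference point, the parallel block updates that make bipartite graphs GPU-friendly, described above, can be sketched in a few lines. This is a generic RBM-style Gibbs sampler with illustrative sizes and random couplings, not Extropic&#8217;s actual architecture:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 8 "visible" and 4 "hidden" units in a bipartite graph.
n_v, n_h = 8, 4
W = rng.normal(0, 0.5, size=(n_v, n_h))  # couplings only between the halves
v = rng.integers(0, 2, n_v)              # random initial binary states

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Block Gibbs sampling: with no edges inside a half, every unit in one half
# is conditionally independent given the other half, so each half can be
# resampled in a single parallel step -- exactly the update GPUs excel at.
for _ in range(100):
    h = (rng.random(n_h) < sigmoid(v @ W)).astype(int)  # all hidden at once
    v = (rng.random(n_v) < sigmoid(W @ h)).astype(int)  # all visible at once
```

<p>A fully connected graph offers no such split; its updates are inherently serial, which is where specialized hardware has more of an edge.</p><p>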
Their DTMs instead perform multiple denoising steps, similar to a diffusion model, where each step is relatively easier to sample from. This idea isn&#8217;t totally novel; it&#8217;s similar to other techniques to improve sampling, like <a href="https://en.wikipedia.org/wiki/Diffusion_model">conventional diffusion models</a> and <a href="https://www.nature.com/articles/ncomms12999?utm_source=chatgpt.com">counterdiabatic driving</a>, but it&#8217;s optimized for the specific properties of Extropic&#8217;s p-bit computer.</p><p>Their p-bits also use a novel RNG architecture. Current digital implementations of p-bits use digital PRNGs combined with LUTs to create a stochastic sigmoid activation function. The most recent, non-state-of-the-art result I found uses <a href="https://escholarship.org/content/qt7n5016w1/qt7n5016w1_noSplash_2d66d9d76fa8fc4652b489f3b78a21d2.pdf">42 look-up tables and 33 registers in an FPGA</a> to represent the p-bit PRNG and activation function. Depending on the process node, an ASIC implementation of the same logic could take up hundreds of square microns of silicon area. <a href="https://arxiv.org/pdf/2510.23972">Extropic&#8217;s RNG is only nine square microns</a>, which is a meaningful improvement!</p><p>However, it&#8217;s unclear how robust the sigmoid transfer function of their RNG is to process variations, as the only information provided on process variation relates to the energy and lag time. One of the major advantages of the digital PRNG designs is that they get to ignore mismatch entirely. Also, digital PRNG designs are much, much faster than Extropic&#8217;s analog RNG. Extropic&#8217;s RNG produces random bits at a rate of 10MHz, while digital PRNGs can often hit frequencies approaching 1GHz.
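</p><p>For context, the update that both the digital PRNG-plus-LUT designs and Extropic&#8217;s analog RNG implement is the standard stochastic sigmoid of a p-bit; a behavioral sketch (sample count illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def p_bit(bias, n_samples=10_000):
    """Behavioral p-bit: returns +/-1 with P(+1) = (1 + tanh(bias)) / 2.
    Digital designs build the tanh from a LUT and the noise from a PRNG;
    an analog RNG replaces the noise term with physical randomness."""
    r = rng.uniform(-1.0, 1.0, n_samples)  # the (P)RNG samples
    return np.sign(np.tanh(bias) - r)

# With zero input bias, the p-bit should fluctuate 50/50 between +1 and -1.
samples = p_bit(0.0)
print(abs(samples.mean()))  # close to 0 for an unbiased p-bit
```

<p>Everything else in a p-bit machine -- couplings, annealing schedules -- is built on top of this one update, which is why the area and speed of the RNG matter so much.</p><p>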
Hopefully Extropic&#8217;s upcoming Physical Review Letters paper sheds some additional light here.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>My take on Extropic</h1><p>Extropic is doing some genuinely interesting research on p-bits and energy-based models. If their RNGs are actually more efficient than the PRNGs used by existing p-bit architectures, that could offer a meaningful improvement in the number of p-bits you can fit on a single chip, as well as their power consumption.</p><p>Also, their results seem to show that their new Denoising Thermodynamic Models genuinely outperform more naive implementations of EBMs. That&#8217;s real research progress! And if their new models are also more performant than conventional diffusion models in a fairer head-to-head comparison with a GPU, that could be an interesting first step towards scaling up EBMs on probabilistic computers.
It&#8217;s interesting research progress, and would make for a very impressive PhD thesis or two, but it&#8217;s not &#8220;climbing the Kardashev scale through worship of the Thermodynamic God&#8221;, or whatever Gill has been tweeting about these days.</p><p>I will give Extropic one thing, though. The limiting factor for p-bit adoption is the lack of interested users and worthwhile algorithms. The sheer amount of hype and interest they&#8217;ve generated may bring p-bits from an obscure academic interest for unconventional computing nerds to something that a much larger crowd of researchers is interested in tinkering with. There&#8217;s a chance that interest turns into a real algorithmic breakthrough that could enable their hardware to run more meaningful, large-scale workloads than just Fashion-MNIST.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/so-i-have-to-talk-about-extropic?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/so-i-have-to-talk-about-extropic?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/so-i-have-to-talk-about-extropic?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div>]]></content:encoded></item><item><title><![CDATA[Making Unconventional Computing Practical]]></title><description><![CDATA[Lessons learned from building weird chips for years.]]></description><link>https://www.zach.be/p/making-unconventional-computing-practical</link><guid isPermaLink="false">https://www.zach.be/p/making-unconventional-computing-practical</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Wed, 22 Oct 2025 15:22:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lSNK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This blog post is based off a talk that I&#8217;ve given a couple times: first at <a href="https://homebrew.nyc/">Homebrew</a>, then at <a href="https://www.betaworks.com/">Betaworks</a>, and finally at <a href="https://foresitelabs.com/">Foresite Labs&#8217;</a> Reinventing the Semiconductor Future Symposium this past week. It&#8217;s an interesting topic -- if unconventional computing is <em>really</em> so unconventional, how does it have any chance of becoming practical or mainstream?</p><p>Well, unconventional computing is an umbrella term. It can range from the extremely weird, like <a href="https://www.mediamatic.net/en/page/16716/slime-mold-computing">slime mold computing</a>, to the relatively mainstream, like processing-in-memory. 
Broadly, unconventional computing refers to all methods of doing computation that differ from the <a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture">standard Von Neumann architecture</a>, which forms the basis of modern CPUs and, to some degree, GPUs. For decades, the academic community has done research on novel methods of computing, but recently, driven by the end of <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s Law transistor scaling</a> and the rising computational demands of AI, some startups and large corporations have been trying to commercialize these unconventional computing technologies.</p><p>I&#8217;ve worked at some of those startups, and I know the stories of many more. It turns out that taking research ideas, which are designed to maximize novelty and land in impressive journals, and trying to commercialize them is difficult. This blog post, like my talk, is about the lessons learned from trying to move three different unconventional computing ideas out of the lab and into industry.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Processing-in-Memory</h1><p>When I first encountered processing-in-memory, the dream was simple, but exciting. 
If we could store data as analog values in a resistive crossbar, we could perform entire vector-matrix multiplications at once. It was incredibly alluring. This sort of architecture would offer incredibly dense data storage, significantly reduce data movement, and take an iterative algorithm and reduce it to a single atomic operation.</p><p>There were a couple companies trying to commercialize this sort of architecture. The most well known is <a href="https://mythic.ai/">Mythic AI</a>, which used flash memory cells as the resistive element in their crossbar. Another startup, <a href="https://www.syntiant.com/">Syntiant</a>, proposed doing a similar thing, but was targeting much lower-power applications. Then a large number of research labs were working on using novel resistive memory devices, like <a href="https://en.wikipedia.org/wiki/Resistive_random-access_memory">resistive RAM (RRAM)</a> and <a href="https://en.wikipedia.org/wiki/Magnetoresistive_RAM">magnetoresistive RAM (MRAM)</a>, to implement these crossbars. 
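</p><p>The one-shot multiply works by Ohm&#8217;s and Kirchhoff&#8217;s laws: each cell stores a conductance, input voltages drive the rows, and the current collected on each column is a full dot product. A numerical sketch with illustrative values, including a crude model of device variability:</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 4x3 crossbar: each cell holds a programmed conductance (S).
G = rng.uniform(1e-6, 1e-4, size=(4, 3))
V = np.array([0.1, 0.2, 0.0, 0.3])  # input voltages applied to the rows

# Kirchhoff's current law sums V_i * G[i, j] down each column, so the
# whole vector-matrix product appears as three column currents at once.
I_ideal = V @ G

# Real cells drift from their programmed conductance; a simple stand-in
# for that nonideality is multiplicative noise on G.
G_actual = G * rng.normal(1.0, 0.05, size=G.shape)
I_actual = V @ G_actual
print(np.abs(I_actual - I_ideal) / np.abs(I_ideal))  # per-column error
```

<p>That last line is the whole problem in miniature: the multiply itself is nearly free, but every device&#8217;s analog error lands directly in the result.</p><p>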
RRAM and MRAM are a lot less mature than flash memory, but offer increased linearity and density.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lSNK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lSNK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 424w, https://substackcdn.com/image/fetch/$s_!lSNK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 848w, https://substackcdn.com/image/fetch/$s_!lSNK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 1272w, https://substackcdn.com/image/fetch/$s_!lSNK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lSNK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png" width="514" height="599" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:514,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90063,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lSNK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 424w, https://substackcdn.com/image/fetch/$s_!lSNK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 848w, https://substackcdn.com/image/fetch/$s_!lSNK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 1272w, https://substackcdn.com/image/fetch/$s_!lSNK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a717660-d2e0-45ab-80a5-bdf977e73386_514x599.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>(source: WikiChip)</em></figcaption></figure></div><p>Unfortunately, that original alluring PIM dream is dead. <a href="https://www.theregister.com/2022/11/09/mythic_analog_ai_chips/">Mythic famously nearly ran out of capital</a> before <a href="https://techcrunch.com/2023/03/09/ai-chip-startup-mythic-rises-from-the-ashes-with-13m-new-ceo/">raising a small down round and replacing their CEO</a>. Syntiant <a href="https://www.eenewseurope.com/en/ceo-interview-with-kurt-busch-of-always-on-startup-syntiant/">quietly dropped their analog architecture in favor of a conventional digital chip</a>, which has been fairly successful. 
Many of the research labs working on RRAM and MRAM have struggled to get their chips to scale past simple image classification workloads, owing to device-level mismatch and variability over process, voltage, and temperature.</p><p>&#8220;But Zach&#8221;, you might object, &#8220;haven&#8217;t there been successful processing-in-memory companies?&#8221; And the answer is actually yes! My previous startup, Radical Semiconductor, built best-in-class cryptography accelerators using PIM technology. We were <a href="https://www.prnewswire.com/news-releases/btq-technologies-completes-acquisition-of-radical-semiconductors-processing-in-memory-technology-portfolio-advancing-post-quantum-cryptography-capabilities-302256705.html">acquired by BTQ</a>, where our architecture forms the basis of <a href="https://www.btq.com/products/qcim">BTQ&#8217;s QCIM offering</a>. <a href="https://www.d-matrix.ai/">d-Matrix</a> is selling fast and efficient LLM accelerators using PIM technology as well. But neither BTQ nor d-Matrix is working on that original analog PIM dream.</p><p>Analog PIM technology offers a number of advantages: reduced data movement, dense storage, and one-shot vector-matrix multiplication. But those advantages are not created equal: just reducing data movement offers a massive performance and efficiency boost on its own. Doing an 8-bit addition operation consumes about 0.3pJ of energy, but fetching data from memory costs upwards of 10pJ.
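</p><p>Those two numbers are worth plugging in. Using the ballpark figures above (illustrative, not measurements), the energy budget of a memory-bound workload is dominated almost entirely by fetches:</p>

```python
# Ballpark per-operation energies quoted above (illustrative figures).
ADD_PJ = 0.3     # one 8-bit add
FETCH_PJ = 10.0  # one fetch from memory

def energy_pj(n_ops, fetches_per_op):
    """Total energy for n_ops adds, each requiring some number of fetches."""
    return n_ops * (ADD_PJ + fetches_per_op * FETCH_PJ)

n = 1_000_000
von_neumann = energy_pj(n, fetches_per_op=2)  # fetch both operands each time
pim = energy_pj(n, fetches_per_op=0)          # operands already live in memory
print(von_neumann / pim)  # roughly 68x: data movement dominates the budget
```

<p>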
If we develop PIM technology that gives up on multi-level memories and analog one-shot operation, we can still get a 10-100x performance improvement just by reducing data movement.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZFa9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZFa9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 424w, https://substackcdn.com/image/fetch/$s_!ZFa9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 848w, https://substackcdn.com/image/fetch/$s_!ZFa9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFa9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZFa9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png" width="694" height="233" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c85a753-deed-45e7-9904-e24e49511714_694x233.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:233,&quot;width&quot;:694,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZFa9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 424w, https://substackcdn.com/image/fetch/$s_!ZFa9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 848w, https://substackcdn.com/image/fetch/$s_!ZFa9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFa9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c85a753-deed-45e7-9904-e24e49511714_694x233.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>(from Dr. Mark Horowitz, Stanford)</em></figcaption></figure></div><p>d-Matrix is developing digital processing-in-memory for matrix-vector operations; by integrating custom digital adder trees directly into SRAM memory arrays, they can still get huge performance advantages. 
The memory isn&#8217;t nearly as dense as the analog arrays proposed by Mythic or the RRAM labs, but they still reduce data movement massively. BTQ&#8217;s QCIM focuses on the vector-vector operations at the heart of the number theoretic transform, which powers NIST standard cryptographic algorithms like ML-KEM and ML-DSA.</p><p>By narrowly focusing on the biggest advantage PIM offers -- reducing data movement -- and ignoring the secondary benefits of all-analog architectures, startups have built legitimately successful products. And the story is pretty similar for another field that&#8217;s traditionally used analog architecture: neuromorphic computing.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/making-unconventional-computing-practical?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/making-unconventional-computing-practical?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/making-unconventional-computing-practical?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>Neuromorphic Computing</h1><p>The original dream of neuromorphic computing, cooked up by <a href="https://en.wikipedia.org/wiki/Carver_Mead">Carver Mead</a> in the 1980s, was to build ultra-efficient analog circuits that operated similarly to neurons in the brain. Neurons communicate with electrical pulses, called spikes, that are integrated as charge across the cell membrane of the neuron. 
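</p><p>That integrate-and-fire behavior is simple enough to state in a few lines; here is a minimal leaky integrate-and-fire model with illustrative parameters:</p>

```python
# Minimal leaky integrate-and-fire neuron: each input spike deposits charge
# on a membrane state that leaks over time; crossing the threshold emits an
# output spike and resets the membrane. Parameters are illustrative.
def lif(spike_train, leak=0.9, weight=0.5, threshold=1.0):
    v, out = 0.0, []
    for s in spike_train:
        v = leak * v + weight * s  # leak, then integrate the incoming spike
        if v >= threshold:         # threshold crossing: fire and reset
            out.append(1)
            v = 0.0
        else:
            out.append(0)
    return out

print(lif([1, 1, 1, 0, 0, 1, 1, 1]))  # [0, 0, 1, 0, 0, 0, 0, 1]
```

<p>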
The most traditional neuromorphic systems operate the same way, with spikes of voltages integrated onto capacitors inside of silicon neurons. Some neuromorphic systems, like <a href="https://en.wikipedia.org/wiki/Neurogrid">Neurogrid</a>, take this to an extreme, attempting to model all of the complex ion channels in human neurons as accurately as possible using silicon.</p><p>But when researchers tried to use these neuromorphic computers to accelerate meaningful machine learning models, they hit a ton of roadblocks. Because neuromorphic architectures use analog circuits, they struggle with <a href="https://web.stanford.edu/group/brainsinsilicon/documents/ReidPinT_DATE.pdf">temperature changes</a> and <a href="https://web.stanford.edu/group/brainsinsilicon/documents/abrams1.pdf">manufacturing variability</a>. But also, the spiking neural networks, or SNNs, native to neuromorphic computers aren&#8217;t compatible with conventional machine learning techniques. Gradient-based training methods, like backpropagation, <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10242251">don&#8217;t support the non-differentiable spiking activations inside of SNNs</a>. This makes commercialization of neuromorphic computing technologies difficult. Customers want hardware that can run the models they already know and love.</p><p>There are some successful efforts in the realm of neuromorphic computing, though! But just like d-Matrix and BTQ did for processing-in-memory, they focus on achieving the key advantages of neuromorphic computing in the most practical way. In the case of neuromorphic computing, that advantage is <em>sparsity.</em> Neuromorphic computing systems are so efficient because they operate using sparse trains of spikes, rather than dense vectors of activations. 
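</p><p>The arithmetic behind that efficiency is easy to sketch: in an event-driven system, work scales with the number of spikes rather than the layer size. The sizes and activity level below are illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# One 256x256 layer. A dense network does a multiply-accumulate for every
# input; an event-driven one only does work when an input actually fires.
W = rng.normal(size=(256, 256))
dense_act = rng.normal(size=256)                               # all nonzero
sparse_act = np.where(rng.random(256) < 0.05, dense_act, 0.0)  # ~5% events

dense_macs = W.shape[0] * np.count_nonzero(dense_act)
sparse_macs = W.shape[0] * np.count_nonzero(sparse_act)
print(dense_macs, sparse_macs)  # the sparse case does far fewer MACs
```

<p>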
By focusing on sparsity specifically, some companies have been able to build commercially viable neuromorphic-inspired chips.</p><p>The team at <a href="https://femto.ai/">femtoAI</a> (formerly Femtosense) are building neuromorphic-inspired ultra-power-efficient edge AI processors tailored to sparse networks. However, their chips use digital rather than analog circuits, and operate on sparse conventional neural networks, rather than the esoteric and unwieldy SNNs used by more traditional neuromorphic systems. Their focus on the most impactful benefit of neuromorphic computing ultimately allowed them to get meaningful commercial traction in the <a href="https://www.eenewseurope.com/en/partnership-brings-energy-saving-ai-technology-to-home-appliances/">smart home</a> and <a href="https://hearingreview.com/hearing-products/hearing-aids/otc/newsound-ai-driven-otc-hearing-aids-feature-femtosense-technology">hearing aid</a> markets.</p><h1>Thermodynamic / Probabilistic Computing</h1><p>Last but not least is thermodynamic and probabilistic computing, two related fields both focusing on leveraging specialized architectures to accelerate stochastic differential equations (SDEs) and stochastic random sampling workloads. I ran the silicon team at <a href="https://www.normalcomputing.com/">Normal Computing</a> for a year, and before that worked on <a href="https://www.zach.be/p/solving-optimization-problems-with">DIMPLE, the largest open-source fully-connected Ising machine ever demonstrated</a> -- so I know the space pretty intimately.</p><p>The story of thermodynamic and probabilistic computing starts off similarly to that of processing-in-memory and neuromorphic computing: a big dream and a lot of cool ideas for new circuits and new materials. <a href="https://www.purdue.edu/p-bit/">Supriyo Datta&#8217;s group at Purdue</a> proposed building probabilistic bits (p-bits) using stochastic magnetic tunnel junctions, which required new materials and fab processes. 
Normal Computing proposed <a href="https://arxiv.org/abs/2302.06584">analog circuits with thermal noise</a> to act as unit cells that could be coupled together. And <a href="https://www.zach.be/p/whats-the-difference-between-extropic">Extropic proposed superconducting circuits</a> to implement their p-bits. All of these were exciting ideas. None were particularly practical.</p><p>Currently, the largest published result demonstrating a p-bit system comes from <a href="https://opus.ece.ucsb.edu/">Kerem Camsari&#8217;s lab</a>. They don&#8217;t use magnetic tunnel junctions or analog circuits. Instead, they use a digital random number generator to generate noise, and digital logic to implement couplings between p-bits. Normal Computing is <a href="https://www.zach.be/p/scaling-thermodynamic-computing">also leveraging a digital architecture</a> to accelerate the SDEs at the heart of diffusion models. And Extropic has ditched their esoteric superconductors in favor of mixed-signal CMOS circuits.</p><p>It took a little while, but so far, it&#8217;s seemed like the innovators in the world of thermodynamic computing have learned the lessons from other kinds of unconventional computing. Often, analog circuits are expensive and difficult to implement, even though they may be elegant and exciting. Digital architectures are reliable, efficient, and easy to manufacture and sell at scale. 
And so when unconventional computing technologies move from academic labs into industry, which cares less about publishing novel papers and more about selling chips, scalable digital architectures designed to maximize the key benefits of a new technology, while avoiding its biggest downsides, often win out.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[RL isn’t the silver bullet for AI-powered chip design]]></title><description><![CDATA[But new EDA tools could change that&#8230;]]></description><link>https://www.zach.be/p/rl-isnt-the-silver-bullet-for-ai</link><guid isPermaLink="false">https://www.zach.be/p/rl-isnt-the-silver-bullet-for-ai</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Thu, 25 Sep 2025 15:56:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My most popular post of all time was about <a href="https://www.zach.be/p/yc-is-wrong-about-llms-for-chip-design">why Y Combinator is wrong about LLMs for chip 
design</a>. Put simply, I don&#8217;t think LLMs are able to generate high-performance or high-efficiency chip designs, because the process of designing high-performance chips is incredibly unforgiving. The entire &#8220;vibe coding&#8221; trend is about generating a large quantity of mediocre code, while the challenge of chip design is writing a relatively small amount of extremely high-quality, extremely performance-sensitive code in a specialized language called Verilog. And so far, I&#8217;ve been proven right. By and large, LLMs kinda suck at writing Verilog.</p><p>Never fear, though! Reinforcement learning (aka RL) will come to the rescue, right? RL has proven invaluable when it comes to making normally error-prone AI agents solve unforgiving tasks, from playing board games to constructing complex mathematical proofs. Why can&#8217;t we leverage RL systems to train AI models to write the sort of high-quality, performance-sensitive Verilog that is required to deliver high-performance chips?</p><p>Well, it turns out that it&#8217;s really, really hard, but not for the reasons you might think. Put simply, existing chip design tools are just too slow to allow any RL system to learn anything useful in a reasonable amount of time.</p><h1>How would RL help?</h1><p>Reinforcement learning is a training methodology for AI models that lets them improve at a specific task through trial-and-error. This is particularly useful for tasks that are not well-represented in the training set, but can be quick for a model to try and fail at. A great example is mathematical proofs. Google&#8217;s AlphaProof was able to learn to solve IMO math problems by repeatedly trying solutions and seeing if they were logically consistent.</p><p>One of the key pieces of the RL training process is the &#8220;RL environment&#8221;. These environments are sandboxes that allow AI models to perform tasks and see how well they do. In the case of AlphaProof, the environment was an automated theorem prover that could check if the model&#8217;s generated proof was correct. For coding models, the environment could measure how many unit tests a model&#8217;s code passes.</p><p>In the world of chip design, the environment seems pretty obvious. Most industry standard EDA tools can measure a design&#8217;s correctness, speed, size, and power consumption. So we should be able to train a model to write Verilog code, floorplan a chip, design SPICE netlists, or lay out analog circuits, and then measure the performance of those circuits using these industry standard chip design and analysis tools. 
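</p><p>As a sketch of what such an environment&#8217;s reward signal might look like (the metric names and weights here are hypothetical, not taken from any real EDA tool):</p>

```python
def ppa_reward(tests_passed: int, tests_total: int,
               timing_slack_ns: float, power_mw: float) -> float:
    """Toy reward for a chip-design RL environment: correctness gates
    everything, then timing and power shape the score."""
    if tests_passed < tests_total:
        # A functionally broken design earns no PPA credit at all.
        return -1.0 + tests_passed / tests_total
    reward = 1.0                         # base reward for passing all tests
    reward += min(timing_slack_ns, 1.0)  # bonus for positive slack, capped
    reward -= 0.01 * power_mw            # penalty for power draw
    return reward
```

<p>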
But in practice, this is extremely difficult to actually pull off.</p><h1>Why is it so hard to use RL for chip design?</h1><p>If RL allows AI models to effectively learn from a reward signal, and chip design tools have such clear reward signals, why hasn&#8217;t RL resulted in an AI chip design breakthrough? Well, the problem is that chip design tools run extremely slowly. The process of going from Verilog to a set of logic gates, called <em>synthesis</em>, can take hours for a large design. Actually placing those logic gates on a chip layout and wiring them all up, called <em>place and route</em>, can take even longer. And then analyzing the timing and power information from that final design takes a while, too.</p><p>The same problem exists in the analog design world. Analog designs feature fewer transistors, but they need to be laid out and simulated with far more precision than digital logic gates. Circuit simulators need to solve complicated differential equations and take manufacturing variability into account, or a circuit that works in simulation may not work when it&#8217;s actually baked into a physical chip. Often, signing off on an analog circuit design requires hours or even days&#8217; worth of simulation.</p><h1>How do we go faster?</h1><p>If the biggest challenge stopping RL from being useful for chip design is how slow the tools are, why not just make the tools faster? Well, a handful of new tools may make that possible.</p><p><a href="https://www.silimate.com/">Silimate</a> offers an AI-powered PPA (power, performance, area) prediction tool that can estimate a design&#8217;s power consumption and timing characteristics, without having to go through the conventional synthesis and place-and-route process. Their methodology is far faster than conventional methods, and could form the basis of an RL environment that can learn at a fast enough pace to actually get good at writing Verilog. However, having an AI-powered reward function is also risky; the AI model being trained could end up <a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a> by writing code that the PPA prediction tool thinks will perform well, but would actually perform poorly when passed through the conventional, slower synthesis and place-and-route tools.</p><p>There are also companies like <a href="https://partcl.com/">Partcl</a> who are tackling the problem of accelerating silicon tooling more directly. By accelerating place-and-route and static timing analysis using GPUs, they can speed up key aspects of a chip design RL environment by orders of magnitude. 
However, <a href="https://www.zach.be/p/why-is-it-so-hard-for-startups-to">building a complete set of chip design tools with full compatibility with modern PDKs is really, really hard</a> -- and making it fast enough to enable RL is even harder. On the analog design side, there&#8217;s <a href="https://dashcrystal.com/">Dash Crystal</a>, who are also building GPU-accelerated simulation tools.</p><p>Ultimately, I have faith that eventually, we&#8217;ll get fast enough chip design tools to enable RL. Whether or not the engineers at these startups are even thinking about RL, faster simulation and analysis tools for chip design are valuable in and of themselves. So while it may take a lot of time and effort to develop the full suite of tools necessary for a high-speed, high-performance RL environment for chip design, all of the intermediate steps are still valuable to us human chip designers by making our current workflows that much faster.</p>]]></content:encoded></item><item><title><![CDATA[How did AMD’s secure enclave get hacked?]]></title><description><![CDATA[Trusting TEEs is harder than you&#8217;d think.]]></description><link>https://www.zach.be/p/how-did-amds-secure-enclave-get-hacked</link><guid isPermaLink="false">https://www.zach.be/p/how-did-amds-secure-enclave-get-hacked</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 18 Aug 2025 14:21:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lZrJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the past couple of years, large swaths of the security community have been getting excited about trusted execution environments (TEEs) and secure enclaves, like AMD&#8217;s Secure Encrypted Virtualization (SEV). Compared to other privacy-enhancing technologies like <a href="https://en.wikipedia.org/wiki/Homomorphic_encryption">homomorphic encryption</a> and <a href="https://en.wikipedia.org/wiki/Zero-knowledge_proof">zero-knowledge proofs</a>, TEEs represent a much faster, cheaper, and more efficient way to secure data. This makes TEEs a viable solution for securing large, complex workloads. People have proposed using TEEs to <a href="https://www.zach.be/p/openais-secure-infrastructure-isnt">secure AI workloads</a>, to <a href="https://writings.flashbots.net/introducing-rollup-boost">accelerate blockchain rollups</a>, and to <a href="https://www.anjuna.io/">enable data cleanrooms</a>. 
But TEEs aren&#8217;t as secure as we might want to believe.</p><p>Unlike cryptographic privacy-enhancing technologies, TEEs and secure enclaves aren&#8217;t mathematically guaranteed to be secure. They&#8217;re only secure because a bunch of hardware security engineers worked hard to try and figure out every vulnerability and protect against it&#8230; but sometimes they miss things. Earlier this year, we talked about <a href="https://www.zach.be/p/hacking-trusted-execution-environments">BadRAM</a>, which modifies DRAM memory modules to enable unauthorized access to secure memory. There are a large number of <a href="https://www.zach.be/p/apples-m-series-cpus-just-got-hacked">speculative execution vulnerabilities</a>, which leverage performance-enhancing processor features to leak data access patterns. And that&#8217;s not even mentioning <a href="https://www.zach.be/p/side-channel-attacks-on-ai-chips">side-channel attacks</a>, which require physical access to a chip but can <a href="https://www.zach.be/p/the-most-secure-chip-in-the-world">easily leak cryptographic keys</a>.</p><p>Well, there&#8217;s a new TEE vulnerability in town. The <a href="https://heracles-attack.github.io/">Heracles attack</a> on AMD&#8217;s SEV technology enables a malicious hypervisor to read secure memory from a user process. Essentially, this means that a malicious cloud provider could access secret customer data running in confidential TEEs on their hardware. Today, we&#8217;re going to walk through the specifics of the attack, how to stop it, and what it means for TEE design in the future.</p><h1>How do TEEs encrypt memory?</h1><p>Trusted execution environments let user processes define different memory regions as <em>enclaves</em>, which can only be accessed by that specific process. This prevents other processes on the system from accessing or tampering with a user&#8217;s secret data. Most TEEs also feature both hardware and software isolation features to prevent processes from accessing each other&#8217;s memory regions. But encrypting large amounts of active memory <a href="https://en.wikipedia.org/wiki/Disk_encryption_theory">is a difficult problem</a>.</p><p>To make a TEE work efficiently, large blocks of data need to be securely encrypted without significantly impacting the latency of accessing and using that data. At the same time, a user needs to be able to access and decrypt any chunk of data from anywhere on the disk without too much overhead. Many common block cipher modes, like <a href="https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation">cipher block chaining</a>, chain blocks together so that each block&#8217;s ciphertext depends on the previous block&#8217;s, which makes random writes expensive: updating one block means re-encrypting every block that follows it. There are workarounds, but they require storing a kind of metadata called <em>initialization vectors</em> alongside each sector of the disk.</p><p>AMD SEV uses a block cipher mode called <a href="https://en.wikipedia.org/wiki/Xor%E2%80%93encrypt%E2%80%93xor">Xor-Encrypt-Xor (XEX)</a> for memory encryption. XEX is designed from the ground up to be well-suited to encrypting large chunks of memory. 
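</p><p>To make the structure concrete, here&#8217;s a toy XEX construction in Python. The Feistel rounds below are a stand-in for the real block cipher (SEV uses AES), and the address-to-tweak derivation is invented for illustration; only the xor, encrypt, xor shape is the point:</p>

```python
import hashlib

BLOCK = 16  # 16-byte blocks, matching AES

def _round_fn(key: bytes, rnd: int, half: bytes) -> bytes:
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:8]

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _toy_cipher(key: bytes, block: bytes) -> bytes:
    """4-round Feistel permutation standing in for AES (NOT secure)."""
    L, R = block[:8], block[8:]
    for rnd in range(4):
        L, R = R, _xor(L, _round_fn(key, rnd, R))
    return L + R

def _toy_decipher(key: bytes, block: bytes) -> bytes:
    """Inverse of _toy_cipher: same rounds, applied in reverse."""
    L, R = block[:8], block[8:]
    for rnd in reversed(range(4)):
        L, R = _xor(R, _round_fn(key, rnd, L)), L
    return L + R

def tweak_for(addr: int, tweak_key: bytes) -> bytes:
    """Derive the per-address tweak (illustrative, not AMD's actual scheme)."""
    return hashlib.sha256(tweak_key + addr.to_bytes(8, "big")).digest()[:BLOCK]

def xex_encrypt(key: bytes, tweak_key: bytes, addr: int, pt: bytes) -> bytes:
    T = tweak_for(addr, tweak_key)
    return _xor(_toy_cipher(key, _xor(pt, T)), T)  # xor, encrypt, xor

def xex_decrypt(key: bytes, tweak_key: bytes, addr: int, ct: bytes) -> bytes:
    T = tweak_for(addr, tweak_key)
    return _xor(_toy_decipher(key, _xor(ct, T)), T)
```

<p>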
Both XEX encryption and decryption have independent block processing, which means that decryption can be parallelized to significantly improve latency. XEX also avoids initialization vectors by <em>tweaking one</em> of the encryption keys based on the address of the data in memory. That way, a user can read any chunk of memory they want, independently of other blocks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lZrJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lZrJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 424w, https://substackcdn.com/image/fetch/$s_!lZrJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 848w, https://substackcdn.com/image/fetch/$s_!lZrJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 1272w, https://substackcdn.com/image/fetch/$s_!lZrJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lZrJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png" width="841" height="323" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:323,&quot;width&quot;:841,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lZrJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 424w, https://substackcdn.com/image/fetch/$s_!lZrJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 848w, https://substackcdn.com/image/fetch/$s_!lZrJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 1272w, https://substackcdn.com/image/fetch/$s_!lZrJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb33485d8-9dc2-4ec3-9be5-5dcde06d5923_841x323.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>(<a href="https://en.wikipedia.org/wiki/Disk_encryption_theory">from Wikipedia</a>)</em></figcaption></figure></div><p>This address-based key tweak is important, as it prevents an adversary from copying a user&#8217;s data to a different chunk of memory and then decrypting it. Because the copied data is at a different address, it will decrypt with a different key, and turn into gibberish. But the Heracles attack breaks that assumption using some clever memory-swapping tricks.</p><h1>How does Heracles work?</h1><p>Heracles relies on the fact that the <a href="https://en.wikipedia.org/wiki/Hypervisor">hypervisor</a>, which is the piece of software that manages virtualization on a computer, is capable of moving user data around in memory. When the hypervisor does this, the memory controller re-encrypts the memory with a new key, based on the new memory address. But while the encryption keys vary from address to address, they don&#8217;t change over time.</p><p>To perform the attack, a malicious hypervisor sets up two chunks (pages) of memory: one controlled by the victim, and one controlled by the hypervisor. The attacker observes the encrypted value stored in the victim&#8217;s page. Then, they write a guess into their own page, and swap the two pages. Finally, the attacker observes the encrypted value now stored in their page, which sits at the address where the victim&#8217;s memory just was. If the encrypted value of the attacker&#8217;s memory matches the old encrypted value of the victim&#8217;s memory, they must also have had the same plaintext. 
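</p><p>The swap-and-compare loop is simple enough to sketch. Below, the encryption oracle is a stand-in (a keyed hash of address plus plaintext) that captures the two properties Heracles exploits, namely that ciphertexts are deterministic and address-dependent; the key, addresses, and names are made up:</p>

```python
import hashlib

KEY = b"memory-controller-key"  # hidden inside the memory controller

def observe(addr: int, plaintext: bytes) -> bytes:
    """What reading raw DRAM shows an attacker: a deterministic,
    address-tweaked ciphertext (modeled here as a keyed hash)."""
    return hashlib.sha256(KEY + addr.to_bytes(8, "big") + plaintext).digest()

def heracles_guess(victim_addr: int, victim_ct: bytes, candidates):
    """Write each guess into our own page, swap it into the victim's
    address, and check whether the re-encrypted value matches."""
    for guess in candidates:
        # After the swap, our guess sits at victim_addr, so the memory
        # controller encrypts it with the victim's address tweak.
        if observe(victim_addr, guess) == victim_ct:
            return guess  # matching ciphertexts => matching plaintexts
    return None
```

<p>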
If the attacker continues on this way block-by-block, they can reconstruct the entire chunk of secure user memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3O-u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3O-u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 424w, https://substackcdn.com/image/fetch/$s_!3O-u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 848w, https://substackcdn.com/image/fetch/$s_!3O-u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 1272w, https://substackcdn.com/image/fetch/$s_!3O-u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3O-u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png" width="593" height="424" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2b87c19-cb20-4823-a324-87e2642559f9_593x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:424,&quot;width&quot;:593,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3O-u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 424w, https://substackcdn.com/image/fetch/$s_!3O-u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 848w, https://substackcdn.com/image/fetch/$s_!3O-u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 1272w, https://substackcdn.com/image/fetch/$s_!3O-u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2b87c19-cb20-4823-a324-87e2642559f9_593x424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Unfortunately, guessing full blocks this way is infeasible. AES uses 16-byte blocks, so an adversary would need to try up to 2<sup>128</sup> different plaintexts for every block. Luckily, the Heracles team has a clever workaround.</p><p>Specifically, the Heracles team tailors their attack to work in situations where the attacker only needs to guess a single byte. This narrows down the search space from 2<sup>128</sup> down to 2<sup>8</sup> -- just 256 guesses per byte, instead of an astronomically large search. And it turns out that there are a lot of situations where such a byte-by-byte attack makes sense. For example, when a user is inputting their password while running the <a href="https://en.wikipedia.org/wiki/Sudo">sudo command</a>, that password is written to memory one byte at a time, so Heracles can easily leak it.</p><p>Using similar tricks to restrict an attack to single-byte guessing, the Heracles team can hijack a user&#8217;s session on a webserver, and they can even leak a user&#8217;s cryptographic keys. 
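</p><p>A sketch of that byte-at-a-time recovery, reusing the same keyed-hash stand-in for the memory controller&#8217;s address-tweaked encryption (the key, address, and page size are all illustrative):</p>

```python
import hashlib

KEY = b"memory-controller-key"
VICTIM_ADDR = 0x7000
PAGE = 16  # pad guesses to a full block

def observe(addr: int, plaintext: bytes) -> bytes:
    """Deterministic, address-tweaked ciphertext (keyed-hash stand-in)."""
    return hashlib.sha256(KEY + addr.to_bytes(8, "big") + plaintext).digest()

def leak_byte_by_byte(snapshots):
    """snapshots[i] is the ciphertext observed after the victim has
    written i+1 bytes of a secret (e.g. a password typed into sudo).
    Each new byte costs at most 256 swap-and-compare probes, not 2**128."""
    known = b""
    for ct in snapshots:
        for b in range(256):
            guess = (known + bytes([b])).ljust(PAGE, b"\x00")
            if observe(VICTIM_ADDR, guess) == ct:
                known += bytes([b])
                break
    return known
```

<p>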
This makes Heracles a powerful and flexible attack, even with its performance limitations.</p><h1>So what do we do now?</h1><p>Luckily, there are mitigations to the Heracles attack. AMD already added support for a policy that disables the ability of the hypervisor to swap guest pages, but this may negatively affect memory utilization and overall system performance. And as far as we know, this group of ETH Zurich researchers are the first to have discovered this attack and disclosed it responsibly, so AMD was able to deploy a fix before anybody was able to leverage Heracles in the wild.</p><p>More generally, though, modern TEEs have some fundamental problems. Firstly, they&#8217;re proprietary, poorly documented, and often have their security features partially obfuscated by the companies that design them. Because the security community isn&#8217;t able to audit these TEEs, there could always be another vulnerability lurking around the corner, waiting to be discovered. That&#8217;s part of the reason I&#8217;m excited about <a href="https://writings.flashbots.net/ZTEE">open-source TEEs</a> and their ability to restore trust in the world of secure hardware.</p><p>Beyond that, building secure TEEs faces some deep engineering challenges. Modern TEEs like AMD SEV and Intel SGX often act as a set of additional features crudely bolted onto the side of an existing, insecure processor. Building a truly secure processor might require a fundamentally different architecture. For example, many modern CPUs run <a href="https://en.wikipedia.org/wiki/Instruction_pipelining">deep out-of-order pipelines</a> to increase performance, but these often <a href="https://en.wikipedia.org/wiki/Spectre_(security_vulnerability)">introduce side-channel attacks</a>. A security-oriented high-performance processor might use a large number of parallel compute cores without any pipelining to prevent these sorts of attacks. 
But that sort of design wouldn&#8217;t be possible inside the current paradigm of adding additional security instructions to conventional processor cores. I hope one day, the demand for secure processing grows to the point where engineers can build hardware that&#8217;s actually secure from the ground up.</p>]]></content:encoded></item><item><title><![CDATA[Why did Tesla Dojo fail?]]></title><description><![CDATA[Building training chips is a unique challenge.]]></description><link>https://www.zach.be/p/why-did-tesla-dojo-fail</link><guid isPermaLink="false">https://www.zach.be/p/why-did-tesla-dojo-fail</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 11 Aug 2025 16:54:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xG2N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This week, news broke that Tesla is <a href="https://techcrunch.com/2025/08/07/tesla-shuts-down-dojo-the-ai-training-supercomputer-that-musk-said-would-be-key-to-full-self-driving/">shutting down the Dojo supercomputer</a> it was using to 
train its full self-driving models. Part of the reason for the shutdown was the departure of a number of key executives, who <a href="https://www.bloomberg.com/news/articles/2025-08-06/former-tesla-executives-start-automotive-ai-company-densityai">left to found a startup called DensityAI</a>. However, I think there are clear technical reasons why it didn&#8217;t make sense for Tesla to keep developing its own training chips in-house.</p><p>And Tesla isn&#8217;t the first company to have given up on training. <a href="https://www.zach.be/p/why-is-sambanova-giving-up-on-ai">Groq, Cerebras, and SambaNova all used to be targeting training as well as inference</a>, but have pivoted, one by one, to a pure inference focus. This isn&#8217;t surprising; not only are there unique technical challenges involved in building training chips, but also, moving training runs to new hardware is an immense financial risk for any company. Ultimately, I think Tesla is making the right decision to focus its custom silicon efforts on inference and leave training to Nvidia silicon.</p><h1>The risk of failed training runs</h1><p>Training AI models is hard. It&#8217;s also usually very, very expensive. 
Training GPT-4-scale large language models can take many months and <a href="https://en.wikipedia.org/wiki/GPT-4#:~:text=Sam%20Altman%20stated%20that%20the,4%20had%201%20trillion%20parameters.">cost over $100 million</a>. But training isn&#8217;t just a matter of spending a lot of money on compute and waiting until the model is ready. During the process of training, engineers monitor the <a href="https://machine-learning.paperspace.com/wiki/accuracy-and-loss#loss">network&#8217;s accuracy</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> to ensure that the model is continuing to learn as new data is being fed into the training process. If the network accuracy suddenly starts dropping despite the model getting more data, the model may be failing to converge, which could put the entire training run at risk.</p><p>There are a lot of potential causes for convergence failure when training large models, but numerical stability is often a key reason a training run may go wrong. As models are trained at <a href="https://www.zach.be/p/how-do-nvidia-blackwell-gpus-train">lower and lower precisions</a>, the specific way low-precision numbers are handled in hardware starts to affect how well a training run converges. And unfortunately, <a href="https://www.zach.be/p/how-do-nvidia-blackwell-gpus-train">this is not something that is constant across different pieces of hardware</a>. 
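</p><p>A concrete example of the kind of numerical hazard involved: gradients that are perfectly representable in fp32 can underflow to zero in fp16, which is why mixed-precision training applies <em>loss scaling</em>. A quick illustration using Python&#8217;s half-precision packing (a toy demonstration, not any framework&#8217;s actual API):</p>

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision, the format
    low-precision training hardware stores activations and gradients in."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_grad = 1e-9           # representable in fp32, far below fp16's range
print(to_fp16(tiny_grad))  # underflows to 0.0: the gradient vanishes

# Loss scaling: multiply the loss (and therefore every gradient) by a
# large constant before the low-precision backward pass, then divide it
# back out when accumulating in full precision.
SCALE = 1024.0
scaled = to_fp16(tiny_grad * SCALE)  # now survives the fp16 round-trip
recovered = scaled / SCALE           # unscale in fp64: close to 1e-9 again
```

<p>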
Nvidia GPUs have automatic mixed-precision features that will automatically tune mixed-precision models to maximize tensor core utilization, scale the loss to avoid non-convergence, and handle issues with out-of-bounds gradient values.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xG2N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xG2N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 424w, https://substackcdn.com/image/fetch/$s_!xG2N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 848w, https://substackcdn.com/image/fetch/$s_!xG2N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 1272w, https://substackcdn.com/image/fetch/$s_!xG2N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xG2N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png" width="707" height="421" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:421,&quot;width&quot;:707,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xG2N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 424w, https://substackcdn.com/image/fetch/$s_!xG2N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 848w, https://substackcdn.com/image/fetch/$s_!xG2N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 1272w, https://substackcdn.com/image/fetch/$s_!xG2N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20e4168f-c983-4ad5-93fb-942ec3ae6387_707x421.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Mixed precision training with loss scaling <em>(<a href="https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html">Nvidia Docs</a>)</em></figcaption></figure></div><p>Existing training pipelines for large models rely on these hardware-specific features. If you want to shift your training process to a completely new piece of hardware that handles mixed-precision arithmetic differently, like the <a href="https://cdn.motor1.com/pdf-files/535242876-tesla-dojo-technology.pdf">Tesla Dojo</a>, you&#8217;ll also need to re-develop your training pipeline. And if you re-develop your training pipeline, there&#8217;s a good chance you run into non-convergence issues the first few times you try to actually train models. For large models, this is a hugely expensive risk to run.</p><p>Notably, this isn&#8217;t an issue for inference. While models do need to be optimized for inference on specific pieces of hardware, that optimization process is much faster and easier. 
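</p><p>As a toy example of this kind of check, the following Python sketch compares forward passes of a made-up two-layer network at full and reduced precision; the model, shapes, and tolerance are all illustrative:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up two-layer network standing in for the model being ported.
W1, b1 = 0.1 * rng.normal(size=(16, 8)), 0.1 * rng.normal(size=8)
W2, b2 = 0.1 * rng.normal(size=(8, 4)), 0.1 * rng.normal(size=4)

def forward(x, dtype):
    """Forward pass with weights and activations cast to `dtype`,
    mimicking running the model on lower-precision hardware."""
    h = np.maximum(x.astype(dtype) @ W1.astype(dtype) + b1.astype(dtype), 0)
    return (h @ W2.astype(dtype) + b2.astype(dtype)).astype(np.float32)

x = rng.normal(size=(32, 16)).astype(np.float32)
ref = forward(x, np.float32)   # trusted reference implementation
port = forward(x, np.float16)  # the "ported" low-precision version

# A single batch of forward passes is enough to measure the drift.
max_err = float(np.max(np.abs(ref - port)))
assert max_err < 0.05  # loose, illustrative tolerance for fp16
```

<p>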
Whereas a new hardware-specific training pipeline takes weeks of validation to ensure that it&#8217;s numerically stable, an inference implementation can be benchmarked in minutes by simply running forward passes through the network. <a href="https://www.zach.be/p/why-is-sambanova-giving-up-on-ai">This is part of the reason why we see much more hardware diversity in inference than in training</a>; put simply, it&#8217;s much easier and less risky to port an inference pipeline to a new chip than a training pipeline.</p><p>Thus far, Tesla&#8217;s training process for their FSD models has <a href="https://techcrunch.com/2025/08/07/tesla-shuts-down-dojo-the-ai-training-supercomputer-that-musk-said-would-be-key-to-full-self-driving/">leveraged both Tesla Dojo chips and Nvidia chips</a>, which means that Tesla has had to maintain two complex training recipes for large models, each accounting for the specific quirks of a different piece of hardware. I believe that simplifying to training purely on Nvidia hardware, <a href="https://www.reuters.com/business/autos-transportation/tesla-streamline-its-ai-chip-design-work-musk-says-2025-08-07/">while using its custom AI5 and AI6 silicon for inference</a>, makes a lot of sense for the company.</p><p>But there&#8217;s another piece of this story: all the executives from Dojo who left the company to found a startup called DensityAI. If training on new hardware is so difficult, why are they starting a company to do just that?</p><h1>What is DensityAI, anyways?</h1><p><a href="https://www.densityai.com/">DensityAI</a> has been pretty secretive thus far. <a href="https://www.bloomberg.com/news/articles/2025-08-06/former-tesla-executives-start-automotive-ai-company-densityai">According to Bloomberg</a>, they&#8217;re a hardware company focused on industries like automotive and robotics. <a href="https://tally.so/r/wk0kbj">They&#8217;re hiring for roles focused on the datacenter</a>, not on the edge -- so they&#8217;re not building inference chips to go in cars or robots. Instead, they&#8217;re building infrastructure for training models. But a key clue comes from some of their job postings, which specifically mention <a href="https://tally.so/r/wz01jE">GPU programming</a> and <a href="https://tally.so/r/wA6Dr0">CUDA programming</a>. At the same time, their only open hardware roles are focused on <a href="https://tally.so/r/w41vQO">packaging, PCB design, and thermal management</a>. To me, this indicates that DensityAI isn&#8217;t trying to build new chips to replace Nvidia GPUs. Instead, they&#8217;re trying to build new datacenters optimized for large-scale training using existing GPUs.</p><p>On one hand, this makes some sense. Most automotive manufacturers and robotics companies don&#8217;t have the resources or talent to maintain large datacenters for training state-of-the-art models for autonomous systems. 
DensityAI could make significant inroads with companies that want access to the best AI models for autonomy, but don&#8217;t have the capacity to train those models themselves.</p><p>However, it&#8217;s also unclear to me why autonomous vehicles and robots require specialized datacenters built from the ground up. Hyperscalers like Amazon, Google, and Microsoft already have compelling cloud datacenter offerings, which I would assume can handle training what is essentially a very large and complex computer vision model. It could be the case that certain automotive reliability and security standards, like <a href="https://en.wikipedia.org/wiki/ISO_26262">ISO 26262</a> or <a href="https://en.wikipedia.org/wiki/ISO/SAE_21434">ISO/SAE 21434</a>, require specialized, secure training for autonomous driving models that would make the hyperscaler cloud solutions non-viable -- but to answer that question, we&#8217;ll have to wait for DensityAI to fully come out of stealth.</p><p>Ultimately, it seems like the DensityAI team has learned from Tesla&#8217;s failures: they&#8217;ve stayed focused on building the best training systems they can out of existing GPU hardware and state-of-the-art training pipelines, rather than taking the massive risk of designing new training hardware from scratch and sinking months and millions of dollars into training runs that fail to converge.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Technically, they monitor the training loss, validation loss, and perplexity, as well as other task-specific metrics.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Sabotaging AI models with GPUHammer]]></title><description><![CDATA[AI chips continue to be insecure.]]></description><link>https://www.zach.be/p/sabotaging-ai-models-with-gpuhammer</link><guid isPermaLink="false">https://www.zach.be/p/sabotaging-ai-models-with-gpuhammer</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Tue, 22 Jul 2025 15:40:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bdf5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>GPUs and AI chips continue to be a huge area of investment, from <a href="https://www.zach.be/p/why-hasnt-there-been-a-big-winner">edge chips like the Jetson</a> to <a href="https://www.zach.be/p/how-do-nvidia-blackwell-gpus-train">datacenter behemoths like the B200</a>. But one area of research that&#8217;s been continuously overlooked is the security of these systems. If we want GPU-powered systems to make medical decisions or pilot drones, they should be secure. 
I&#8217;m not the only person saying this: since mid-2024, <a href="https://www.zach.be/p/openais-secure-infrastructure-isnt">OpenAI has been pushing for secure and trusted GPUs to protect their models</a>. But what OpenAI is proposing simply <a href="https://www.zach.be/p/openais-secure-infrastructure-isnt">isn&#8217;t secure enough</a>.</p><p>Specifically, OpenAI and Nvidia are advocating for <a href="https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/">trusted execution environments</a>, which keep AI weights and inputs encrypted until they&#8217;re securely inside the GPU, ready for processing. This is an important security feature, but it has its limitations. Firstly, <a href="https://www.zach.be/p/hacking-trusted-execution-environments">TEE implementations are often flawed</a>, which can undermine the security they offer. But more importantly, TEEs can&#8217;t protect against <a href="https://www.zach.be/p/the-basics-of-side-channel-analysis">side-channel attacks</a> like power analysis and electromagnetic analysis -- <a href="https://www.zach.be/p/side-channel-attacks-on-ai-chips">attacks that GPUs are vulnerable to</a>. Luckily, most side-channel attacks on GPUs require physical access to the chip to attach power or EM probes, making them practical only against edge devices.</p><p>Recently, though, researchers have identified a GPU hardware attack that can be carried out fully remotely on a single shared piece of hardware. It&#8217;s called <a href="https://gpuhammer.com/">GPUHammer</a>, and it&#8217;s an adaptation of the <a href="https://en.wikipedia.org/wiki/Row_hammer">classic rowhammer attack</a> to GPU memories. When multiple processes share the same GPU, a malicious process can mount an attack to flip bits in rows of memory it doesn&#8217;t have access to, sabotaging the performance of machine learning models running in a different process. 
Today, we&#8217;re going to talk about how it works, what it means, and how it should affect GPU design going forward.</p><h1>What is rowhammer, anyways?</h1><p><a href="https://en.wikipedia.org/wiki/Row_hammer">Rowhammer</a> is an attack on DRAM that manipulates the underlying physics of the memory cells to flip bits in rows of memory that an attacker shouldn&#8217;t have access to. In DRAM, each memory cell is implemented using one capacitor and one transistor, in a design called 1T1C memory. 
All of these capacitors are packed together so densely in modern DRAM that there&#8217;s a meaningful electrical coupling between capacitors in different rows of the memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bdf5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bdf5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 424w, https://substackcdn.com/image/fetch/$s_!bdf5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 848w, https://substackcdn.com/image/fetch/$s_!bdf5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 1272w, https://substackcdn.com/image/fetch/$s_!bdf5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bdf5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png" width="985" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:985,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bdf5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 424w, https://substackcdn.com/image/fetch/$s_!bdf5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 848w, https://substackcdn.com/image/fetch/$s_!bdf5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 1272w, https://substackcdn.com/image/fetch/$s_!bdf5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c45d83-d2ff-4d8d-a221-1c403a03e8a0_985x460.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf">If a malicious process repeatedly reads from the same row in memory</a>, it can disturb the values <a href="https://www.mdpi.com/1424-8220/24/2/592">stored in the memory cells next to that row</a>. 
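</p><p>To build intuition for the mechanism, here is a deliberately oversimplified charge-leakage model in Python; all constants are invented, and real disturbance effects are probabilistic and far more complex:</p>

```python
# Toy model: each activation of an aggressor row leaks a tiny amount of
# charge from the cells in physically adjacent rows. If a victim cell's
# charge falls below a threshold before the next refresh, its bit flips.
FULL_CHARGE = 1.0
FLIP_THRESHOLD = 0.5
LEAK_PER_ACTIVATION = 1e-5  # invented value, not a real DRAM parameter

def hammer(victim_charge, activations):
    """Return the victim cell's charge after `activations` aggressor-row
    reads, assuming no refresh happens in between."""
    return victim_charge - LEAK_PER_ACTIVATION * activations

# Normal operation: a few thousand reads between refreshes, no flip.
assert hammer(FULL_CHARGE, 5_000) > FLIP_THRESHOLD

# Hammering: hundreds of thousands of reads crammed into one refresh
# window drain the victim below threshold, flipping the stored bit.
assert hammer(FULL_CHARGE, 100_000) < FLIP_THRESHOLD
```

<p>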
Not only can this corrupt data that the malicious process doesn&#8217;t have access to, but it can even be used to <a href="https://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html">perform sophisticated attacks like privilege escalation</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kQu7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kQu7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 424w, https://substackcdn.com/image/fetch/$s_!kQu7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 848w, https://substackcdn.com/image/fetch/$s_!kQu7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 1272w, https://substackcdn.com/image/fetch/$s_!kQu7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kQu7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png" width="1456" height="395" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:395,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kQu7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 424w, https://substackcdn.com/image/fetch/$s_!kQu7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 848w, https://substackcdn.com/image/fetch/$s_!kQu7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 1272w, https://substackcdn.com/image/fetch/$s_!kQu7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd3be9ee-1c02-4961-a191-34d69ab6c8fa_1600x434.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>DRAM manufacturers have been working to mitigate rowhammer using error correcting codes, automatic refreshing of possible victim rows, and memory scrambling. At the same time, security researchers have been reverse engineering scrambling algorithms and finding more complex hammering patterns that can outwit DRAM mitigations. But thus far, all of the research around rowhammer has been focused on CPU memories.</p><h1>How is GPUHammer different?</h1><p>While GPUs also use DRAM, they don&#8217;t use the same DDR and LPDDR memory that CPUs do. Instead, they use a special kind of DRAM optimized for GPUs, called GDDR memory. GDDR has both a faster refresh rate and a higher latency than DDR and LPDDR memory, which makes hammering harder. At the same time, nobody has taken the time to reverse engineer memory scrambling for GPUs. 
So while rowhammer should conceptually work on a GPU, the GPUHammer authors are the first researchers to put in the legwork to overcome those obstacles and actually develop a real GPU rowhammer attack.</p><p>GPU memory scrambling is uniquely difficult, because Nvidia GPUs never expose physical memory addresses to user-level CUDA code. However, the GPUHammer authors used <a href="https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_pessl.pdf">memory latency patterns</a> to reverse engineer the physical memory addresses from the virtual memory addresses.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NZUF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NZUF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 424w, https://substackcdn.com/image/fetch/$s_!NZUF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 848w, https://substackcdn.com/image/fetch/$s_!NZUF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 1272w, https://substackcdn.com/image/fetch/$s_!NZUF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NZUF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png" width="1456" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NZUF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 424w, https://substackcdn.com/image/fetch/$s_!NZUF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 848w, https://substackcdn.com/image/fetch/$s_!NZUF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 1272w, https://substackcdn.com/image/fetch/$s_!NZUF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c4df4e0-9c24-40f8-902a-090033798ffd_1600x671.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Memory access latency versus sub-address, by bank</em></figcaption></figure></div><p>Another big challenge is that <a href="https://chipsandcheese.com/p/measuring-gpu-memory-latency">GPU memory accesses are slower than on CPUs</a>, which makes hammering much harder. If a single-threaded process repeatedly accesses the same memory, the accesses won&#8217;t come fast enough to cause a rowhammer effect. 
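</p><p>To see why, here is a back-of-envelope calculation in Python; every number below is an assumed, illustrative value, not a measurement from the paper or from any real GPU:</p>

```python
# Assumed, illustrative numbers -- not measurements of any real GPU.
REFRESH_WINDOW_S = 0.032           # time before a victim row gets refreshed
ACTIVATIONS_TO_FLIP = 500_000      # aggressor activations needed for a bit flip
DEPENDENT_READ_LATENCY_S = 400e-9  # one serialized GPU memory round-trip

# A single thread issuing one dependent read at a time can activate the
# aggressor row at most this many times inside a refresh window:
max_activations = REFRESH_WINDOW_S / DEPENDENT_READ_LATENCY_S  # roughly 80,000

# That falls well short of what's needed, so single-threaded hammering fails.
assert max_activations < ACTIVATIONS_TO_FLIP
```

<p>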
However, by leveraging <a href="https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/">warp-level parallelism</a>, multiple warps can issue memory reads simultaneously, generating enough hammering to actually flip bits in the victim rows.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wqiZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wqiZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 424w, https://substackcdn.com/image/fetch/$s_!wqiZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 848w, https://substackcdn.com/image/fetch/$s_!wqiZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 1272w, https://substackcdn.com/image/fetch/$s_!wqiZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wqiZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png" width="1456" height="208" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wqiZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 424w, https://substackcdn.com/image/fetch/$s_!wqiZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 848w, https://substackcdn.com/image/fetch/$s_!wqiZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 1272w, https://substackcdn.com/image/fetch/$s_!wqiZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75d626d-fa2b-49c5-8b59-678aaa8d3709_1600x229.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em>Single-threaded, multi-threaded, and multi-warp hammering.</em></figcaption></figure></div><p>These techniques let the researchers isolate which memory addresses to target with a rowhammer attack, and hammer those addresses hard enough to flip bits in adjacent rows. 
To prove how devastating this attack could be, they used the attack to flip the sign bit in the weights of a machine learning model running in a different process on the shared GPU. This reduced the accuracy of several machine learning models by over 80%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SkqN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SkqN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 424w, https://substackcdn.com/image/fetch/$s_!SkqN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 848w, https://substackcdn.com/image/fetch/$s_!SkqN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 1272w, https://substackcdn.com/image/fetch/$s_!SkqN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SkqN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png" width="1456" height="435" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:435,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SkqN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 424w, https://substackcdn.com/image/fetch/$s_!SkqN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 848w, https://substackcdn.com/image/fetch/$s_!SkqN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 1272w, https://substackcdn.com/image/fetch/$s_!SkqN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe02d121a-9a71-453c-8eb9-fc7322119d56_1600x478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Essentially, if an attacker can successfully carry out a GPUHammer attack, they can totally sabotage the output of an AI model running on a shared GPU.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/sabotaging-ai-models-with-gpuhammer?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/sabotaging-ai-models-with-gpuhammer?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/sabotaging-ai-models-with-gpuhammer?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>Does this matter?</h1><p>As AI becomes more and more important and entrenched in our lives, the ability to tamper-proof AI models also becomes important. Thankfully, though, GPUHammer isn&#8217;t going to suddenly break ChatGPT. Firstly, GPUHammer has thus far only been demonstrated on A6000 GPUs with GPDDR6. The large cloud H100 and B200 cloud GPUs used by state-of-the-art AI models leverage HBM that has stronger on-die error correction and rowhammer prevention. More importantly, though, most cloud services don&#8217;t offer shared GPU workloads; instead, each user gets a certain number of dedicated GPUs, which would mitigate this sort of attack.</p><p>However, there are cases where single GPUs run multiple processes concurrently! <a href="https://en.wikipedia.org/wiki/WebGPU">WebGPU</a> lets websites leverage GPU capabilities of a user&#8217;s computer. In theory, a website with WebGPU could perform rowhammer attacks when opened, corrupting data for other concurrent GPU processes including AI models. More generally, whenever multiple processes are sharing the same GPU, that could open the door for a GPUHammer vulnerability.</p><h1>What do we do now?</h1><p>GPUHammer just highlights the lack of attention that hardware security has gotten in the world of GPUs and AI accelerators. Most CPUs have supported complex trusted execution environments, secure boot protocols, inline memory encryption, and other security features for years. 
Security-conscious edge microcontrollers for applications in payments, crypto wallets, defense, and critical industry come with <a href="https://www.zach.be/p/the-most-secure-chip-in-the-world">even more protections against complex side-channel attacks</a>. CPU DDR memory has been engineered with rowhammer mitigations for almost a decade. But outside of <a href="https://www.zach.be/p/approximate-computing-for-secure">a small number of research papers</a>, building secure AI chips is not a major industry focus.</p><p>I think that&#8217;s a major problem. As AI starts to influence more and more aspects of our lives, we need to make sure that the hardware it&#8217;s running on is secure from first principles. AI chips need <a href="https://www.zach.be/p/novel-cryptography-needs-novel-hardware">post-quantum secure boot</a>, <a href="https://www.zach.be/p/approximate-computing-for-secure">side-channel protections</a>, secure memory, and other key features that are implemented in other high-value, security-conscious chips. If we can put that level of security into the chips powering <a href="https://www.zach.be/p/the-most-secure-chip-in-the-world">the credit card you use to buy a burrito</a>, we should be able to put it into the chips running AI models for the defense industry.</p>]]></content:encoded></item><item><title><![CDATA[Why did Tenstorrent just buy Blue Cheetah?]]></title><description><![CDATA[Is it 5D chess, or just a good price?]]></description><link>https://www.zach.be/p/why-did-tenstorrent-just-buy-blue</link><guid isPermaLink="false">https://www.zach.be/p/why-did-tenstorrent-just-buy-blue</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 07 Jul 2025 14:53:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oDkZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>News broke last week that <a href="https://tenstorrent.com/">Tenstorrent</a>, the AI startup run by the legendary CPU architect <a href="https://en.wikipedia.org/wiki/Jim_Keller_(engineer)">Jim Keller</a>, was acquiring the <a href="https://tenstorrent.com/en/vision/tenstorrent-acquires-blue-cheetah-analog-design">analog design startup Blue Cheetah</a>. I wasn&#8217;t surprised to see Blue Cheetah take a bit of a <a href="https://daslee.me/quick-thoughts-on-acquihiressoft-landings/">soft landing</a>; while their product portfolio is genuinely impressive, the <a href="https://www.zach.be/p/chiplets-the-future-of-semiconductor?utm_source=publication-search">chiplet market</a> simply isn&#8217;t mature enough to build a huge company there yet. 
I was much more surprised to see Tenstorrent as the buyer.</p><p>As recently as 2024, most of Tenstorrent&#8217;s sales came not from selling their Wormhole and Blackhole PCIe cards or chiplets, but from <a href="https://www.eetimes.com/tenstorrent-raises-693-million-series-d/">licensing their RISC-V-based AI chip designs to other companies</a>, especially in the automotive and robotics industries. So if most of Tenstorrent&#8217;s revenue comes from licensing IP, and Blue Cheetah&#8217;s IP isn&#8217;t valuable enough, given the immature chiplet market, to meaningfully add to Tenstorrent&#8217;s portfolio, why would they buy Blue Cheetah?</p><p>More broadly, I know a lot of people have questions about Tenstorrent&#8217;s business strategy. They have so many products: RISC-V IP, AI IP, two different PCIe cards, two different versions of a desktop workstation, and both datacenter and cloud product offerings as well. And this is in a world where most other AI chip startups are simply <a href="https://www.zach.be/p/why-groq-might-struggle-as-an-api">selling LLM inference by the token</a>! 
What&#8217;s even going on here?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oDkZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oDkZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 424w, https://substackcdn.com/image/fetch/$s_!oDkZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 848w, https://substackcdn.com/image/fetch/$s_!oDkZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 1272w, https://substackcdn.com/image/fetch/$s_!oDkZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oDkZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png" width="1040" height="574" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:574,&quot;width&quot;:1040,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oDkZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 424w, https://substackcdn.com/image/fetch/$s_!oDkZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 848w, https://substackcdn.com/image/fetch/$s_!oDkZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 1272w, https://substackcdn.com/image/fetch/$s_!oDkZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02f4f4d7-1615-42f4-b0b1-f832e46bbaf2_1040x574.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Part of this product confusion has to do with Tenstorrent&#8217;s complicated corporate history -- Jim Keller, the current CEO, was brought in as CTO years after Tenstorrent was founded, and re-molded the company based on his vision. But it&#8217;s unclear how Blue Cheetah&#8217;s IP portfolio will fit into Tenstorrent&#8217;s product portfolio, or if they simply made the acquisition opportunistically because Blue Cheetah was a company with great technology going through financial struggles.</p><h1>Why was Blue Cheetah struggling?</h1><p><a href="https://www.bcanalog.com/about/">Blue Cheetah Analog Design</a> was founded in 2018 by Dr. Elad Alon and Dr. Eric Chang, two UC Berkeley researchers who developed the <a href="https://bag3-readthedocs.readthedocs.io/en/latest/">Berkeley Analog Generator</a>, or BAG. Normally, analog design is a highly manual process, with engineers picking the size and shape of every transistor in a design by hand. BAG allowed designers to quickly produce custom analog designs without all of the manual effort.</p><p>But instead of trying to sell BAG as a tool, Blue Cheetah decided to build a company that used BAG internally to build analog blocks, and sell the analog designs that were produced. This sort of business model meant that they could deliver products that semiconductor companies were already comfortable licensing and integrating into their designs, <a href="https://www.zach.be/p/llms-will-power-next-gen-chip-ip">while maintaining much lower engineering costs</a>.</p><p>The problem arose when they decided which kind of analog designs they wanted to target BAG at. Blue Cheetah decided to primarily focus on die-to-die interfaces for <a href="https://www.zach.be/p/chiplets-the-future-of-semiconductor?utm_source=publication-search">chiplets</a>. I&#8217;m entirely speculating here, but I think they probably focused on chiplet interfaces due to the lack of competing IP in the chiplet space. 
If they were selling more conventional off-die interface IP, they&#8217;d be competing with giants like Broadcom and Cadence.</p><p>At the time, many people thought <a href="https://www.zach.be/p/chiplets-the-future-of-semiconductor?utm_source=publication-search">a chiplet revolution was coming soon</a>, but it hasn&#8217;t entirely played out that way. Chiplet-based architectures are very common, but they usually consist of multiple chiplets designed by a single company, rather than a heterogeneous set of chiplets connected like LEGOs. <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">High bandwidth memory</a> is the one major exception, but HBM uses a <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory#Interface">completely different die-to-die interface</a> than the <a href="https://opencomputeproject.github.io/ODSA-BoW/bow_specification.html">Bunch-of-Wires</a> and <a href="https://www.uciexpress.org/">UCIe</a> interfaces that chiplet IP vendors like Blue Cheetah were selling.</p><p>There are a lot of reasons why we don&#8217;t see the chiplet market growing significantly. There are concerns about which manufacturer is liable if one chiplet fails. Companies are worried about giving up some of their unique competitive advantages if they sell chiplets on the open market. The number of buyers for chiplets is still small. These are all structural challenges with the market that Blue Cheetah wasn&#8217;t equipped to solve, even though they had good technology.</p><p>I&#8217;m personally still optimistic about both automatic analog circuit generation and chiplets, and I&#8217;m especially optimistic about the idea of <a href="https://www.zach.be/p/llms-will-power-next-gen-chip-ip">building silicon IP companies using internal tooling to supercharge your margin structure</a>. 
Unfortunately, Blue Cheetah tied their fate to the success of the chiplet ecosystem before it was developed enough to support a company of their scale.</p><h1>Now, what about Tenstorrent?</h1><p>Tenstorrent has had a complicated history. <a href="https://www.eetimes.com/jim-keller-steps-into-ceo-role-at-tenstorrent/">Originally founded in 2016 by Ljubisa Bajic</a>, they were one of the &#8220;first wave&#8221; of AI chip startups, along with Groq, Cerebras, and SambaNova, to recognize that specialized compute for AI would be a large market. Ljubisa and the original Tenstorrent team architected <a href="https://www.techradar.com/pro/firm-headed-by-legendary-chip-architect-behind-amd-zen-finally-releases-first-hardware-days-after-being-selected-to-build-the-future-of-ai-in-japan-tenstorrent-unveils-grayskull-its-risc-v-answer-to-gpus">Grayskull</a>, their first chip designed for AI. 
Grayskull, as well as its successors <a href="https://tenstorrent.com/hardware/wormhole">Wormhole</a> and <a href="https://tenstorrent.com/hardware/blackhole">Blackhole</a>, leveraged <a href="https://www.eetimes.com/jim-keller-steps-into-ceo-role-at-tenstorrent/">SiFive X280 RISC-V AI CPUs</a> as control processors.</p><p>However, when <a href="https://www.prnewswire.com/news-releases/jim-keller-joins-tenstorrent-as-president-and-cto-301201505.html">Jim Keller joined Tenstorrent as CTO</a>, he started pushing the company towards developing its own RISC-V cores in-house, rather than relying on RISC-V cores licensed from <a href="https://www.sifive.com/">SiFive</a>, an external vendor. This marked a clear departure from Tenstorrent&#8217;s original goals of building AI chips. And that departure became even more pronounced when <a href="https://tenstorrent.com/ip/risc-v-cpu">Tenstorrent started licensing their RISC-V core externally</a>.</p><p>This was a confusing move. Usually, companies don&#8217;t want to both license chip IP cores and also sell chips. When a company makes both chips and IP, the chips they make could end up competing with the chips their licensees make, which can scare companies away from becoming potential licensees. But because Tenstorrent was just <em>so passionate about RISC-V</em>, they did it anyway. <a href="https://tenstorrent.com/ip/tensor-neo">They also licensed their AI cores externally</a>.</p><p>While this move was confusing, it did seem to pay off for the company. When they closed a <a href="https://www.eetimes.com/tenstorrent-raises-693-million-series-d/">$600M+ fundraise in 2024</a>, most of their large deals were <a href="https://www.eetimes.com/tenstorrent-raises-693-million-series-d/">IP licensing agreements</a>, rather than sales of physical chips. 
And apparently, a big reason Tenstorrent claims they were able to close these licensing deals was their <a href="https://tenstorrent.com/software/tt-metalium">open source software stack</a>. This, in turn, makes Tenstorrent&#8217;s foray into <a href="https://tenstorrent.com/hardware/tt-quietbox">desktop workstations</a> make a little bit more sense, at least.</p><p>If Tenstorrent&#8217;s IP licensees want an open-source compiler, there need to be open-source developers to work on that compiler. And what do those developers need? <a href="https://tenstorrent.com/hardware/tt-quietbox">Workstations</a>. Do most of them want workstations that cost <em><a href="https://tenstorrent.com/hardware/tt-quietbox">twelve thousand dollars</a>?</em> Probably not, but apparently Tenstorrent thinks that this is the best way to get their hardware into the hands of open-source developers.</p><p>Tenstorrent seems to be running a weird strategy, but at least it makes some sense: make PCIe cards, desktop workstations, and an <a href="https://tenstorrent.com/hardware/cloud">evaluation cloud</a> to get developers to support your open source ecosystem, and then use that ecosystem to sell RISC-V and AI IP to customers in the <a href="https://tenstorrent.com/vision/rapidus-and-tenstorrent-partner-to-accelerate-development-of-ai-edge-device-domain-based-on-2nm-logi">automotive industry in Japan</a>. 
It&#8217;s unclear if this strategy will pay off, but this pivot from dedicated chips to RISC-V IP may have been one of the reasons why Ljubisa Bajic and some of the original Tenstorrent founding team left to go found <a href="https://taalas.com/">Taalas</a>, an absolutely wild AI chip startup that could be the subject of an entire article on its own.</p><p>But even if Tenstorrent&#8217;s strategy pays off, it&#8217;s unclear how Blue Cheetah factors into that strategy.</p><h1>Does Blue Cheetah add anything to Tenstorrent?</h1><p>As far as I can tell, Tenstorrent&#8217;s actual PCIe chips aren&#8217;t all that impressive, and mostly serve as a platform for developers to contribute to the open-source compiler ecosystem that drives Tenstorrent&#8217;s IP sales. So if Tenstorrent primarily acquired Blue Cheetah to put Blue Cheetah IP inside of Tenstorrent&#8217;s chips, it doesn&#8217;t make all that much sense, as chip sales aren&#8217;t a major driver of Tenstorrent&#8217;s revenue.</p><p>But it also doesn&#8217;t make sense for Tenstorrent to acquire Blue Cheetah just to continue licensing Blue Cheetah&#8217;s IP. As I just discussed, Blue Cheetah&#8217;s IP, despite being high-quality, isn&#8217;t driving enough revenue to support the company due to the immature chiplet market. So why did Tenstorrent buy Blue Cheetah? Was it just such a good deal due to Blue Cheetah&#8217;s financial struggles that they couldn&#8217;t say no?</p><p>Well, maybe. But also, Blue Cheetah&#8217;s technology could factor into Tenstorrent&#8217;s plans to sell chiplets rather than just selling IP. 
They&#8217;re working on selling off-the-shelf chiplets for both their <a href="https://tenstorrent.com/vision/tenstorrent-risc-v-and-chiplet-technology-selected-to-build-the-future-of-ai-in-japan">RISC-V</a> and <a href="https://tenstorrent.com/vision/tenstorrent-selects-samsung-foundry-to-manufacture-next-generation-ai-chiplet">AI cores</a> that can be integrated into larger systems for automotive and robotics applications. These chiplets need die-to-die interfaces in multiple different process nodes including <a href="https://tenstorrent.com/vision/tenstorrent-selects-samsung-foundry-to-manufacture-next-generation-ai-chiplet">Samsung&#8217;s SF4X</a> and <a href="https://tenstorrent.com/vision/tenstorrent-risc-v-and-chiplet-technology-selected-to-build-the-future-of-ai-in-japan">Rapidus&#8217;s new 2nm node</a>. Blue Cheetah&#8217;s analog design automation technology may be a key way for Tenstorrent to deliver these chiplet-based designs.</p><p>Could Tenstorrent just license Blue Cheetah&#8217;s IP to build their chiplets? Sure. But when Blue Cheetah is going through financial struggles, it may be easier to buy the whole company than to license some of its products. It&#8217;s a weird reality of startups, but I think it&#8217;s the most likely motivation for this acquisition.</p>]]></content:encoded></item><item><title><![CDATA[Why is it so hard for startups to compete with Cadence?]]></title><description><![CDATA[And is Morris Chang&#8217;s daughter single?]]></description><link>https://www.zach.be/p/why-is-it-so-hard-for-startups-to</link><guid isPermaLink="false">https://www.zach.be/p/why-is-it-so-hard-for-startups-to</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Tue, 01 Jul 2025 14:40:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ueCk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you talk to any chip designer, they&#8217;ll complain about <a href="https://www.cadence.com/en_US/home.html">Cadence&#8217;s chip design software</a>. They&#8217;ll probably complain about <a href="https://www.synopsys.com/">Synopsys</a> too, and maybe <a href="https://eda.sw.siemens.com/en-US/">Mentor Graphics</a> (which is now owned by Siemens, but everybody still calls them Mentor Graphics). These three companies have an oligopoly on chip design software, despite their software being slow, difficult to use, segfaulting constantly, and having user interfaces that look like they were designed in 2003. But if everybody hates these electronic design automation (EDA) tools so much, why hasn&#8217;t a cool new startup disrupted the industry?</p><p>Well, startups have been trying for years. 
Way back in 2007, <a href="https://www.eetimes.com/synopsys-vet-heads-sim-startup-xoomsys/">Xoomsys</a> developed parallel SPICE simulators to greatly speed up analog and mixed-signal simulation, but never reached mainstream adoption. In 2011, <a href="http://www.eejournal.com/article/a-brief-and-personal-history-of-eda-part-5-the-acquisition-era/">Extreme DA was acquired by Synopsys</a> for their multi-core static timing analysis tools. <a href="http://www.eejournal.com/article/a-brief-and-personal-history-of-eda-part-5-the-acquisition-era/">Rocketick Technologies met a similar fate in 2016</a>, being acquired by Cadence for their multi-core RTL simulator. And in 2025, <a href="https://partcl.com/">Partcl</a> is building GPU-accelerated chip placement and signoff tools.</p><p>In the past, all of these startups either failed or got acquired for $100-500M by the big players like Cadence or Synopsys, and were folded into their product portfolios. None of them grew to the size and scale where they could actually break up the Cadence-Synopsys-Mentor oligopoly. The biggest culprit has nothing to do with technology, and everything to do with the unique relationship Cadence, Synopsys, and Mentor have with bleeding-edge fabs like TSMC and Samsung.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>The EDA <s>Alliance</s> Axis</h1><p>TSMC is probably the most important fab in the world. The most valuable chips, from Apple&#8217;s M4 CPUs to NVidia B200 GPUs, are made in their foundries in Taiwan. And TSMC maintains close relationships with a small number of EDA vendors, dubbed the &#8220;<a href="https://www.tsmc.com/english/dedicatedFoundry/oip/eda_alliance">EDA Alliance</a>&#8221;. Some of these vendors have their tools officially certified to work with TSMC processes. For example, here are the certifications for TSMC&#8217;s 3nm process:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ueCk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ueCk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 424w, https://substackcdn.com/image/fetch/$s_!ueCk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 848w, 
https://substackcdn.com/image/fetch/$s_!ueCk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 1272w, https://substackcdn.com/image/fetch/$s_!ueCk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ueCk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png" width="1198" height="680" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:680,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ueCk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 424w, https://substackcdn.com/image/fetch/$s_!ueCk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 848w, 
https://substackcdn.com/image/fetch/$s_!ueCk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 1272w, https://substackcdn.com/image/fetch/$s_!ueCk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd29926e-9a43-4504-b7fc-87f5ace0fff0_1198x680.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Even the third largest EDA vendor in the world, Siemens, doesn&#8217;t have complete certification coverage! 
It&#8217;s incredibly hard for a startup to get a single tool qualified, let alone multiple. And when chip designers working in modern process nodes spend millions of dollars on mask sets and manufacturing runs, they are very unlikely to take a risk on a tool that isn&#8217;t officially certified and supported by TSMC.</p><p>TSMC is a huge company with little incentive to work with startups to certify their tools. When people ask me about competing with Cadence, my tongue-in-cheek response is that the best first step would be to marry one of the daughters of <a href="https://en.wikipedia.org/wiki/Morris_Chang">Morris Chang, TSMC&#8217;s legendary founder</a>. It&#8217;s a joke, but it rings a bit true -- to make waves in the EDA industry, you need your tools to get certified, and for that, you need to somehow curry favor with TSMC.</p><p>This problem is exacerbated by the sheer complexity of modern process nodes. Building static timing analysis tools for the 28nm process node back in 2010 was much, much easier than building the same tool for the modern 2nm process. Modern tools need massive parallelism to fit large designs, and require complex features to incorporate the effects of local statistical variation on the behavior of every individual transistor in the design. It&#8217;s simply a lot harder to build a good tool to support a modern process node.</p><p>Some tools developed by startups have had some success though! How did they pull it off? And more importantly, how can new startups replicate that same success?</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-is-it-so-hard-for-startups-to?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-is-it-so-hard-for-startups-to?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/why-is-it-so-hard-for-startups-to?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>The Startup Rebels</h1><p>If developing a fully certified and qualified tool as a startup is so hard, how&#8217;s a startup to take on the EDA Alliance? Well, even if every chip designer wants their final signoff flow to rely on certified and qualified tools, that doesn&#8217;t mean that they can <em>only</em> use those tools.</p><p>Most larger chip companies maintain a redundant suite of EDA tools, ranging from simulators to sign-off tools. If your tool is faster, but not certified, it may find a niche in a certain part of the chip design flow. For example, an incredibly fast synthesis and timing analysis tool may enable much faster design space exploration, enabling chip designers to estimate whether their design will meet timing without having to leverage the expensive and slow but fully-qualified tools from Cadence and Synopsys.</p><p>Some companies are taking this idea even further. <a href="https://www.silimate.com/">Silimate</a> offers an AI-powered PPA prediction tool that can provide rough estimates of a chip&#8217;s power consumption and timing characteristics far faster than a conventional synthesis and signoff tool suite can. Obviously a prediction tool will never be qualified the same way that a signoff tool could be, but it proves the value of fast iteration times independent of signoff-ready certification. 
Perhaps tools like Partcl could end up in this &#8220;complementary tools&#8221; category at first.</p><p>This strategy has two major downsides, though. The primary buyers of these sorts of tools are companies that already maintain a suite of redundant EDA tools, and those buyers are usually large companies. Chip design startups simply don&#8217;t have the budget to own multiple versions of the same tool. That means that EDA startups selling these &#8220;complementary&#8221; tools won&#8217;t be able to reap the benefits of <a href="https://www.zach.be/p/why-do-so-many-startups-sell-to-other">selling to other startups</a>. It could also paint a target on their back that&#8217;s clearly visible to Cadence and Synopsys. There&#8217;s a reason why so many EDA startups either <a href="http://www.eejournal.com/article/a-brief-and-personal-history-of-eda-part-5-the-acquisition-era/">get acquired for less than $500M</a>, or get sued into oblivion by major EDA companies.</p><p>Ultimately, if a startup wants to actually take down Cadence and Synopsys, they&#8217;ll need to get certified by TSMC, Samsung, GlobalFoundries, and all the other major fabs. But in the meantime, they can build market share by focusing less on replacing the existing qualified tools and more on complementing them. And if they can withstand the IP lawsuits and enticing acquisition offers, they may even be able to meaningfully compete with the big EDA companies. And all that without marrying Morris Chang&#8217;s daughter.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why did Untether AI fail?]]></title><description><![CDATA[Another one bites the dust.]]></description><link>https://www.zach.be/p/why-did-untether-ai-fail</link><guid isPermaLink="false">https://www.zach.be/p/why-did-untether-ai-fail</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 16 Jun 2025 13:50:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week, news broke that <a href="https://www.untether.ai/">Untether AI</a>, an AI inference chip startup, is shutting down. <a href="https://www.eetimes.com/untether-ai-shuts-down-engineering-team-joins-amd/">Their engineering team is joining AMD</a>, but their core product lines of AI inference chips and software aren&#8217;t coming with them to their acquirer. Obviously, I hope this is a good outcome for the engineers at Untether, but I can&#8217;t say I&#8217;m particularly surprised. While Untether had good technology and <a href="https://www.eetimes.com/amd-and-untether-take-on-nvidia-in-mlperf-benchmarks/">impressive performance for small neural networks</a>, they ran into the same pitfall that many AI chip startups have in the past few years. 
If a company isn&#8217;t focused on large, generative models like diffusion models or LLMs, <a href="https://www.zach.be/p/why-hasnt-there-been-a-big-winner">there&#8217;s not a huge market for their chip</a>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Untether&#8217;s First Chip</h1><p>Untether was <a href="https://www.untether.ai/about/">founded in 2018</a> -- notably, before the generative AI boom kicked off by ChatGPT and StableDiffusion. Their original goal was to <a href="https://www.untether.ai/about/">&#8220;untether&#8221; AI from datacenters</a> and enable AI inference at the <a href="https://www.fortinet.com/resources/cyberglossary/network-edge">network edge</a>. Notably, this is a bit different from what some other <a href="https://www.zach.be/p/why-hasnt-there-been-a-big-winner">edge AI startups were working on at the time</a>; Untether didn&#8217;t want to put their AI chips in edge devices like laptops and ultra-low-power sensors. Instead they wanted to bring edge AI to distributed servers at the periphery of a network. 
For datacenter AI workloads like <a href="https://ai.meta.com/blog/dlrm-an-advanced-open-source-deep-learning-recommendation-model/">recommender models</a>, avoiding routing requests to the network core could reduce latency and improve reliability.</p><p>Over time, though, that mission expanded. <a href="https://www.untether.ai/untether-ai-raises-13-million-from-intel-capital-and-other-investors-to-accelerate-ai-innovation/">Their 2019 fundraising announcement</a> focused on automotive and traditional cloud use-cases, alongside a mention of more conventional battery-powered edge devices. At the time, most AI chip startups were either focusing purely on high-power datacenter chips, or ultra-low-power chips for always-on sensors, so it made sense for Untether to focus on niches that other players were avoiding. And their first chip, runAI, followed through on their promises, <a href="https://www.untether.ai/untether-ai-ushers-in-the-petaops-era-with-at-memory-computation-for-ai-inference-workloads/">delivering up to 2 PetaOps per accelerator card and as much as 8 TOPs/watt.</a></p><h1>The rise of GenAI</h1><p>Untether&#8217;s second chip, speedAI, was released in 2022, delivering a significant efficiency improvement -- <a href="https://www.untether.ai/untether-ai-unveils-its-second-generation-at-memory-compute-architecture-at-hot-chips-2022/">30 TOPs/watt compared to runAI&#8217;s 8 TOPs/watt</a>. But it was also released at an unfortunate time. SpeedAI was announced just months before the release of ChatGPT, and the refocusing of the entire AI industry around large language models and generative AI. With no HBM, mediocre chip-to-chip connectivity, and limited on- and off-chip memory, speedAI was doomed to be an also-ran in the LLM inference world.</p><p>So, Untether doubled down on vision inference, a market that their chips could still succeed in. 
<a href="https://www.untether.ai/untether-ai-and-general-motors-to-develop-next-generation-autonomous-vehicle-perception-systems/">They partnered with GM</a> to work on autonomous vehicle systems. They released reference applications for their tsn200 accelerator card <a href="https://www.untether.ai/untether-ai-ships-the-tsunaimi-tsn200-accelerator-card-delivering-high-performance-inference-beyond-the-datacenter/">focused on &#8220;smart city&#8221; use cases like pedestrian and vehicle detection</a>. And they kept doubling down on vision and sensing applications, <a href="https://www.untether.ai/untether-ai-announces-collaboration-with-arm-to-deliver-high-performance-energy-efficient-solutions/">collaborating with ARM to develop automotive solutions for driver assistance and self-driving cars</a>, and taking aim at markets in <a href="https://www.untether.ai/untether-ai-announces-speedai-accelerator-cards-as-worlds-highest-performing-most-energy-efficient-ai-accelerators-according-to-mlperf-benchmarks/">robotics, surveillance, agriculture, and machine inspection</a>.</p><p>But while Untether was focusing on vision applications for surveillance cameras, the world was moving on. Large language models became the darlings of the AI world, while computer vision became an afterthought, and for good reason. AI-powered surveillance cameras and cars were still a relatively small market, and I still don&#8217;t fully understand what a &#8220;smart city&#8221; even is. On the other hand, LLMs were clearly disrupting huge industries, and quickly. Because of this, AI chip startups were starting to focus less on vision and convolutional networks, and more on accelerating LLMs. And building LLM accelerators requires <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective">different considerations than vision accelerators</a>. Running an LLM quickly simply requires building a large, fast matrix multiply, and keeping it supplied with as much data as possible. 
Untether&#8217;s chips were good at matrix multiplication, but didn&#8217;t have the memory capacity, memory hierarchy, or chip-to-chip connectivity to actually tackle LLMs.</p><p>When Untether finally announced MLPerf results in 2024, <a href="https://www.eetimes.com/amd-and-untether-take-on-nvidia-in-mlperf-benchmarks/">they focused on ResNet-50 inference performance</a>. While their competitors were releasing impressive tokens-per-second results for LLama-70B, Untether was bragging about power efficiency on a <a href="https://arxiv.org/pdf/1512.03385">model from 2015</a>.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-did-untether-ai-fail?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-did-untether-ai-fail?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/why-did-untether-ai-fail?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>AI chip startups need to see the future</h1><p>Untether had good technology. They had a fantastic team -- there&#8217;s a reason AMD hired all of them. But they missed one of the biggest developments in the AI world: the rise of large generative models. 
If they had started a couple of years later, and hadn&#8217;t taped out their second-generation silicon before the launch of ChatGPT, things could have easily turned out differently.</p><p>This is one of the reasons why I think the most important thing for an AI chip startup to do is keep an ear to the ground for future developments in AI, and to build chips that are flexible enough to run whatever might be coming down the pipeline. Untether AI bet that vision models at the network edge were the future, and built chips that weren&#8217;t flexible enough to run transformers. <a href="https://www.zach.be/p/stop-trying-to-make-etched-happen">Maybe the companies focusing exclusively on transformers</a> will meet a similar fate soon enough.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why is SambaNova giving up on AI training?]]></title><description><![CDATA[The one where Zach gets sued.]]></description><link>https://www.zach.be/p/why-is-sambanova-giving-up-on-ai</link><guid isPermaLink="false">https://www.zach.be/p/why-is-sambanova-giving-up-on-ai</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 05 May 2025 20:42:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Qv3s!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F9bc9753e-da17-4a33-8e22-b4ef3f3d9e07_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Disclaimer: I worked at SambaNova Systems between 2019 and 2021, and so I am going to be somewhat careful about discussing sensitive information. Despite the tongue-in-cheek subtitle, I don&#8217;t want to get in legal trouble. Please don&#8217;t sue me!</em></p><p>In late April, one of the most well-funded AI chip startups out there, SambaNova Systems, <a href="https://www.eetimes.com/sambanova-lays-off-15-of-workforce-to-refocus-on-inference/">pivoted away significantly from their original goal</a>. Like many other AI chip startups, SambaNova wanted to offer a unified architecture for both training and inference. But, as of this year, they&#8217;ve given up on their training ambitions, laid off 15% of their workforce, and are focusing entirely on AI inference. 
And they&#8217;re not the first company to be making this pivot.</p><p><a href="https://www.electronicdesign.com/markets/automation/article/21805844/groq-outlines-potential-power-of-artificial-intelligence-chip">In 2017, Groq was bragging about their training performance</a>, but by <a href="https://www.prnewswire.com/news-releases/groq-first-to-announce-performance-advantage-results-with-stac-ml-markets-inference-benchmark-meeting-needs-of-financial-services-industry-301664184.html">2022, they were entirely focused on inference benchmarks</a>. The <a href="https://www.nextplatform.com/2024/09/10/the-battle-begins-for-ai-inference-compute-in-the-datacenter/">Cerebras CS-1 was originally sold primarily for training workloads</a>, but the <a href="https://www.nextplatform.com/2024/09/10/the-battle-begins-for-ai-inference-compute-in-the-datacenter/">CS-2 and later shifted their focus to inference</a>. SambaNova seemed to be the last holdout from that first generation of AI chip startups to still seriously focus on training, but that&#8217;s finally changed. So, why are all of these startups pivoting from training to inference? Luckily, as somebody who worked at SambaNova, I have a bit of an insider&#8217;s perspective.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Training was always part of the plan</h1><p>SambaNova was very serious about training models on their hardware. They put out articles about <a href="https://sambanova.ai/blog/training-long-sequence-size-models-on-sambanova">how to train on their hardware</a>, <a href="https://sambanova.ai/blog/world-record-large-language-models-training-performance-sambanova-dataflow-as-a-service-gpt">bragged about their training performance</a>, and <a href="https://docs.sambanova.ai/sambastudio/latest/training.html">addressed training in their official documentation</a>. A lot of analysts and outside observers, including me, saw the ability to tackle both the inference and training markets with one chip as a unique edge that SambaNova had over competitors like Groq, <a href="https://www.forbes.com/sites/karlfreund/2021/02/25/the-cambrian-ai-landscape-groq/">which was one of the earliest startups to pivot to inference</a>.</p><p>SambaNova also invested significant time and effort into enabling efficient training. When I was at the company between 2019 and 2021, I spent a considerable amount of time implementing a kernel for the <a href="https://keras.io/api/optimizers/Nadam/">NAdam optimizer</a>, a momentum-based optimizer commonly used to train large neural networks. We had hardware and software features designed and optimized for training, and both internal and external messaging indicated that support for training was a key part of our value proposition.</p><p>Now, all of a sudden, SambaNova is essentially scrapping most of that work to focus entirely on inference. 
I think they&#8217;re doing this for three main reasons: inference is an easier problem to tackle, inference may represent a larger market than training, and Nvidia utterly dominates the world of AI training chips.</p><h1>Inference is an easier, larger market.</h1><p>Many analysts believe that the market size for AI inference <a href="https://www.nextplatform.com/2024/09/10/the-battle-begins-for-ai-inference-compute-in-the-datacenter/">could be ten times bigger than the market for AI training</a>. Intuitively, this makes sense. Normally, you only train a model once, and then perform inference using that model many, many, many times. Each time you run inference, it costs far, far less than the entire training process for a model &#8212; but if you run inference using the same model enough times, it becomes the dominant cost when serving that model. If the future of AI is a small number of large models, each with significant inference volume, the <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq">inference market will dwarf the training market</a>. But if many organizations end up training their own bespoke models, <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq">this future may not come to pass</a>.</p><p>But even if inference doesn&#8217;t pan out to be a much larger market than training, there are technical reasons why inference is easier to tackle for AI chip startups. When training a model, you need to run a bunch of training data through that model, collect gradient information during the model&#8217;s operation, and use those gradients to update the model&#8217;s weights. This process is what allows the model to learn. 
It&#8217;s also extremely memory-intensive, as you need to cache all of those gradients as well as other values, like the model&#8217;s activations.</p><p>So, to efficiently perform training, you need a complex memory hierarchy with on-die SRAM, in-package HBM, and off-chip DDR. <a href="https://www.nextplatform.com/2024/11/06/we-cant-get-enough-hbm-or-stack-it-up-high-enough/">It&#8217;s hard for AI startups to get their hands on HBM</a>, and <a href="https://www.zach.be/p/ai-chip-startups-dont-need-to-be">hard to integrate HBM into a high-performance system</a> -- so many AI chips like <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq">Groq</a> and <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective">d-Matrix</a> just don&#8217;t feature the necessary HBM or DDR capacity or bandwidth to efficiently train large models. Inference doesn&#8217;t have this problem. During the inference process, gradients don&#8217;t need to be stored, and activations can be discarded after they&#8217;re used. This vastly reduces the memory footprint of inference as a workload, and reduces the complexity of the memory hierarchy inference-only chips need.</p><p>Another challenge is inter-chip networking. All of those gradients generated during training need to be synchronized across every chip used in the training process. That means you need a <a href="https://www.zach.be/p/softbank-is-building-an-ai-datacenter">large, complex, all-to-all network</a> to efficiently run training. 
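</p><p>As a rough back-of-the-envelope sketch of both demands, consider a hypothetical 70-billion-parameter model trained across 1,024 chips with the Adam optimizer and a ring all-reduce. Every number here is an illustrative assumption, not any vendor&#8217;s real figure:</p>

```python
# Rough training-vs-inference footprint for an illustrative 70B-parameter
# model in 16-bit precision. All numbers are assumptions for this sketch.
params = 70e9
bytes_per_param = 2  # fp16/bf16

weights = params * bytes_per_param   # needed for both training and inference
grads = params * bytes_per_param     # training only
adam_states = params * 4 * 2         # two fp32 moment buffers per parameter

train_gb = (weights + grads + adam_states) / 1e9  # activations omitted here
infer_gb = weights / 1e9
print(f"training state: ~{train_gb:.0f} GB, inference weights: ~{infer_gb:.0f} GB")

# Gradient synchronization per step with a ring all-reduce over N chips:
# each chip sends roughly 2 * (N - 1) / N times the gradient volume.
n_chips = 1024
sync_gb = 2 * (n_chips - 1) / n_chips * grads / 1e9
print(f"all-reduce traffic per chip, per step: ~{sync_gb:.0f} GB")
```

<p>Even this crude count, which ignores activations entirely, shows training needing several times the memory of inference plus hundreds of gigabytes of interconnect traffic on every step.</p><p>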
Inference, on the other hand, is a feed-forward operation, with each chip only talking to the next chip in the inference pipeline.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq">Many startups&#8217; AI chips have limited networking capabilities</a>, which makes them poorly suited for the all-to-all connectivity that training requires, but sufficient for inference workloads. Nvidia, meanwhile, has addressed both the memory and networking challenges of AI training extremely well.</p><div class="captioned-button-wrap" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-is-sambanova-giving-up-on-ai?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="CaptionedButtonToDOM"><div class="preamble"><p class="cta-caption">Thanks for reading zach's tech blog! This post is public so feel free to share it.</p></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.zach.be/p/why-is-sambanova-giving-up-on-ai?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.zach.be/p/why-is-sambanova-giving-up-on-ai?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p></div><h1>Nvidia is extremely good at training.</h1><p>Nvidia has been the hardware of choice for both inference and training since the <a href="https://en.wikipedia.org/wiki/AlexNet">AlexNet days in 2012</a>. Because of the versatility CUDA grants GPUs, they&#8217;re capable of performing all of the necessary operations for both training and inference.
And in the past decade, Nvidia hasn&#8217;t just been <a href="http://www.zach.be/p/stop-trying-to-make-etched-happen">building hyper-optimized chips for machine learning workloads</a>; they&#8217;ve also been <a href="https://www.zach.be/p/softbank-is-building-an-ai-datacenter">optimizing their entire memory and networking stack for large-scale training and inference</a>.</p><p>With access to <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective">significant amounts of HBM on every die</a>, Nvidia hardware is able to easily and efficiently cache all of the gradient updates generated by each training step. And with scale-up technologies like <a href="https://en.wikipedia.org/wiki/NVLink">NVLink</a> and scale-out technologies like <a href="https://en.wikipedia.org/wiki/InfiniBand">InfiniBand</a>, Nvidia hardware is able to handle the all-to-all networking required to update all of the weights of a large neural network after each training step is complete. Inference-only competitors like <a href="https://www.zach.be/p/why-is-everybody-talking-about-groq?utm_source=publication-search">Groq</a> and <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective">d-Matrix</a> simply lack the memory and networking capabilities to compete with Nvidia on training.</p><p>But <a href="https://www.eetimes.com/sambanova-adds-hbm-for-llm-inference-chip/">SambaNova chips do have HBM</a>. SambaNova chips <a href="https://sambanova.ai/blog/sn40l-chip-best-inference-solution">have a peer-to-peer network</a> at both the server and rack level. Why can&#8217;t they tackle training the way that Nvidia can?</p><p>Well, it turns out that Nvidia has more than just HBM and networking to give them a leg up on training performance.
<a href="https://www.zach.be/p/how-do-nvidia-blackwell-gpus-train">They&#8217;ve put significant effort into low-precision training</a>, and top AI labs, in turn, have put considerable effort into tuning algorithm hyper-parameters to work well with the specific intricacies of Nvidia&#8217;s low-precision training hardware. Shifting from Nvidia to SambaNova chips for training requires changing extremely sensitive training code to run on entirely new hardware with an entirely new set of pitfalls. The cost and risk of doing that for a large, GPT-4-scale model is immense.</p><p>SambaNova&#8217;s pivot to inference is proof that, even if an AI chip startup manages to offer memory and networking capabilities competitive with Nvidia&#8217;s, it&#8217;s not enough to take on the green giant in the training market. If a startup wants to challenge Nvidia on training, they need to offer such impressive training performance that they can overcome Nvidia&#8217;s inertia. And so far, nobody&#8217;s been able to pull that off.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is an oversimplification, but serves to illustrate the different networking requirements for training and inference.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Why is it taking so long to build new IP cores?]]></title><description><![CDATA[Standards are getting more complex and harder to satisfy.]]></description><link>https://www.zach.be/p/why-is-it-taking-so-long-to-build</link><guid isPermaLink="false">https://www.zach.be/p/why-is-it-taking-so-long-to-build</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Mon, 28 Apr 2025 14:50:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FmEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The USB4 specification was <a href="https://en.wikipedia.org/wiki/USB4">first released in 2019</a>. It took a year for the <a href="https://news.synopsys.com/2020-06-03-Synopsys-Introduces-Industrys-First-Complete-USB4-IP-Solution">first USB4 IP cores to hit the market</a>. Even worse, it wasn&#8217;t until 2025 that the <a href="https://www.synopsys.com/blogs/chip-design/usb4-device-ip-edge-ai.html">first USB4 IP was fully certified by the USB Implementers Forum</a>.
And USB4 isn&#8217;t the only connectivity protocol where there was a major lag between the specification being finalized and the first IP being available.</p><p><a href="https://web.archive.org/web/20170207150136/https://www.techrepublic.com/article/new-pci-express-4-0-delay-may-empower-next-gen-alternatives/">PCIe 4.0 was notably delayed</a>; the PCIe 4.0 standard was <a href="https://en.wikipedia.org/wiki/PCI_Express">released in 2017</a> but it took AMD and Intel until 2019 and 2020, respectively, to ship CPUs with PCIe 4.0 support. It got to a point where <a href="https://en.wikipedia.org/wiki/Gen-Z_(consortium)#:~:text=The%20effort%20followed%20years%20of,formed%20to%20work%20on%20an">a new industry consortium spun up to find alternatives</a> to the delayed standard. Similar stories exist for DisplayPort 2.0: it was released in 2019, but major vendors like Synopsys <a href="https://www.synopsys.com/designware-ip/interface-ip/displayport.html#products">don&#8217;t even offer DisplayPort 2.0 cores at all</a> in 2025.</p><p>And it&#8217;s not just connectivity IP cores that often experience major delays from specification to commercial release. The same is true for video codecs: <a href="https://en.wikipedia.org/wiki/High_Efficiency_Video_Coding">HEVC H.265 was released in January 2013</a>, but IP <a href="https://www.design-reuse.com/news/33893/hardware-hevc-h-265-encoder-ip.html">didn&#8217;t hit the market</a> until <a href="https://www.prnewswire.com/news-releases/squid-systems-video-codec-hardware-ip-available-for-licensing-270007291.html">2014</a>, and only from small, specialized video IP vendors. AV1 was <a href="https://en.wikipedia.org/wiki/AV1">standardized in 2018</a>, and it took <a href="https://www.prnewswire.com/news-releases/av1-encoder-and-decoder-hardware-ips-available-from-allegro-dvt-and-embedded-in-products-by-end-of-2020-301038104.html">until late 2020 for IP to hit the market</a>. 
VVC H.266 was <a href="https://en.wikipedia.org/wiki/Versatile_Video_Coding">standardized in 2020</a>, and IP wasn&#8217;t released <a href="https://www.allegrodvt.com/news/allegro-dvt-industrys-real-time-vvc-h-266-encoder-ip/">until 2024</a>.</p><p>In the last decade or so, we&#8217;ve seen more and more examples of commercial IP releases lagging significantly behind the release of corresponding specifications and standards. This is becoming a major problem; widely available IP cores are necessary for an ecosystem to grow around a new standard. For example, if DisplayPort 2.0 IP isn&#8217;t available, it becomes harder for CPU, GPU, and display vendors to justify building DP 2.0 capabilities into their devices, because other devices consumers own won&#8217;t support DP 2.0. The same is true for video codecs, cryptographic algorithms, and memory interfaces.</p><p>But why is it taking so much longer to build new IP cores?</p><h1>Standards are getting more complex</h1><p>First off, the standards that IP vendors are expected to implement are getting more and more complex over time.
Let&#8217;s take a look at a state machine from <a href="https://ieeemilestones.ethw.org/File:USB_1.0_Specification.pdf">the USB 1.0 specification</a>, released in 1996. The entire spec is a little over 250 pages long, and includes everything you need to know about USB 1.0, from the logical protocol to the electrical specifications to the mechanical drawings of the connector. This state machine handles &#8220;bit stuffing&#8221;, which ensures that the data doesn&#8217;t have too many consecutive 1&#8217;s or 0&#8217;s.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FmEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FmEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 424w, https://substackcdn.com/image/fetch/$s_!FmEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 848w, https://substackcdn.com/image/fetch/$s_!FmEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 1272w, https://substackcdn.com/image/fetch/$s_!FmEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FmEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png" width="378" height="788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:378,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FmEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 424w, https://substackcdn.com/image/fetch/$s_!FmEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 848w, https://substackcdn.com/image/fetch/$s_!FmEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 1272w, https://substackcdn.com/image/fetch/$s_!FmEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F102c4a13-57b3-4854-943a-7f4f93d37d61_378x788.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://ieeemilestones.ethw.org/w/images/4/44/USB_1.0_Specification.pdf">USB Bit Stuffing Diagram</a></figcaption></figure></div><p>This single diagram includes basically all of the information you need to implement bit stuffing on USB 1.0. It&#8217;s a simple and effective solution for the data transfer speeds that USB 1.0 runs at.</p><p>USB 4.0, on the other hand, <a href="https://www.usb.org/document-library/usb4r-specification-v20">is a much more complex specification</a>. The core spec is over 800 pages long, and that doesn&#8217;t include many of the mechanical specifications of the connector. USB 4.0 also has a method to prevent long runs of consecutive 1&#8217;s or 0&#8217;s, but it&#8217;s much more complex. 
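</p><p>In fact, the USB 1.0 rule is simple enough to express in a few lines of code. The sketch below is a paraphrase of the spec&#8217;s state machine, not production-grade logic: after six consecutive ones, the transmitter inserts a zero so the NRZI-encoded line is forced to transition.</p>

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s (USB 1.0 bit stuffing)."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            out.append(0)  # stuffed bit; the receiver detects and strips it
            run = 0
    return out

print(bit_stuff([1, 1, 1, 1, 1, 1, 1]))  # -> [1, 1, 1, 1, 1, 1, 0, 1]
```

<p>USB 4.0 attacks the same run-length problem in a far more involved way.</p><p>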
It uses a <a href="https://en.wikipedia.org/wiki/Scrambler">scrambler</a>, which randomly shuffles data according to a pseudorandom number generator, or PRNG. There are many pages of specifications for this scrambler, as both the transmitter and receiver need to have their PRNGs synchronized to ensure the message gets properly encoded and decoded. Below is a diagram of how USB 4.0 handles re-synchronizing the transmitter and receiver&#8217;s scramblers, depending on what mode it&#8217;s operating in.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lBAR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lBAR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 424w, https://substackcdn.com/image/fetch/$s_!lBAR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 848w, https://substackcdn.com/image/fetch/$s_!lBAR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 1272w, https://substackcdn.com/image/fetch/$s_!lBAR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lBAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png" width="874" height="894" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:874,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lBAR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 424w, https://substackcdn.com/image/fetch/$s_!lBAR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 848w, https://substackcdn.com/image/fetch/$s_!lBAR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 1272w, https://substackcdn.com/image/fetch/$s_!lBAR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e78ea1b-2dd2-45ff-ada4-46ab102ad6c8_874x894.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.usb.org/document-library/usb4r-specification-v20">USB 4.0 Scrambler</a></figcaption></figure></div><p>As you can see, USB 4.0 is already a lot more complex, and this is just a single diagram that explains a single part of the scrambler resynchronization process. There are pages and pages of additional information on how to properly implement this feature, and it&#8217;s one of many, many complex features in the USB 4.0 specification.</p><p>USB 4.0 also has entirely new features, like <a href="https://www.edn.com/what-is-link-training-and-when-should-i-use-it/">link training</a>. This is a complex handshake between the transmitter and the receiver to tune their equalization settings to make the communication link more reliable. 
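</p><p>The core idea of the scrambler is easy to state, though: both ends run the same linear-feedback shift register (LFSR) and XOR its output with the data stream, so the receiver recovers the original bits only while the two LFSRs remain in lockstep. Here is a toy sketch; the 16-bit register and tap positions are generic illustrative choices, not the polynomial USB 4.0 actually specifies:</p>

```python
def lfsr_stream(seed, n, taps=(16, 5, 4, 3)):
    """Yield n pseudorandom bits from a 16-bit Fibonacci LFSR (illustrative taps)."""
    state = seed
    for _ in range(n):
        fb = 0
        for t in taps:
            fb ^= (state >> (16 - t)) & 1  # XOR the tapped bits for feedback
        yield state & 1
        state = (state >> 1) | (fb << 15)

def scramble(bits, seed):
    """XOR data with the LFSR stream; the same call also descrambles."""
    return [b ^ s for b, s in zip(bits, lfsr_stream(seed, len(bits)))]

data = [1, 0, 1, 1, 0, 0, 1, 0]
tx = scramble(data, seed=0xACE1)
assert scramble(tx, seed=0xACE1) == data  # synchronized PRNGs: data recovered
assert scramble(tx, seed=0x1234) != data  # out of sync: garbage comes out
```

<p>That second assertion is exactly the failure the resynchronization machinery diagrammed above exists to prevent, and link training piles still more complexity on top of it.</p><p>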
USB 1.0 didn&#8217;t need this at all, as it was simple and low-speed enough to get away with straightforward, constant equalization methods. But USB 4.0 has multiple different link training methods, depending on whether you&#8217;re using the USB 4.0 bus to transmit DisplayPort data or regular USB3.0-style serial data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!THP9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!THP9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 424w, https://substackcdn.com/image/fetch/$s_!THP9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 848w, https://substackcdn.com/image/fetch/$s_!THP9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 1272w, https://substackcdn.com/image/fetch/$s_!THP9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!THP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png" width="863" height="543" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:543,&quot;width&quot;:863,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!THP9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 424w, https://substackcdn.com/image/fetch/$s_!THP9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 848w, https://substackcdn.com/image/fetch/$s_!THP9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 1272w, https://substackcdn.com/image/fetch/$s_!THP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cf0651e-367c-42c5-a94e-eaef0447f339_863x543.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.usb.org/document-library/usb4r-specification-v20">USB 4 DisplayPort Link Training</a></figcaption></figure></div><p>All of these features are really important, though. USB 4.0 can run at up to <a href="https://en.wikipedia.org/wiki/USB4">80 Gbit/s</a>, while USB 1.0 only ran at <a href="https://en.wikipedia.org/wiki/USB#USB_1.x">12 Mbit/s</a>. That means that USB 4.0 is over 6000 times faster than USB 1.0, and those performance gains had to come from somewhere. To reliably transmit data so much faster, much more complex transmitter and receiver architectures are necessary. But that makes the process of designing IP cores for these new standards that much harder. 
And it gets even harder when IP core vendors have to support so many different IP cores at once.</p><h1>IP vendors are stretched thin</h1><p>IP vendors have large, complex product portfolios they need to maintain. Even though USB 3.0 and 4.0 have been released, some low-speed, low-cost devices still use USB 2.0 IP cores, and IP vendors are often expected to support those legacy protocols alongside more modern ones. Maintaining those cores isn&#8217;t free, though. Often, customers will want tweaks to customize legacy IP cores, which requires engineering effort that could otherwise be spent developing new IP cores to meet new standards.</p><p>As an example, Synopsys has an <a href="https://www.synopsys.com/designware-ip.html">incredibly wide portfolio of IP cores</a>, and <a href="https://en.wikipedia.org/wiki/Synopsys">about 20,000 employees</a>. Of course, some of those employees work on Synopsys&#8217; EDA products, but a good portion of them work on IP development.
The overhead of managing that many employees to maintain that many products is huge, and prevents Synopsys from moving quickly to assign their best and brightest employees to tackle IP core development for cutting-edge standards.</p><p>But this also creates opportunities for startups, who don&#8217;t have that overhead, to find niches where they can outcompete those big players like Synopsys.</p><h1>Opportunities for disruption</h1><p>As new standards for connectivity, cryptography, video compression, and memory get released, new opportunities for startups are created. As I mentioned, large IP companies like Cadence, Synopsys, and Rambus are primarily focused on maintaining a large, broad product portfolio. Startups, unencumbered by the need to maintain existing product lines, can focus on out-executing incumbents on one particular new standard.</p><p>My last startup, Radical Semiconductor, <a href="https://www.zach.be/p/novel-cryptography-needs-novel-hardware">did this for post-quantum cryptography</a>, and <a href="https://www.prnewswire.com/news-releases/btq-technologies-completes-acquisition-of-radical-semiconductors-processing-in-memory-technology-portfolio-advancing-post-quantum-cryptography-capabilities-302256705.html">we were acquired in 2024</a>. And we&#8217;re not the only example. NGCodec was one of the first providers of FPGA H.265 IP, <a href="https://www.streamingmedia.com/Articles/News/Online-Video-News/Xilinx-Acquires-NGCodec-for-Cloud-Video-Encoding-IP-and-Talent-132841.aspx">and was acquired by Xilinx in 2019</a>. PLDA beat many major IP providers to silicon-ready IP for PCIe 5.0 and CXL 2.0, <a href="https://www.rambus.com/plda/">and were acquired by Rambus in 2021</a>.</p><p>Most of these IP core startups end up being acquired by an existing IP vendor, rather than becoming a successful IP company in their own right. 
But with the <a href="https://www.zach.be/p/yc-is-wrong-about-llms-for-chip-design">advent of AI-powered chip design</a>, a startup could potentially stay at the forefront of new standards and specifications across multiple product lines, and <a href="https://www.zach.be/p/llms-will-power-next-gen-chip-ip">become a legitimate rival to IP giants like Synopsys and Cadence</a>. Whatever a founder&#8217;s ambitions, new standards and specifications have historically been great opportunities for startups to disrupt incumbents, and I suspect that trend will continue.</p>]]></content:encoded></item><item><title><![CDATA[Scaling Thermodynamic Computing]]></title><description><![CDATA[Or: What the hell is Normal Computing actually making?]]></description><link>https://www.zach.be/p/scaling-thermodynamic-computing</link><guid isPermaLink="false">https://www.zach.be/p/scaling-thermodynamic-computing</guid><dc:creator><![CDATA[zach]]></dc:creator><pubDate>Wed, 16 Apr 2025 12:28:25 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif" length="0" type="image/gif"/><content:encoded><![CDATA[<p><em>Note:</em> Since this post was published, I&#8217;ve transitioned to an advisory role at Normal Computing, and taken a full-time role leading hardware security at <a href="https://www.btq.com/">BTQ Technologies</a>, which <a href="https://www.zach.be/p/novel-cryptography-needs-novel-hardware">acquired my previous startup, Radical Semiconductor, in 2024</a>. This post may not reflect the current architecture or plans of Normal Computing by the time you&#8217;re reading this.</p><div><hr></div><p><em>This is a crosspost between my personal blog, <a href="http://zach.be">zach.be</a>, and the official Normal Computing blog (<a href="https://www.normalcomputing.com/blog">https://www.normalcomputing.com/blog</a>)!</em></p><p>Last week, I gave a talk at Normal Computing&#8217;s event at <a href="https://nyc-deep-tech-week.com/">NYC Deep Tech Week</a>. This was a pretty exciting time &#8211; not only did we have a full slate of excellent speakers and panels, but I also got to publicly share some more details &#8211; for the first time ever &#8211; about the chip that we&#8217;re actually building here at Normal. Specifically, I went over how our silicon architecture enables unprecedented scalability for thermodynamic computing, with a focus on large-scale generative design, scientific computing, and probabilistic reasoning algorithms.</p><p>Because not everybody is fortunate enough to live in NYC, I also wanted to write a short blog post summing up what I presented in that talk. If this sort of thing piques your interest, I&#8217;d recommend <a href="https://jobs.ashbyhq.com/Normal%20Computing%20AI">applying to work at Normal</a>. 
We have a world-class team in SF, NYC, London, and Copenhagen, and we&#8217;re entering an important scaling phase now. Our work is <a href="https://www.prnewswire.com/news-releases/normal-computing-selected-for-arias-50m-scaling-compute-programme-to-revolutionize-ai-hardware-costs-302293266.html">in significant part funded</a> by the ambitious Advanced Research + Invention Agency (ARIA) Scaling Compute Program.</p><h1>Scaling laws: today and looking ahead</h1><p>AI workloads are getting bigger, and <a href="https://www.google.com/search?q=semianalysis+300%2C000+gpu+clusters&amp;oq=semianalysis+300%2C000+gpu+clusters&amp;gs_lcrp=EgZjaHJvbWUyBggAEEUYOTIHCAEQIRirAjIHCAIQIRirAjIHCAMQIRirAtIBCDYzMzdqMGoxqAIAsAIA&amp;sourceid=chrome&amp;ie=UTF-8">clusters of more than 300,000 GPUs</a> are in the works in 2025. Frontier labs are starting to do their first <a href="https://semianalysis.com/2024/09/04/multi-datacenter-training-openais/">multi-data-center distributed training runs</a>. 
And in the next 5 years we&#8217;ll need gigawatt data centers to scale these runs by <a href="https://epoch.ai/blog/can-ai-scaling-continue-through-2030#the-current-trend-of-ai-power-demand">10,000x</a>.</p><p>The last few years of AI development have been focused on <em>scaling laws</em>. For LLM pre-training, these laws tell us that with more data and compute, we can predictably improve the quality of our AI models. There are clear paths to getting more data; for example, <a href="https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures/">online video data likely contains trillions of additional tokens to train on</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-duj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec89e7ea-9a72-43c4-97e2-a69e4a81595c_1600x899.png"><img src="https://substackcdn.com/image/fetch/$s_!-duj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec89e7ea-9a72-43c4-97e2-a69e4a81595c_1600x899.png" width="1456" height="818" alt="" loading="lazy"></a><figcaption class="image-caption"><em><a href="https://openai.com/index/learning-to-reason-with-llms/">OpenAI, 2024</a></em></figcaption></figure></div><p>These scaling laws demand ever-increasing amounts of compute, and chip startups have popped up to meet that demand. 
Some chips are trying to deliver more compute through <a href="https://www.zach.be/p/stop-trying-to-make-etched-happen">specialization</a>; others are leveraging emerging technologies like <a href="https://www.zach.be/p/most-ai-chips-arent-cost-effective">processing-in-memory</a> to break through key system-level bottlenecks. These methods may give 5-10x performance improvements over GPUs for some use cases, at the expense of <a href="https://www.zach.be/p/why-etched-probably-wont-beat-nvidia">technical risk or market risk</a>. But even these new chip architectures won&#8217;t be sufficient as AI models continue to scale. Experts believe that, within five years, AI data centers will start <a href="https://epoch.ai/blog/data-movement-bottlenecks-scaling-past-1e28-flop">hitting fundamental memory and compute bottlenecks</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4rDZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0792765-3a7c-4ba9-82e0-1f13c5ee88e1_1400x715.png"><img src="https://substackcdn.com/image/fetch/$s_!4rDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0792765-3a7c-4ba9-82e0-1f13c5ee88e1_1400x715.png" width="1400" height="715" alt="" loading="lazy"></a><figcaption class="image-caption"><em><a href="https://arxiv.org/abs/2403.14123">Gholami et al., 2024</a></em></figcaption></figure></div><p>Current LLMs may be good enough for some tasks, but as we scale to more complex workloads, like <a href="https://lilianweng.github.io/posts/2024-04-12-diffusion-video/">video diffusion</a> or complex reasoning under uncertainty, we&#8217;ll start to hit those bottlenecks even faster. Ultimately, the race to build AI is not just so we can do homework faster or draft emails. From <a href="https://semianalysis.com/2025/03/19/nvidia-gtc-2025-built-for-reasoning-vera-rubin-kyber-cpo-dynamo-inference-jensen-math-feynman/">Jensen</a> to the <a href="https://www.linkedin.com/posts/yann-lecun_jensen-huang-at-gtc-the-future-of-ai-is-activity-7307813334974713856-zfoj/">AI godfathers</a>, experts agree that the next frontier for intelligence is in advancing the capabilities to <em><strong>reason about the physical world.</strong></em></p><p>These applications are the ones we typically find ourselves most excited about: from redesigning hardware and chips themselves, to achieving autonomy in transportation, to scaling higher-fidelity forms of planning and reasoning than the heuristics we have today.</p><h1>Simulating the physical world</h1><p>What&#8217;s interesting about many of these algorithms, which can natively reason about the world, is that there&#8217;s a natural framework for writing them down <em>&#8211;</em> and it isn&#8217;t matrix multiplication. From diffusion to transition path sampling to stochastic linear algebra, we can view these algorithms as simulating thermodynamics. 
Specifically, we can view them as <em><a href="https://en.wikipedia.org/wiki/Langevin_equation">Langevin dynamics</a></em>: a general <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">stochastic differential equation</a>, or SDE, that represents how a system evolves under both deterministic and random forces.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sF2t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338d7e50-2d90-4e82-982d-5c260eff03a3_522x74.png"><img src="https://substackcdn.com/image/fetch/$s_!sF2t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338d7e50-2d90-4e82-982d-5c260eff03a3_522x74.png" width="522" height="74" alt="" loading="lazy"></a></figure></div><p>We&#8217;re focused on accelerating algorithms that can be expressed as these special SDEs. 
Similar to how quantum computers <a href="https://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-02-4969-02">emulate</a> Schrodinger&#8217;s equation, thermodynamic computers emulate the Langevin equation, the natural regime for much of the physical world we care about: a broad class of algorithms for simulating real-world systems can be expressed in this form. For example, SDEs are commonly used to simulate problems in <a href="https://en.wikipedia.org/wiki/Molecular_dynamics">molecular dynamics</a>, <a href="https://en.wikipedia.org/wiki/Computational_fluid_dynamics">fluid dynamics</a>, and <a href="https://en.wikipedia.org/wiki/Computational_materials_science">materials science</a>.</p><p>But these SDEs aren&#8217;t just useful for simulating physics! They also have valuable applications in machine learning. Most straightforwardly, <a href="https://en.wikipedia.org/wiki/Diffusion_model">diffusion ML models</a> are based on stochastic differential equations. But there are also many other ML applications that can be expressed as SDEs; for example, the probabilistic sampling required for reasoning with <a href="https://www.cs.toronto.edu/~duvenaud/distill_bayes_net/public/">Bayesian neural networks</a> and energy-based models can be formulated as the simulation of a Langevin equation.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6XXV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e0a82e-89b4-4ce4-8aac-2b46f3e324d1_1600x1265.png"><img src="https://substackcdn.com/image/fetch/$s_!6XXV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7e0a82e-89b4-4ce4-8aac-2b46f3e324d1_1600x1265.png" width="561" height="443" alt="" loading="lazy"></a><figcaption class="image-caption"><em><a href="https://arxiv.org/abs/2304.08818">Blattmann et al., 2021</a></em></figcaption></figure></div><p>Algorithms that have to simulate and reason about the physical world, whether they&#8217;re powered by classical simulation techniques, by deep learning, or by a combination of both, have the potential to create <a href="https://blogs.nvidia.com/blog/nvidia-keynote-at-gtc-2025-ai-news-live-updates/#:~:text=Physical%20AI%20for%20industrial%20and,Cosmos%20platforms%20leading%20the%20way.">$50T+</a> in value. But to actually deliver on that promise, they need to break through key computational bottlenecks. Sampling from the trajectory of an SDE is difficult on current hardware. At Normal Computing, we&#8217;re building new hardware to accelerate sampling, and new algorithms that can reason about the physical world.</p><h1>In practice: current hardware for sampling is limited</h1><p>Back in the early days of computational methods, engineers would <a href="https://www.columbia.edu/~mh2078/MonteCarlo/MCS_SDEs.pdf">simulate SDEs</a> and run sampling algorithms, like <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo</a>, on CPUs. These algorithms can accurately approximate continuous state spaces and, given enough samples, get asymptotically close to arbitrary probability distributions. But these SDE simulations and Markov chain algorithms are fundamentally serial, drawing one sample at a time from a chain. 
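</p><p>To make that serial bottleneck concrete, here&#8217;s a minimal illustrative sketch (my own example, not production code) of a random-walk Metropolis sampler; each proposal is built from the previously accepted state, so the loop can&#8217;t be parallelized across iterations:</p>

```python
import math
import random

def metropolis_chain(log_prob, x0, n_steps, step_size=0.5, seed=0):
    """Sample a 1-D target density with random-walk Metropolis.

    The chain is inherently serial: each proposal is perturbed from the
    previous accepted state, so step k must finish before step k+1 starts.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.gauss(0.0, 1.0)
        # Accept with probability min(1, p(proposal) / p(x)).
        if math.log(rng.random() + 1e-300) < log_prob(proposal) - log_prob(x):
            x = proposal
        samples.append(x)
    return samples

# Target: standard normal, log p(x) = -x^2 / 2 up to a constant.
samples = metropolis_chain(lambda x: -0.5 * x * x, x0=5.0, n_steps=20000)
# The chain starts far from the mode, so early samples are biased
# ("burn-in") and must be discarded before computing statistics.
kept = samples[5000:]
mean = sum(kept) / len(kept)
```

<p>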
This limits their performance and scalability, especially as the scale and complexity of sampling problems grow.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-1ao!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49ac0ee-e76a-46a7-84e9-26e84d46f304_1600x325.png"><img src="https://substackcdn.com/image/fetch/$s_!-1ao!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc49ac0ee-e76a-46a7-84e9-26e84d46f304_1600x325.png" width="1456" height="296" alt="" loading="lazy"></a><figcaption class="image-caption"><em>Sampling: CPUs, GPUs, p-bits, analog s-units</em></figcaption></figure></div><p>Eventually, people started applying GPU acceleration to sampling algorithms and to SDEs, inspired by the unreasonable effectiveness of GPU acceleration on other workloads, like conventional deep learning. 
GPUs offer <a href="https://arxiv.org/abs/2411.04260v1">better performance through parallelism</a>, but the effectiveness of that parallelism is limited on these sorts of algorithms. Running multiple parallel Markov chains generates more samples faster, but <a href="https://link.springer.com/article/10.1007/s11222-022-10116-z#:~:text=While%20separate%20MCMC%20chains%20can,each%20chain%20would%20not%20change.">doesn&#8217;t eliminate</a> <a href="https://synergy.cs.vt.edu/pubs/papers/wanye-hybrid-sbp-icpp2022.pdf">burn-in time</a> or <a href="https://proceedings.mlr.press/v151/de-souza22a/de-souza22a.pdf">other statistical artifacts</a>; that can only be improved by running longer chains, rather than more chains.</p><p>To really break through performance bottlenecks, we would need an ASIC for sampling. Since 2019, researchers at Purdue and UCSB have been working on probabilistic bits, or p-bits. Each p-bit takes on a probabilistic value between 0 and 1, and interacts with other p-bits through a sparse matrix of interaction terms. These systems can sample from <a href="https://arxiv.org/abs/2110.02481">certain classes of distributions extremely quickly</a>, outputting a valid sample every single clock cycle.</p><p>However, they have to make a lot of sacrifices to achieve that performance. First of all, they only sample from distributions with binary state variables. While a clever programmer can <a href="https://ieeexplore.ieee.org/abstract/document/10873478">construct more complex states from binary state variables</a>, doing so requires significant overhead. Constructing 32-bit state variables from single-bit p-bits requires <a href="https://ieeexplore.ieee.org/abstract/document/10873478">1024x as many coupling terms</a>. Also, non-sparse interaction matrices can cause frustrated sampling, and significantly degrade sample quality. 
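</p><p>To make that encoding overhead concrete, here is one plausible back-of-the-envelope reading of the 1024x figure (my own sketch, not the cited paper&#8217;s exact accounting): encoding each b-bit variable in single-bit p-bits turns every pairwise interaction between two variables into a b-by-b block of binary couplings.</p>

```python
def pairwise_terms(n_vars: int) -> int:
    """Interaction terms in a dense matrix over n_vars variables."""
    return n_vars * (n_vars - 1) // 2

def binary_encoded_terms(n_vars: int, bits: int) -> int:
    """Each pairwise interaction between two bits-wide variables
    becomes a bits x bits block of single-bit coupling terms."""
    return pairwise_terms(n_vars) * bits * bits

n, bits = 100, 32
overhead = binary_encoded_terms(n, bits) // pairwise_terms(n)
print(overhead)  # 32 * 32 = 1024x more couplings than native multi-bit states
```

<p>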
p-bits are well suited to some problems, including simulating spin glasses and binary optimization problems, but I don&#8217;t believe them to be well-suited for general sampling problems. <a href="https://www.ludwigcomputing.com/">Multiple</a> startups have been <a href="https://www.cobi.tech/">founded</a> to try to commercialize p-bit technology, but all seem to be lagging behind Camsari&#8217;s group in terms of practical scalability, so I&#8217;m primarily citing his group&#8217;s work, which represents the state-of-the-art in p-bits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0gWf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0gWf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 424w, https://substackcdn.com/image/fetch/$s_!0gWf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 848w, https://substackcdn.com/image/fetch/$s_!0gWf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 1272w, https://substackcdn.com/image/fetch/$s_!0gWf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0gWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png" width="1456" height="389" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:389,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0gWf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 424w, https://substackcdn.com/image/fetch/$s_!0gWf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 848w, https://substackcdn.com/image/fetch/$s_!0gWf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 1272w, https://substackcdn.com/image/fetch/$s_!0gWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5d203a7-8335-4880-9072-c8db27bb5637_1600x428.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Table: CPUs, GPUs, p-bits, analog s-units</em></figcaption></figure></div><p>At Normal, we&#8217;re working on building sampling hardware that&#8217;s expressive, scalable, and reconfigurable. <a href="https://arxiv.org/abs/2312.04836v1">Our initial proof-of-principle was all-analog</a>, and proves that custom sampling hardware with full-precision states and fully parallel updates is possible. The big challenge with purely analog sampling hardware is its lack of reconfigurability; that first prototype could only sample from Gaussian distributions, which limited its ability to support general sampling workloads. 
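</p><p>As a software stand-in for what that analog prototype does physically, here is a minimal Euler-Maruyama integration of an overdamped Langevin SDE whose stationary distribution is a zero-mean Gaussian. This is my own illustrative sketch, not the hardware&#8217;s actual dynamics:</p>

```python
import math
import random

def langevin_gaussian(n_steps, dt=0.01, sigma=1.0, x0=0.0):
    """Euler-Maruyama integration of dx = -(x / sigma^2) dt + sqrt(2) dW.
    The stationary distribution is N(0, sigma^2)."""
    x = x0
    samples = []
    for _ in range(n_steps):
        drift = -x / (sigma * sigma)
        diffusion = math.sqrt(2.0 * dt) * random.gauss(0.0, 1.0)
        x = x + drift * dt + diffusion
        samples.append(x)
    return samples

random.seed(1)
xs = langevin_gaussian(20000)
tail = xs[2000:]  # drop the initial transient
var = sum(v * v for v in tail) / len(tail)
# var should land near sigma^2 = 1, up to discretization and sampling error
```

<p>The analog chip evolves this kind of dynamics in continuous time rather than step by step in software, with physical noise playing the role of the dW term.</p><p>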
Our new architecture takes the key advantage of that first prototype &#8211; namely, its support for parallel updates of full-precision state variables &#8211; and develops it into a reconfigurable design that can scale up in silicon.</p><h1>Normal Computing: Precise, expressive, and scalable</h1><p>As I mentioned earlier, we want to be able to sample from a complicated, nonlinear distribution with a high-dimensional, continuous state space. This means that our sampling hardware must be able to:</p><ul><li><p>Express this state space precisely, even for continuous variables.</p></li><li><p>Support expressive, parameterizable distributions.</p></li><li><p>Scale to large numbers of state variables.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F5Uy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F5Uy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 424w, https://substackcdn.com/image/fetch/$s_!F5Uy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 848w, https://substackcdn.com/image/fetch/$s_!F5Uy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F5Uy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F5Uy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png" width="1456" height="752" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:752,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F5Uy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 424w, https://substackcdn.com/image/fetch/$s_!F5Uy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 848w, https://substackcdn.com/image/fetch/$s_!F5Uy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 1272w, 
https://substackcdn.com/image/fetch/$s_!F5Uy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F889a21e7-c3ff-4fa0-a994-fd3d4be4830f_1600x826.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Precise, Expressive, Scalable</em></figcaption></figure></div><p>Our architecture, Carnot, is designed from the ground up to solve these three challenges.</p><p>Its state variables are inherently multi-level, enabling us to accurately represent 32-bit numbers without an explosion in the number of required interaction terms. 
And instead of limiting our nonlinear functions to just sigmoid or ReLU, we&#8217;re implementing reconfigurable nonlinear functions, parameterized as the linear combination of an expressive set of basis functions. Notably, this means that we need to store and compute on our state variables digitally. In this architecture, the states of s-units are no longer <a href="https://arxiv.org/abs/2312.04836">analog voltages on capacitors</a>, but instead 32-bit numbers in digital registers.</p><p>Finally, we&#8217;re implementing a scalable and efficient interaction matrix using some clever architectural techniques. This is one of the most important aspects when it comes to scaling up a system, as an N-dimensional system requires N<sup>2</sup> interaction terms. That means that large interaction terms implemented as <a href="https://arxiv.org/pdf/2110.02481">binary multipliers</a> will struggle to scale for dense interaction matrices; it also means that many all-analog approaches will struggle due to <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&amp;arnumber=10558647">parasitic and mismatch effects</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H309!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H309!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 424w, https://substackcdn.com/image/fetch/$s_!H309!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 
848w, https://substackcdn.com/image/fetch/$s_!H309!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 1272w, https://substackcdn.com/image/fetch/$s_!H309!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H309!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png" width="1456" height="705" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:705,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H309!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 424w, https://substackcdn.com/image/fetch/$s_!H309!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 848w, 
https://substackcdn.com/image/fetch/$s_!H309!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 1272w, https://substackcdn.com/image/fetch/$s_!H309!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd6c97e3-0f43-43fb-8e26-bb0136fb0025_1600x775.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Significant power and area efficiency that will lead to high scalability</em></figcaption></figure></div><p>More 
concretely, the Carnot architecture will feature a set of compute tiles, connected together with a network-on-chip (NoC). Each tile will contain a set of digital s-units to store state, a set of reconfigurable nonlinear function units, and an efficient and compact interaction matrix,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> to compute the interactions between states as the system updates. Mapping a sampling problem to this chip entails <a href="https://en.wikipedia.org/wiki/Fokker%E2%80%93Planck_equation">describing the desired distribution with a stochastic differential equation</a> (SDE), and using the chip to evolve that SDE over time.</p><p>By combining these three techniques, we can build a system that solves the three biggest challenges of designing efficient sampling hardware:</p><ul><li><p>Multi-bit s-units allow a single tile to quickly compute complicated quantities, like a 64x64 linear system solution to 32-bit precision.</p></li><li><p>By expressing nonlinearities as a linear combination of basis functions, we can support many different kinds of distributions.</p></li><li><p>Our interaction matrix achieves high TOPS/(Wum<sup>2</sup>), representing significant power and area efficiency that will lead to high scalability.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!soqC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!soqC!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 424w, 
https://substackcdn.com/image/fetch/$s_!soqC!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 848w, https://substackcdn.com/image/fetch/$s_!soqC!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 1272w, https://substackcdn.com/image/fetch/$s_!soqC!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!soqC!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif" width="543" height="434.4" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:800,&quot;width&quot;:1000,&quot;resizeWidth&quot;:543,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!soqC!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 424w, https://substackcdn.com/image/fetch/$s_!soqC!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 848w, 
https://substackcdn.com/image/fetch/$s_!soqC!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 1272w, https://substackcdn.com/image/fetch/$s_!soqC!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e08cae-8430-45ab-99a4-a061e7da5904_1000x800.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Transition path sampling: emulations of Normal silicon</em></figcaption></figure></div><h1>Our silicon roadmap</h1><p>We&#8217;re not building this technology alone. We&#8217;re currently partnered with half a dozen of the most important institutions in the world, and we&#8217;ve deployed our EDA and other physical-world AI products at some of these companies. Together, we&#8217;re building a roadmap to establish standards for this compute paradigm.</p><p>Our chip architecture is fundamentally scalable, but a significant number of engineering challenges still need to be solved to scale this system to production.</p><p>We&#8217;ll be demonstrating our first chip, which we&#8217;re dubbing Carnot CN101, this year. With four compute tiles containing 64 state variables each, and a reconfigurable NoC for tile-to-tile communication, this chip will support 256-dimensional problems with 32-bit state variables. We&#8217;ll de-risk key engineering challenges, including the asynchronous parallel state updates that enable our architecture&#8217;s high performance. 
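</p><p>Those asynchronous updates can be illustrated, loosely, in software with a random-scan Gibbs sampler: coordinates update one at a time, in arbitrary order, each using only the current values of the others, with no global schedule. This is my own analogy, not the actual Carnot update rule:</p>

```python
import random

def random_scan_gibbs(n_steps, rho=0.8, seed=2):
    """Random-scan Gibbs sampling of a correlated bivariate Gaussian.
    Coordinates are updated one at a time, in random order, each using
    only the current value of the other -- no global update schedule."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    cond_std = (1.0 - rho * rho) ** 0.5
    samples = []
    for _ in range(n_steps):
        i = rng.randrange(2)  # pick a coordinate at random
        j = 1 - i
        # Conditional: x_i | x_j ~ N(rho * x_j, 1 - rho^2)
        x[i] = rng.gauss(rho * x[j], cond_std)
        samples.append((x[0], x[1]))
    return samples

pairs = random_scan_gibbs(20000)
tail = pairs[2000:]
corr_est = sum(a * b for a, b in tail) / len(tail)
# corr_est should land near rho = 0.8 (both marginals are standard normal)
```

<p>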
CN101 is still a test chip, with relatively low-speed I/O interfaces, but it will serve as the first demonstration that scalable thermodynamic computing is a reality.</p><p>In 2026, we want to make key architectural and circuit-level improvements to the Carnot architecture. We&#8217;ll have a more compact interaction matrix, a network-on-chip capable of supporting many more tiles, and support for additional vector math operations. More importantly, we&#8217;ll introduce on-chip control and caching hardware so that the chip can sample from multiple different distributions sequentially without off-chip data transfer bottlenecks. We could even sample from one distribution parameterized by samples from a different distribution!</p><p>Finally, to bring our system to production scale, we need to contend with all of the practical engineering challenges of developing reticle-size chips in modern process nodes with modern interface IP. This is a huge engineering challenge, but once we solve it, we&#8217;ll be able to deploy thermodynamic computers at data center scale. Each chip will support tens of thousands of state variables and sample from complex distributions orders of magnitude more efficiently than GPUs. That means the computational bottlenecks in key algorithms, including physical simulation workloads and probabilistic machine learning, will be eliminated. This will, in turn, enable new sorts of AI models to reason about the physical world and reason under uncertainty.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>All performance numbers for conventional approaches are normalized to the process node we&#8217;re manufacturing our first test chips in.</p></div></div>]]></content:encoded></item></channel></rss>