Solving the Randomness Bottleneck in Stochastic Computing
Could AI revive an architecture from the 1950s?
Stochastic computing is an unconventional computing paradigm that dates back to the 1950s. Introduced by John von Neumann himself, this method of computing was attractive at the time because of the limitations of early computer systems. Components were bulky and often unreliable, which meant that they couldn’t be used to build modern logic circuits.
Stochastic computing solved this issue in an elegant way. Operands are represented with stochastic numbers: random bitstreams that switch back and forth between 0 and 1 over time. The value that an operand represents is equal to the probability that its bit is 1 at any given point in time.
In conventional digital logic, a multiplier is a complex, large circuit. But in the world of stochastic computing, a multiplier is just a single AND gate! For early computers where gates were at a premium, this was a big deal. And because all of our numbers are stochastic, these systems can tolerate some amount of errorsl. If there are a few bits flipped in our bitstream, it’s no big deal!
But as computing technology improved, stochastic computing declined. Logic gates became small and reliable, so many of the advantages of stochastic computing systems stopped mattering. At the same time, one of the major disadvantages of stochastic computing began to rear its ugly head: stochastic computing requires a lot of random values. If those random values are correlated with one another, the system can fail in bizarre ways.
One classic failure case is multiplying a number by itself. If we generate one bitstream for a value n and calculate n2 with an AND gate, our calculation returns n2 = n. To solve this problem, we need to generate two uncorrelated bitstreams, na and nb, using different noise sources, and multiply those together with an AND gate to get the real value of n2.
Essentially, this means that we need many different random noise sources to make our stochastic system work. Even if we generate noise pseudorandomly using simple techniques, the overhead of generating all this noise makes stochastic systems less efficient than conventional systems for general purpose processing.
But what if we don’t need to do general purpose processing?
Stochastic Computing for AI Acceleration
In my opinion, stochastic computing shows a lot of promise for AI applications. Current AI chips are large, power-hungry behemoths that simply can’t keep up with new AI workloads. Stochastic computing offers relatively compact and low-power logic functions that could help AI chips overcome those limitations. At the same time, AI workloads can tolerate some small errors in their outputs, so the main drawback of stochastic computing -- its inherent lack of exactness -- isn’t a disqualifier.
I believe that stochastic computing has a particularly promising future in Bayesian machine learning. Bayesian machine learning models are unique in their ability to express uncertainty: a Bayesian model will not only tell you what its output is, but also how likely that output is to be correct. This makes them well-suited to critical applications in the financial, medical, and industrial sectors.
However, Bayesian models are very expensive to run on conventional computers. They often require the same operation to be run hundreds or thousands of times on different data samples to generate probability distributions. Stochastic computing architectures already natively operate on probability distributions, which makes them well-suited to implement Bayesian models efficiently at scale. Khoram et al and Xia et al both demonstrate the effectiveness of stochastic computing for Bayesian models, each achieving significant improvements in speed and power consumption over non-stochastic architectures.
One other benefit of stochastic computing is its ability to gracefully tolerate errors. This opens up a new set of potential optimizations that offer significant efficiency gains at the cost of occasional errors in the bitstream. For example, voltage over-scaling is a common approximate computing technique that lowers a chip’s supply voltage below safe limits to save a significant amount of power at the cost of occasionally introducing errors. Similar techniques could be applied to stochastic computing schemes to offer unprecedented efficiency for AI applications.
But none of these benefits address the elephant in the room: the number of random noise sources we need in order to implement a stochastic computing system. AI algorithms have many different variables, with millions or billions of weights and large input and output vectors. Shouldn’t that require an impractical number of noise sources?
Luckily, it doesn’t. Thanks to some new research from the University of Washington, plus careful analysis of operations required for accelerating AI processing, we can significantly reduce the number of noise sources required for a stochastic AI accelerator. Here’s how.
Matrix Multiplication with Only Two Noise Sources
Matrix multiplication is the core operation behind many different AI applications. And matrix multiplication is, at its core, a very simple operation: it only consists of multiplication and addition.
Traditionally, all stochastic operations require uncorrelated inputs, but in 2018, a paper from Vincent Lee at the University of Washington introduced a new stochastic adder that can operate on correlated inputs. That means we now only need two noise sources: one for the A matrix, and one for the B vector.
I wrote up a simple python script as a proof of concept, and got these results:
C1 Expected: 4.625 | C1 Actual: 4.625049591064453
C2 Expected: 2.625 | C2 Actual: 2.6243629455566406
As you can see, this method successfully implements a stochastic matrix multiplier. If this was implemented on an FPGA, it could potentially offer high-performance, low-overhead matrix multiplication of stochastic values.
But an ASIC implementation would be even more exciting. An ASIC could leverage techniques like voltage over-scaling that would occasionally introduce errors into the bitstream, but significantly reduce the power consumption of the system. Because stochastic computing can withstand the errors, this could offer significant improvements over state-of-the-art in power-efficient AI acceleration.
Only one random number source is needed since permutating the random source could generate another random source pair of minimum correlation. You can refer to this paper from 2020, 10.1109/TVLSI.2019.2963678