In a preprint paper coauthored by Google AI lead Jeff Dean, scientists at Google Research and the Google chip implementation and infrastructure team describe a learning-based approach to chip design that can learn from past experience and improve over time, becoming better at generating architectures for unseen components. They claim it completes designs in under six hours on average, compared with the weeks the same work takes with human experts in the loop.
While the work isn’t entirely novel — it builds upon a technique proposed by Google engineers in a paper published in March — it advances the state of the art in that it implies the placement of on-chip transistors can be largely automated. If made publicly available, the Google researchers’ technique could enable cash-strapped startups to develop their own chips for AI and other specialized purposes. Moreover, it could help to shorten the chip design cycle to allow hardware to better adapt to rapidly evolving research.
“Basically, right now in the design process, you have design tools that can help do some layout, but you have human placement and routing experts work with those design tools to kind of iterate many, many times over,” Dean told VentureBeat in an interview late last year. “It’s a multi-week process to actually go from the design you want to actually having it physically laid out on a chip with the right constraints in area and power and wire length and meeting all the design rules or whatever fabrication process you’re doing. We can essentially have a machine learning model that learns to play the game of [component] placement for a particular chip.”
The coauthors’ approach aims to place a “netlist” graph of logic gates, memory, and more onto a chip canvas, such that the design optimizes power, performance, and area (PPA) while adhering to constraints on placement density and routing congestion. The graphs range in size from millions to billions of nodes grouped in thousands of clusters, and typically, evaluating the target metrics takes from hours to over a day.
The researchers devised a framework that directs an agent trained through reinforcement learning to optimize chip placements. (Reinforcement learning agents are spurred to complete goals via rewards; in this case, the agent learns to make placements that will maximize cumulative reward.) Given the netlist, the ID of the current node to be placed, and the metadata of the netlist and the semiconductor technology, a policy model outputs a probability distribution over available placement locations, while a value model estimates the expected reward for the current placement.
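To make "a probability distribution over available placement locations" concrete, here is a minimal sketch of one common way to get it: a softmax over per-cell scores with unavailable cells masked out. The function name, the four-cell canvas, and the scores are all hypothetical; the article does not say how the paper's policy model computes its output.

```python
import math

def masked_softmax(logits, available):
    """Turn raw per-cell scores into probabilities over available cells only.

    Illustrative assumption: the policy emits one score per canvas cell and
    cells that are occupied or infeasible get probability zero.
    """
    if not any(available):
        raise ValueError("no available placement location")
    # Subtract the largest available score for numerical stability.
    peak = max(score for score, ok in zip(logits, available) if ok)
    exps = [math.exp(score - peak) if ok else 0.0
            for score, ok in zip(logits, available)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a 2x2 canvas flattened to four cells; cell 1 is taken.
probs = masked_softmax([1.0, 3.0, 1.0, 0.5], [True, False, True, True])
```

The masked cell receives zero probability even though it has the highest raw score, so the agent can never sample an unavailable location.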
In practice, starting with an empty chip, the agent places components sequentially until it completes the netlist. It doesn’t receive a reward until the end, when a negative weighted sum of proxy wirelength (which correlates with power and performance) and congestion is tabulated, subject to density constraints. To guide the agent in selecting which components to place first, components are sorted by descending size; placing larger components first reduces the chance that no feasible placement remains for them later.
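The ordering rule and the end-of-episode reward can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the component names and areas, the congestion weight of 0.5, and the metric values are all made up, and the density constraint is left out.

```python
def placement_order(component_areas):
    """Return component names sorted by descending area (largest first)."""
    ranked = sorted(component_areas.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _area in ranked]

def terminal_reward(wirelength, congestion, congestion_weight=0.5):
    """Negative weighted sum of proxy wirelength and congestion.

    congestion_weight is a hypothetical value chosen for the example.
    """
    return -(wirelength + congestion_weight * congestion)

def episode_rewards(num_components, wirelength, congestion):
    """Reward is zero at every placement step except the last one."""
    rewards = [0.0] * num_components
    rewards[-1] = terminal_reward(wirelength, congestion)
    return rewards

# Hypothetical component areas and final metrics.
order = placement_order({"sram": 40.0, "alu": 10.0, "io": 25.0})
rewards = episode_rewards(len(order), wirelength=100.0, congestion=20.0)
```

Here `order` is `["sram", "io", "alu"]` and `rewards` is `[0.0, 0.0, -110.0]`: the agent gets no signal until the final component is placed.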
Training the agent required creating a data set of 10,000 chip placements, where the input is the state associated with the given placement and the label is the reward for the placement (i.e., wirelength and congestion). The researchers built it by first picking five different chip netlists, to which an AI algorithm was applied to create 2,000 diverse placements for each netlist.
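The shape of such a data set, five netlists with 2,000 labeled placements each, can be sketched as follows. Everything beyond those two counts is a stand-in: random placements replace the unnamed algorithm the researchers used, and a toy Manhattan-distance score replaces the real wirelength and congestion label.

```python
import random
from dataclasses import dataclass

@dataclass
class LabeledPlacement:
    netlist_id: int
    positions: list   # one (x, y) grid cell per component: the input "state"
    reward: float     # the label

def random_placement(num_components, grid, rng):
    """Stand-in generator: distinct random grid cells, one per component."""
    cells = [(x, y) for x in range(grid) for y in range(grid)]
    return rng.sample(cells, num_components)

def toy_reward(positions):
    """Stand-in label: negative Manhattan distance between consecutive components."""
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay), (bx, by) in zip(positions, positions[1:]))
    return -float(total)

def build_dataset(num_netlists=5, per_netlist=2000, num_components=4, grid=8, seed=0):
    """Build num_netlists * per_netlist (state, reward) examples."""
    rng = random.Random(seed)
    data = []
    for netlist_id in range(num_netlists):
        for _ in range(per_netlist):
            positions = random_placement(num_components, grid, rng)
            data.append(LabeledPlacement(netlist_id, positions, toy_reward(positions)))
    return data

dataset = build_dataset()
```

With the default arguments, `len(dataset)` is 10,000, matching the figure reported above.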
In experiments, the coauthors report that as they trained the framework on more chips, they were able to speed up the training process and generate high-quality results faster. In fact, they claim it achieved superior PPA on in-production Google tensor processing units (TPUs) — Google’s custom-designed AI accelerator chips — as compared with leading baselines.
“Unlike existing methods that optimize the placement for each new chip from scratch, our work leverages knowledge gained from placing prior chips to become better over time,” concluded the researchers. “In addition, our method enables direct optimization of the target metrics, such as wirelength, density, and congestion, without having to define … approximations of those functions as is done in other approaches. Not only does our formulation make it easy to incorporate new cost functions as they become available, but it also allows us to weight their relative importance according to the needs of a given chip block (e.g., timing-critical or power-constrained).”
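The per-block weighting the researchers describe amounts to a weighted sum whose weights change with the block's priorities. A minimal sketch, with hypothetical metric values and weight profiles that do not come from the paper:

```python
def weighted_cost(metrics, weights):
    """Weighted sum over every metric that has a weight.

    Adding a new cost function means adding one entry to each dictionary.
    """
    missing = set(weights) - set(metrics)
    if missing:
        raise KeyError(f"no measurement for: {sorted(missing)}")
    return sum(weights[name] * metrics[name] for name in weights)

# Hypothetical measurements for one candidate placement.
metrics = {"wirelength": 100.0, "congestion": 20.0, "density": 5.0}

# Hypothetical weight profiles for two kinds of chip block.
timing_critical = {"wirelength": 1.0, "congestion": 0.5, "density": 0.1}
power_constrained = {"wirelength": 2.0, "congestion": 0.25, "density": 0.1}

timing_cost = weighted_cost(metrics, timing_critical)
power_cost = weighted_cost(metrics, power_constrained)
```

The same placement scores differently under each profile, which is how one formulation can serve blocks with different priorities.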