DeepMind’s MuZero picks up the rules of games as it plays

In a paper published in the journal Science late last year, Google parent company Alphabet’s DeepMind detailed AlphaZero, an AI system that could teach itself to master the game of chess, a Japanese variant of chess called shogi, and the Chinese board game Go. In each case, it beat a world champion, demonstrating a knack for learning two-person games with perfect information — that is to say, games where any decision is informed by all the previous events.

But AlphaZero had the advantage of knowing the rules of games it was tasked with playing. In pursuit of a performant machine learning model capable of teaching itself the rules, a team at DeepMind devised MuZero, which combines a tree-based search (where a tree is a data structure used for locating information from within a set) with a learned model. As described in a Nature paper published today, MuZero predicts the quantities most relevant to game planning such that it achieves industry-leading performance on 57 different Atari games and matches the performance of AlphaZero in Go, chess, and shogi.

David Silver, who leads the reinforcement learning group at DeepMind, says MuZero paves the way for learning methods in a host of real-world domains, particularly those lacking a simulator or dynamics rules. “We think this really matters for enriching what AI can actually do because the world is a messy place. It’s unknown. No one gives us this amazing rulebook that says, ‘Oh, this is exactly how the world works,’” he told VentureBeat in a phone interview last week. “If we want our AI to go out there into the world and be able to plan and look ahead in problems where no one gives us the rulebook, we really, really need this.”

Model-based reinforcement learning

The ability to plan allows humans to solve problems and quickly make decisions about the future. In the AI domain, researchers have attempted to replicate this by using approaches called lookahead tree search or model-based planning. Systems that use lookahead search, such as AlphaZero, have achieved remarkable success in classic games like checkers, chess, and even poker.

But lookahead search requires knowledge of an environment’s dynamics, like the rules of a game or an accurate physics simulator. Model-based systems aim to address this by learning a detailed model of an environment and using it to plan, yet the complexity of modeling has historically kept these algorithms from competing in visually rich domains.

Enter MuZero, which combines a model with AlphaZero’s lookahead tree search. Instead of trying to model an entire environment using an algorithm, MuZero models only aspects it determines are important to the decision-making process.

MuZero receives observations (e.g., images of a Go board or Atari screen) and transforms them into a mathematical representation called a “hidden state.” This hidden state is updated iteratively by a process that receives the previous state and a hypothetical next action. At every step, the model predicts the policy (e.g., the move to play), value function (e.g., the predicted winner), and immediate reward (e.g., the points scored by playing a move).

Intuitively, MuZero internally invents game rules or dynamics that lead to accurate planning.
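The three learned pieces described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepMind’s implementation: the names `represent`, `dynamics`, and `predict`, the tiny dimensions, and the random weights standing in for trained neural networks are all hypothetical.

```python
import math
import random

random.seed(0)
OBS_DIM, HIDDEN_DIM, NUM_ACTIONS = 16, 8, 4  # hypothetical sizes


def random_matrix(rows, cols):
    """Random weights standing in for a trained network (assumption)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


W_REPR = random_matrix(HIDDEN_DIM, OBS_DIM)
W_DYN = random_matrix(HIDDEN_DIM, HIDDEN_DIM + NUM_ACTIONS)
W_REWARD = random_matrix(1, HIDDEN_DIM + NUM_ACTIONS)
W_POLICY = random_matrix(NUM_ACTIONS, HIDDEN_DIM)
W_VALUE = random_matrix(1, HIDDEN_DIM)


def represent(observation):
    """Observation -> hidden state."""
    return [math.tanh(v) for v in matvec(W_REPR, observation)]


def dynamics(hidden, action):
    """(hidden state, hypothetical action) -> (next hidden state, predicted reward)."""
    one_hot = [1.0 if i == action else 0.0 for i in range(NUM_ACTIONS)]
    x = hidden + one_hot
    next_hidden = [math.tanh(v) for v in matvec(W_DYN, x)]
    return next_hidden, matvec(W_REWARD, x)[0]


def predict(hidden):
    """Hidden state -> (policy probabilities, value estimate in [-1, 1])."""
    logits = matvec(W_POLICY, hidden)
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps], math.tanh(matvec(W_VALUE, hidden)[0])


def unroll(observation, actions):
    """Imagine a sequence of actions without touching the real environment."""
    hidden = represent(observation)
    steps = []
    for action in actions:
        hidden, reward = dynamics(hidden, action)
        policy, value = predict(hidden)
        steps.append({"policy": policy, "value": value, "reward": reward})
    return steps
```

Calling `unroll([0.5] * OBS_DIM, [0, 2, 1])` would return one policy, value, and reward prediction per imagined action. Note that nothing in the loop reconstructs the observation itself, which is the sense in which the model keeps only what planning needs.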

As the DeepMind researchers explain, one form of reinforcement learning (the technique at the heart of MuZero and AlphaZero, in which rewards drive an AI agent toward goals) involves models. This form models a given environment as an intermediate step, using a state transition model that predicts the next state and a reward model that predicts the expected reward.

Model-based reinforcement learning commonly focuses on directly modeling the observation stream at the pixel level, but this level of granularity is computationally expensive in large-scale environments. In fact, no prior method has constructed a model that facilitates planning in visually complex domains such as Atari, and the results of model-based methods have lagged behind those of well-tuned model-free methods, even in terms of data efficiency.

For MuZero, DeepMind instead pursued an approach focusing on end-to-end prediction of a value function, where an algorithm is trained so the expected sum of rewards matches the expected value with respect to real-world actions. The system has no semantics of the environment state but simply outputs policy, value, and reward predictions, which an algorithm similar to AlphaZero’s search (albeit generalized to allow for single-agent domains and intermediate rewards) uses to produce a recommended policy and estimated value. These in turn are used to inform an action and the final outcomes in played games.
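One common way to turn “the expected sum of rewards” into a training target is the discounted n-step return: add up the rewards observed over the next few steps, then add a discounted estimate of what follows. The sketch below shows only that arithmetic. The function name, the default discount of 0.997, and the sample numbers are assumptions chosen for illustration, not details taken from the article.

```python
def n_step_value_target(rewards, bootstrap_value, discount=0.997):
    """Discounted sum of observed rewards plus a discounted bootstrap value.

    rewards: rewards observed over the next n steps.
    bootstrap_value: value estimate for the state reached after those steps.
    """
    target = 0.0
    for step, reward in enumerate(rewards):
        target += (discount ** step) * reward
    return target + (discount ** len(rewards)) * bootstrap_value


# Hypothetical numbers: three observed rewards, then an estimated value of 4.
example_target = n_step_value_target([1.0, 2.0, 3.0], 4.0, discount=0.5)
```

Training the value prediction toward a target like this is what ties the model’s output to the rewards the agent actually went on to collect.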

Training and experimentation

The DeepMind team applied MuZero to the classic board games Go, chess, and shogi as benchmarks for challenging planning problems, and to all 57 games in the open source Arcade Learning Environment as benchmarks for “visually complex” reinforcement learning domains. They trained the system for five hypothetical steps and a million mini-batches (i.e., small batches of training data) of size 2,048 in board games and size 1,024 in Atari, using 800 simulations per search in Go, chess, and shogi and 50 simulations per search in Atari.

With respect to Go, MuZero slightly exceeded the performance of AlphaZero despite using less overall computation, which the researchers say is evidence it might have gained a deeper understanding of its position. As for Atari, MuZero achieved a new state of the art for both mean and median normalized scores across the 57 games, outperforming the previous state-of-the-art method (R2D2) in 42 out of 57 games and outperforming the previous best model-based approach in all games.
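The article does not spell out how scores are normalized. A common convention in Atari benchmarks, assumed here, rescales each game’s raw score so that random play maps to 0% and a human reference maps to 100%, then reports the mean and median across games. The game names and raw scores below are made up purely to show the arithmetic.

```python
from statistics import mean, median


def normalized_score(agent, random_baseline, human_baseline):
    """Percent of the gap between random play and a human reference closed by the agent."""
    return 100.0 * (agent - random_baseline) / (human_baseline - random_baseline)


# Made-up raw scores: (agent, random, human) for three hypothetical games.
raw_scores = {
    "game_a": (900.0, 100.0, 500.0),
    "game_b": (30.0, 10.0, 50.0),
    "game_c": (2100.0, 100.0, 1100.0),
}

per_game = {name: normalized_score(*scores) for name, scores in raw_scores.items()}
summary = {
    "mean": mean(list(per_game.values())),
    "median": median(list(per_game.values())),
}
```

Reporting both statistics matters because a handful of games with very large normalized scores can inflate the mean, while the median reflects the typical game.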

The researchers next evaluated a version of MuZero, called MuZero Reanalyze, that was optimized for greater sample efficiency, which they applied to 57 Atari games using 200 million to 20 billion frames of experience per game. MuZero Reanalyze could repeatedly use its learned model to improve its planning rather than collecting new data from the environment.

The team reports that MuZero Reanalyze managed a 731% median normalized score, compared with 192%, 231%, and 431% for the previous state-of-the-art model-free approaches IMPALA, Rainbow, and LASER, respectively. The team also notes that MuZero Reanalyze required substantially less training time: 12 hours versus Rainbow’s 10 days.

“In terms of resources, if you care about how much you have to interact with the environment, the model that MuZero learns actually allows us to learn a task much more efficiently,” DeepMind staff software engineer Julian Schrittwieser told VentureBeat. “Basically, the idea is that you can look back on past experience and then use the model to replan — reanalyze — this data, so that MuZero can repeatedly learn more and more from the same data. This is very important if you want to tackle real-world problems, as they often have very little data.”
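The replanning idea Schrittwieser describes can be illustrated with a toy replay buffer. The sketch below is a deliberate simplification: it refreshes value targets for stored trajectories with whatever value estimate is current, whereas the approach described in the article reruns planning with the learned model. The buffer layout, the `value_fn` callable, and the parameter defaults are hypothetical.

```python
def reanalyze(buffer, value_fn, discount=0.997, n_steps=5):
    """Refresh value targets for stored trajectories using the current estimates.

    buffer: list of trajectories, each a dict with "observations" (T + 1 items)
            and "rewards" (T items).
    value_fn: current value estimate for an observation (hypothetical callable).
    No new environment steps are taken; only the training targets change.
    """
    for trajectory in buffer:
        observations = trajectory["observations"]
        rewards = trajectory["rewards"]
        targets = []
        for t in range(len(rewards)):
            end = min(t + n_steps, len(rewards))
            # Discounted rewards actually observed between step t and step end.
            target = sum(discount ** (k - t) * rewards[k] for k in range(t, end))
            # Bootstrap from the current estimate of the state reached at step end.
            target += discount ** (end - t) * value_fn(observations[end])
            targets.append(target)
        trajectory["value_targets"] = targets
    return buffer
```

Calling `reanalyze` again after the value estimates improve yields new targets from the same stored experience, which is the property that makes the approach attractive when data is scarce.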

Lastly, in an attempt to better understand the role the model played in MuZero, the coauthors focused on Go and Ms. Pac-Man. They compared search in AlphaZero using a perfect model to the performance of search in MuZero using a learned model, and they found that MuZero matched the performance of the perfect model even when undertaking searches larger than those for which it was trained. In fact, with only six or seven simulations per move, too few to try each of the eight possible actions in Ms. Pac-Man even once, MuZero learned an effective policy and “improved rapidly.”

With Go, the results showed that MuZero’s playing strength increased by more than 1,000 Elo points (Elo being a measure of a player’s relative skill) as the researchers increased the time per move from one-tenth of a second to 50 seconds. (That’s roughly the difference between a strong amateur player and a top-rated professional player.) This suggests MuZero can generalize between actions and situations and doesn’t need to exhaustively search all possibilities to learn effectively.
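To put a gap of that size in perspective, the standard Elo formula converts a rating difference into an expected score for the stronger side. The short sketch below applies it; the function name is illustrative, and the formula is the general Elo definition rather than anything specific to DeepMind’s evaluation.

```python
def elo_expected_score(rating_gap):
    """Expected score (win = 1, draw = 0.5, loss = 0) for the higher-rated side."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Expected score for the stronger side at a 1,000-point gap.
gap_1000 = elo_expected_score(1000)
```

At a 1,000-point gap the expected score comes out at roughly 0.997, meaning the stronger side would be expected to take nearly every game.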

Real-world applications

Over the next few months, DeepMind plans to focus on identifying potential commercial applications for MuZero and model-based reinforcement learning systems like it. One might be internet traffic, which Silver notes is dominated by video streaming. (Videos accounted for an estimated 80% of all consumer bandwidth in 2019.) Clips are compressed through the use of codecs, which encode and decode digital data streams, and these codecs have parameters that must be tuned for different categories of video.

“If you can compress videos and make them smaller, you can have massive savings on all internet traffic,” Silver said. “This is something which we can apply our learning algorithms to and have a lot of the characteristics of the real world because you never know what you’re going to see next in the video. This kind of project is just an example where we’re starting to see quite promising initial results.”

Beyond this, DeepMind anticipates MuZero solving problems in real-world scenarios where the characteristics of a particular environment are unknown, like in personalized medicine and search and rescue. That’s not to imply MuZero is without limitations — owing to complexity, it can’t model imperfect information situations where decisions have to be made simultaneously and multiple people must balance possible outcomes when making a decision, like in the board game Diplomacy or the card game Hanabi. (Coincidentally, DeepMind is developing a separate family of algorithms to tackle Diplomacy and setups similar to it.) But Silver believes that even in its current state, MuZero represents a major advancement in the field of AI and machine learning, particularly with regard to reinforcement learning.

“What we’ve done is taken algorithms that were designed to work with perfect knowledge of the rules of the game, taken away the knowledge of the rules, and set this algorithm out there to learn by trial and error so it’s playing a game and experiencing whether it wins or loses,” Silver said. “Despite taking this knowledge away, MuZero learns to achieve superhuman performance as quickly as the original versions of the algorithm that were provided with this perfect knowledge. To me, from a scientific perspective, that’s a real step change — something that allows us to apply these things to a much broader class of real-world problems than we’ve been able to do in the past.”
