Researchers have proposed a method for allowing reinforcement learning algorithms to accumulate knowledge while erring on the side of caution. The team, which hails from the University of Toronto, the Vector Institute, and the University of California, Berkeley, claims the approach achieves task performance competitive with prior methods while incurring lower rates of catastrophic failure during training.
Reinforcement learning is a powerful framework because it allows agents to learn how to make decisions automatically through trial and error. In the real world, the cost of trials — and errors — can be fairly high. For example, a drone that attempts to fly at high speeds might crash and be decommissioned due to physical damage. But learning complex skills without any failures at all is likely impossible.
A number of researchers have tackled the problem of safe exploration, including teams from DeepMind and OpenAI, but most of those approaches rely on strong assumptions, such as prior knowledge of which states are unsafe, or produce policies that are safe only after training is complete. By contrast, the newly proposed safe reinforcement learning algorithm assumes access only to a sparse indicator of catastrophic failure, and it trains a conservative safety critic: a learned estimator that deliberately overestimates the probability of catastrophic failure, so the agent errs toward caution when its knowledge is uncertain.
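To make the idea concrete, here is a minimal sketch of what such a conservative safety critic could look like, written in PyTorch. It is not the authors’ published code: the network architecture, the loss weighting `alpha`, the risk threshold `eps`, and the `policy.sample` interface are all illustrative assumptions. The critic estimates the probability of catastrophic failure for a state-action pair, and a regularizer (loosely analogous to conservative Q-learning) pushes that estimate up on actions proposed by the current policy, so unfamiliar actions look riskier than the data alone would suggest.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyCritic(nn.Module):
    """Estimates the probability of catastrophic failure for (state, action)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # Sigmoid keeps the output in [0, 1], i.e. a failure probability.
        return torch.sigmoid(self.net(torch.cat([state, action], dim=-1))).squeeze(-1)


def conservative_loss(critic, states, actions, failed, policy_actions, alpha=1.0):
    """Training loss for one batch of transitions.

    `failed` is the sparse 0/1 catastrophe indicator -- the only safety
    signal the approach is described as assuming access to.
    """
    pred = critic(states, actions)
    # Fit the observed failure indicator on the actions actually taken.
    fit = F.binary_cross_entropy(pred, failed.float())
    # Conservative term: minimizing it raises the predicted risk of the
    # policy's proposed actions relative to actions seen in the data, so
    # the critic overestimates failure probability off-distribution.
    overestimate = pred.mean() - critic(states, policy_actions).mean()
    return fit + alpha * overestimate


def safe_action(critic, state, policy, eps=0.05, tries=20):
    """Sample candidate actions, rejecting any whose estimated failure
    probability exceeds `eps`; fall back to the least risky candidate.
    (`policy.sample` is an assumed interface, not a real library call.)"""
    candidates = policy.sample(state, n=tries)   # shape: (tries, action_dim)
    states = state.unsqueeze(0).expand(tries, -1)
    risks = critic(states, candidates)           # shape: (tries,)
    ok = risks <= eps
    return candidates[ok][0] if ok.any() else candidates[risks.argmin()]
```

At decision time, the agent can then sample several candidate actions and reject any whose estimated failure probability exceeds the threshold, which is one simple way to turn the pessimistic estimates into cautious behavior.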
The researchers tested their approach across several simulated environments built on an open source platform. One environment was “point agent and car navigation avoiding traps,” in which an agent guided by the safe reinforcement learning algorithm had to navigate a maze while avoiding a trap. Another was “Panda push without toppling,” in which a robot arm had to push a vertically placed block to a location across a table without letting it topple over. In “Panda push within boundary,” the arm had to push a block across the table without pushing it outside a rectangular boundary. And in “Laikago walk without falling,” a quadruped robot had to walk without falling over.
In a paper describing the work, the researchers wrote that they “demonstrated that the probability of failures is bounded throughout training and provided convergence results showing how ensuring safety does not severely bottleneck task performance.” “We empirically validated our theoretical results and showed that we achieve high task performance while incurring low accidents during training,” they continued. “Although our approach bounds the probability of failure and is general in the sense that it does not assume access [to] any user-specified constraint function, in situations where the task is difficult to solve, for example due to stability concerns of the agent, our approach will fail without additional assumptions. In such situations, some interesting future work directions would be to develop a curriculum of tasks to start with simple tasks, where safety is easier to achieve, and gradually move toward more difficult tasks, such that the learned knowledge from previous tasks is not forgotten.”