One of the main problems with the real-world deployment of reinforcement learning is accidents: unintended or harmful behavior that may result from poor design.
Amodei et al. [2016] give five challenges for the real-world deployment of reinforcement learning agents.
Avoiding negative side-effects. A side-effect is something unintended that happens as a consequence of an agent’s action. For example, a delivery robot should not spill coffee on the carpet or run over cats, but running over ants may be unavoidable when delivering outside. If the robot is told only to be as quick as possible, the optimal course of action – based on its specified rewards – may include some of these side-effects. There are too many possible ways the robot could go wrong to specify them all for each task.
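To make this concrete, the following sketch (not from Amodei et al.; all names and values are hypothetical) compares a reward that only values fast delivery with one that also penalizes side-effects. Under the first reward, a fast route that spills coffee is optimal; under the second, the careful route wins.

def reward_speed_only(delivered, time_taken, side_effects):
    """Reward that only values fast delivery; side-effects are invisible to it."""
    return (10 if delivered else 0) - time_taken

def reward_with_impact_penalty(delivered, time_taken, side_effects, penalty=5):
    """Same reward, plus a crude penalty for each unintended side-effect."""
    return (10 if delivered else 0) - time_taken - penalty * len(side_effects)

# A fast route that spills coffee beats a careful route under the first reward,
# but not under the second.
fast = dict(delivered=True, time_taken=3, side_effects=["spilled coffee"])
careful = dict(delivered=True, time_taken=5, side_effects=[])
assert reward_speed_only(**fast) > reward_speed_only(**careful)
assert reward_with_impact_penalty(**fast) < reward_with_impact_penalty(**careful)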
Avoiding reward hacking. Reward hacking occurs when an agent optimizes a reward by doing something that was not the intention of whoever specified the reward. The delivery robot could put all of the mail in the garbage so that there is no undelivered mail, or, if its goal is to have no unfulfilled tasks, it could hide so that it cannot be given new tasks. A cleaning robot that is rewarded for cleaning up messes might get the most reward by creating messes it can then clean up.
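A minimal sketch of reward hacking, with a hypothetical reward of "no undelivered mail left on the robot": delivering the mail and dumping it in the garbage receive exactly the same reward, so the specified reward cannot distinguish the intended behavior from the hack. The function and action names below are invented for illustration.

def reward_no_undelivered_mail(mail_on_robot):
    """Intended to encourage delivery; actually rewards any way of emptying the robot."""
    return -len(mail_on_robot)

def act_deliver(mail):
    return []            # mail reaches the recipients

def act_dump_in_garbage(mail):
    return []            # mail is gone, but not delivered

mail = ["letter1", "letter2"]
# Both behaviors receive the same (maximal) reward under the specified reward.
assert reward_no_undelivered_mail(act_deliver(mail)) == \
       reward_no_undelivered_mail(act_dump_in_garbage(mail))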
Scalable oversight. Scalable oversight is needed when a human gives rewards to ensure an agent is doing the right thing, but cannot give unlimited feedback. When the agent does something unintended, as in the examples above, it needs to be corrected. However, such oversight cannot be continual; a human does not want to continually evaluate a deployed robot.
Safe exploration. Even though the optimal way to deliver coffee might be safe, while exploring, the agent might try dangerous actions, such as hitting people or throwing coffee, only to learn from the negative reward afterwards. The agent's actions need to be safe even while it is exploring; it should be able to explore, but only within a safety-constrained environment.
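One simple way to constrain exploration, sketched below with hypothetical action names and Q-values, is to restrict an epsilon-greedy learner to a whitelist of actions known to be safe. This only illustrates the idea; it is not a guarantee of safety, since the whitelist itself must be specified correctly.

import random

def safe_epsilon_greedy(q, state, actions, safe_actions, epsilon=0.1):
    """Epsilon-greedy action selection, with exploration confined to safe_actions."""
    candidates = [a for a in actions if a in safe_actions]
    if random.random() < epsilon:
        return random.choice(candidates)                          # explore, but only safely
    return max(candidates, key=lambda a: q.get((state, a), 0))    # exploit

q = {("hallway", "go_slow"): 2.0, ("hallway", "go_fast"): 3.0}
actions = ["go_slow", "go_fast", "throw_coffee"]
safe = {"go_slow", "go_fast"}                                      # "throw_coffee" is never tried
print(safe_epsilon_greedy(q, "hallway", actions, safe))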
Robustness to distributional shift. The environment during deployment will undoubtedly differ from the environment the agent was trained in. Environments change over time, and agents should be able to take such changes in stride. The probability distribution over outcomes can shift, and an agent should be robust to these changes or be able to adapt to them quickly.
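A crude illustration of monitoring for distributional shift, with hypothetical counts and threshold: flag deployment states that were rare or unseen during training, so the agent can fall back to cautious behavior or ask for help rather than acting confidently on a distribution it was not trained on.

from collections import Counter

train_states = ["dry_floor"] * 95 + ["wet_floor"] * 5
train_counts = Counter(train_states)

def is_unfamiliar(state, counts, total, min_freq=0.01):
    """Flag states whose training frequency is below min_freq."""
    return counts.get(state, 0) / total < min_freq

total = sum(train_counts.values())
print(is_unfamiliar("wet_floor", train_counts, total))   # False: seen in training
print(is_unfamiliar("icy_floor", train_counts, total))   # True: never seen in training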
Amodei et al. [2016] provide references to other researchers who have considered these issues.