by Jos Martin, MathWorks
30 January 2020
You might be familiar with reinforcement learning as the AI responsible for beating human players in board games like Go and chess. But for businesses, reinforcement learning has the potential to do so much more.
Many engineers, scientists, and researchers want to take advantage of this new and growing technology, but simply don’t know where to start.
At its foundation, reinforcement learning is a type of machine learning that has the potential to solve tough decision-making problems. But to understand how it affects us, we need to answer three key questions:
- What is reinforcement learning, and why should I consider it when solving my problem?
- When is reinforcement learning the right approach?
- What is the workflow I should follow to solve my reinforcement learning problem?
What is reinforcement learning all about?
In reinforcement learning, a computer learns to perform a task through repeated trial-and-error interactions with a dynamic environment. With this approach, the computer learns to make a sequence of decisions that maximizes a reward metric for the task, without human intervention and without being explicitly programmed to achieve it.
Consider the task of parking a vehicle using an automated driving system. The goal is for the vehicle computer (agent) to park the vehicle in the correct parking spot with the right orientation. The environment is everything outside the agent, including the dynamics of the vehicle, other nearby vehicles, weather conditions, and so on. During training, the agent uses readings from sensors such as cameras, GPS, and lidar (observations) to generate steering, braking, and acceleration commands (actions). To learn how to generate the correct actions from the observations (policy tuning), the agent repeatedly tries to park the vehicle using a trial-and-error process. A reward signal can be provided to evaluate the effectiveness of a trial and to guide the learning process.
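To make these terms concrete, here is a minimal sketch of the agent-environment loop in Python. Everything in it is an illustrative assumption rather than part of a real automated driving system: the `ParkingEnv` class, the one-dimensional lot with six cells, the target spot at cell 4, and the reward values are all hypothetical. The policy is deliberately random, so the sketch shows only how observations, actions, and rewards flow, not how learning happens.

```python
import random


class ParkingEnv:
    """Hypothetical 1-D stand-in for the parking task: cells 0..5, target spot at cell 4."""

    N_CELLS = 6
    SPOT = 4

    def reset(self):
        # Start every trial at the entrance of the lot.
        self.position = 0
        return self.position  # the observation is simply the current cell

    def step(self, action):
        # action: -1 (reverse), 0 (stay), +1 (forward); the car cannot leave the lot.
        self.position = min(max(self.position + action, 0), self.N_CELLS - 1)
        parked = self.position == self.SPOT
        # Assumed reward: a bonus for parking, a small penalty for every other step.
        reward = 10.0 if parked else -1.0
        return self.position, reward, parked


def run_episode(env, policy, max_steps=50):
    """Run one trial and return the total reward the agent collected."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)                   # agent: observation -> action
        observation, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(observation):
    # An untrained agent: ignores the observation and picks any action.
    return random.choice([-1, 0, 1])


random.seed(0)
print(run_episode(ParkingEnv(), random_policy))
```

Training replaces `random_policy` with a policy that is adjusted after each trial so that the total reward tends to increase.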
Is reinforcement learning the right approach for me?
Despite its benefits, reinforcement learning may not always be the right approach.
For one thing, reinforcement learning is not sample efficient, meaning that a lot of training is needed to achieve acceptable performance. AlphaGo, for instance, was trained around the clock over a few days by playing millions of games, amassing thousands of years of human knowledge. Even for relatively simple applications, training can take anywhere from minutes to hours or days. Setting up the problem correctly can also be a headache: numerous design decisions need to be made, and it may take several rounds of changes to get them right. These decisions include choosing a suitable architecture for the neural networks, tuning hyperparameters, and shaping the reward signal.
What’s more, a trained deep neural network policy is often treated as a “black box”: its internal structure, often consisting of millions of parameters, is so complicated that understanding it is nearly impossible. Explaining and evaluating the decisions made by the network is therefore a further challenge, and it makes formal performance guarantees for neural network policies very difficult to establish.
Therefore, if an engineer is working on a time- or safety-critical project, this form of machine learning may not be the best thing to try first.
What does the reinforcement learning workflow look like?
The recommended steps in the reinforcement learning workflow are as follows:
- Generate the environment: Define the environment within which the agent operates, including the interface between agent and environment. The environment can be either a simulation model or a real physical system.
- Define the reward: Specify the reward signal that the agent uses to measure its performance against the task goals, and how that signal is calculated from the environment.
- Create the agent: Choose a way to represent the policy, for example by using neural networks or lookup tables, and select the training algorithm most appropriate for the task. The policy and the training algorithm together constitute the agent.
- Train and validate the agent: Set up training options (e.g., stopping criteria) and train the agent to tune the policy. Validate the trained policy after the training is complete. The time needed for training can vary from minutes to days depending on the application. For complex applications, parallelizing training across multiple CPUs, GPUs, and computer clusters can speed up the process.
- Deploy the policy: Deploy the trained policy representation using, for example, generated C/C++ or CUDA code. There is no need to consider agents and training algorithms at this stage because the policy is a standalone decision-making system.
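The five steps above can be sketched end to end on a toy problem. The Python example below is only an illustration under stated assumptions, not a recommended implementation: the one-dimensional parking lot, the reward values, the hyperparameters, and all function names are hypothetical, and the policy is a lookup table trained with tabular Q-learning, chosen here because it fits in a few lines. A real application would more likely use a simulation model and a neural network policy.

```python
import random

# Step 1, environment (hypothetical): a 1-D lot with cells 0..5, target spot at cell 4.
N_CELLS, SPOT = 6, 4
ACTIONS = (-1, 0, 1)  # reverse, stay, forward


def step(position, action):
    """Move the car, keeping it inside the lot; report whether it is parked."""
    new_position = min(max(position + action, 0), N_CELLS - 1)
    return new_position, new_position == SPOT


# Step 2, reward (assumed values): bonus for parking, small penalty otherwise.
def reward_fn(parked):
    return 10.0 if parked else -1.0


def greedy_index(q_row):
    """Index of the highest-valued action; ties go to the first one."""
    return max(range(len(ACTIONS)), key=lambda i: q_row[i])


# Steps 3 and 4, agent and training: lookup-table policy tuned by Q-learning.
def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_CELLS)]
    for _ in range(episodes):
        position = 0
        for _ in range(max_steps):  # stopping criterion for a single trial
            # Epsilon-greedy: mostly exploit the table, sometimes explore.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_index(q[position])
            new_position, parked = step(position, ACTIONS[a])
            # Q-learning target: reward now plus discounted best value later.
            future = 0.0 if parked else gamma * max(q[new_position])
            target = reward_fn(parked) + future
            q[position][a] += alpha * (target - q[position][a])
            position = new_position
            if parked:
                break
    return q


# Step 5, deployment: freeze the table into a standalone decision function.
def make_policy(q):
    table = [ACTIONS[greedy_index(row)] for row in q]
    return lambda position: table[position]


# Step 4, validation: count the moves the trained policy needs to park.
def validate(policy, start=0, max_steps=20):
    position = start
    for steps_taken in range(1, max_steps + 1):
        position, parked = step(position, policy(position))
        if parked:
            return steps_taken
    return None  # the policy failed to park within the step limit


q_table = train()
policy = make_policy(q_table)
print("moves needed to park from cell 0:", validate(policy))
```

Note that `make_policy` returns a plain function with no reference to the training code, which mirrors the point in the last step: once trained, the policy stands alone.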
Training an agent using reinforcement learning is an iterative process. Decisions and results in later stages may require you to return to an earlier stage in the learning workflow.
Irrespective of the final choice of tool, before you decide to implement reinforcement learning, it is crucial to ask whether, given the time and resources you have for this project, reinforcement learning is the best approach.
Jos Martin is senior engineering manager at MathWorks, makers of MATLAB and Simulink software.