Reinforcement learning control of robot manipulator

Since the establishment of robotics in industrial applications, industrial robot programming involves the repetitive and time-consuming process of manually specifying a fixed trajectory, resulting in machine idle time in production and the necessity of completely reprogramming the robot for different tasks. The increasing number of robotics applications in unstructured environments requires not only intelligent but also reactive controllers due to the unpredictability of the environment and safety measures, respectively. This paper presents a comparative analysis of two classes of Reinforcement Learning algorithms, value iteration (Q-Learning/DQN) and policy iteration (REINFORCE), applied to the discretized task of positioning a robotic manipulator in an obstacle-filled simulated environment, with no previous knowledge of the obstacles’ positions or of the robot arm dynamics. The agent’s performance and algorithm convergence are analyzed under different reward functions and on four increasingly complex test projects: 1-Degree of Freedom (DOF) robot, 2-DOF robot, Kuka KR16 Industrial robot, Kuka KR16 Industrial robot with random setpoint/obstacle placement. The DQN algorithm presented significantly better performance and reduced training time across all test projects, and the third reward function generated better agents for both algorithms.


Introduction
The diversity of modern industrial robotics applications requires the emergence of robots with different degrees of autonomy, appropriate for executing different tasks, such as welding, machining, assembly and cargo handling. The development of more sophisticated sensors, along with the increasing computational capacity of controllers and advances in the fields of computer vision and artificial intelligence, has shifted the field of robotic manipulators: repetitive, fixed pre-programmed routines have given way to flexible, more reactive controllers, capable of dynamically identifying the orientation of workpieces or learning optimal routines directly from data (Rosen, ). This trend is not limited to robotics. Recent developments in Artificial Intelligence, namely Reinforcement Learning, have been dedicated to training robust models for a wide variety of applications, from economics and finance (Charpentier et al., ) to healthcare systems (Coronato et al., ). Reinforcement Learning is an increasingly popular field of AI in which an intelligent agent is trained to perform a specific task while maximizing a reward signal (Sutton and Barto, ). This work aims to obtain the optimal reward function formulation and algorithm choice for the task of positioning a simulated KUKA KR16 industrial robot while avoiding both known and unknown obstacles. The agents are trained through two different reinforcement learning algorithms over successive interactions with an obstacle-filled simulated environment. For training, it is only necessary to provide the initial specification of a reward function, which represents the quality of actions taken by the agent and guides its exploration. After training, the agent is capable of positioning the robot's end effector at generic positions while avoiding obstacle collisions, based solely on sensor data from its current pose.
The main contributions of this work are the development of a Reinforcement Learning (RL) framework for robotics applications in MATLAB, including training and visualization modules, and a comparative analysis of standard RL algorithms: episodic REINFORCE and DQN. Different reward functions are tested and the agent's performance is evaluated. The entire project is open source, and all code can be found in the project's GitHub repository.

State of the Art
The recent development of Reinforcement Learning means that its practical applications are currently mostly restricted to simulation environments for testing and performance validation, such as OpenAI Gym (Brockman et al., ). Several robotics-related tasks in OpenAI Gym utilize a physics engine for simulation and collision detection known as MuJoCo (Todorov et al., ). Exploring the state and action spaces inherently requires large amounts of data to be processed, and training directly in the real world may lead to accidents. Simulation-based training solves both issues by providing a risk-free environment in which the control agent can acquire experience faster.
Several authors have tried to train RL agents in simulated environments and transfer the resulting model directly to real-world applications. James and Johns ( ) were partially successful in the simulation-based training and subsequent model transfer of a DQN agent controlling a seven-DOF robot in a cube locating and lifting task. The work environment was structured to maximize similarity with the simulation environment in order to enable model transfer. The resulting RL agent was able to correctly locate the cube when applied directly to the real-world robot, but subtle differences in the environment prevented it from grabbing and lifting it.
One of the biggest challenges associated with implementing Reinforcement Learning in industrial robotics is low sample efficiency. Most RL algorithms require a large volume of training data before optimal policies can be learned, and the generation of data in real-world settings is often impractical, as it requires long idle times. To work around this problem, hand-crafted initial policies that capture the desired behavior are often used. However, this approach conflicts with the main advantage of RL, i.e., the autonomous learning of various behaviors with minimal human intervention. Gu et al. ( ) present an innovative architecture based on the DDPG (Deep Deterministic Policy Gradient) and NAF (Normalized Advantage Function) algorithms in which multiple robots interact with the environment, gain experience according to their current action policies and send data asynchronously to a server that samples transitions and trains a DQN network. This architecture allows the robots to continue interacting with the environment and collecting state transitions while the DQN parameters are updated, promoting scalability for the inclusion of new robots. The authors validated the proposed architecture by learning the task of opening a door with seven-degree-of-freedom manipulator robots, and the action policy was obtained without previous demonstrations. Chen et al. ( ) used a combination of the Distributed Proximal Policy Optimization (DPPO) and DQN algorithms to solve the similar task of positioning a simulated robot manipulator while avoiding multiple obstacles. The authors showed that the two-step solution of using DPPO to perform obstacle avoidance while a DQN agent performs navigation resulted in better performance than either algorithm individually.
A major issue in path planning tasks for robotic manipulators in unstructured obstacle-filled environments is the blindness of exploration. Common sparse functions that reward an agent's action only when the proposed task is successfully completed, and provide zero information otherwise, can lead to a highly inefficient learning process. In order to solve this, Xie et al. ( ) have developed an azimuth dense reward function that provides feedback to the agent regularly, reducing the number of training epochs and improving learning efficiency.

Problem Definition
A typical Reinforcement Learning framework consists of three interacting modules: environment, interpreter and agent. The environment's current condition is captured by an interpreter, which encodes it at time t as a state s_t and assigns a reward value r_t. The agent, based on the state and reward received from the interpreter, takes an action a_t, which leads to a state transition according to the system dynamics.
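The interaction loop above can be sketched in a few lines. The paper's framework is MATLAB-based; the toy GridEnv below is a hypothetical stand-in for the simulated robot environment, written in Python purely for illustration, with arbitrary reward values.

```python
class GridEnv:
    """Toy 1-D stand-in for the simulated robot environment (illustrative only)."""
    def __init__(self, goal=5):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action in {-1, +1}
        self.state += action
        done = self.state == self.goal
        reward = 1.0 if done else -0.1  # arbitrary terminal bonus / step penalty
        return self.state, reward, done


def run_episode(env, policy, max_steps=50):
    """Interpreter/agent loop: observe s_t, take a_t, collect r_t."""
    s = env.reset()
    total = 0.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = env.step(a)
        total += r
        if done:
            break
    return total
```

The agent here is just a `policy` callable mapping states to actions; any of the algorithms discussed below would plug into this loop.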
In order to determine the best algorithm, reward function and hyperparameters for this framework applied to manipulator control, simplified versions of the problem were studied under two main classes of algorithms: iteration over the policy function π_θ and iteration over the value function Q_θ. Fig. shows the four test projects considered and the two algorithms implemented for each: Episodic REINFORCE and Q-Learning/DQN.
In the first two simplified projects (1- and 2-DOF robots), the reduced dimension of the state space S and action space A allowed the use of a Q-Learning algorithm known as Q-tables, in which every possible state-action combination is directly mapped to Q(s, a), a function of state s and action a given by a table. However, due to the increased dimensions of the last two projects, the more sophisticated DQN algorithm was implemented, in which a feedforward neural network approximates the state-action value function Q(s, a).
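The tabular variant can be sketched as a generic Q-learning iteration; the state/action dimensions and hyperparameters below are illustrative, not the paper's, and the sketch is in Python although the project itself is in MATLAB.

```python
import numpy as np

n_states, n_actions = 10, 2   # illustrative sizes
alpha, gamma = 0.5, 0.9       # illustrative learning rate and discount factor

Q = np.zeros((n_states, n_actions))

def q_update(Q, s, a, r, s_next, terminal):
    """One tabular Q-learning step toward the Bellman target."""
    target = r if terminal else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

Q = q_update(Q, s=0, a=1, r=1.0, s_next=1, terminal=False)
```

In DQN the table `Q` is replaced by a neural network, but the update target has exactly the same form.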
The software chosen was MATLAB because of its support libraries and functions for robotics simulations. This saves programming time for both the visualization and the dynamics computation, while also providing all the necessary reinforcement learning and neural network tools. While other programming languages such as Python offer significantly more support for machine learning applications through libraries like TensorFlow and Keras, the available Python libraries for robot kinematics present limited functionality compared to MATLAB.

Implemented Algorithms
Reinforcement Learning algorithms can be divided into two major classes: policy-based and value-based. The former represents the agent's policy directly and updates it according to the reward obtained by taking different actions in different states. The latter instead learns a state-action value function Q(s, a), from which the agent's actions are derived.
In this work, a baseline Deep Reinforcement Learning algorithm of each class was implemented: the policy-based REINFORCE and the value-based DQN. However, the developed framework allows other RL algorithms compatible with discrete action spaces to be implemented, such as A3C (Mnih et al., ) and PPO (Schulman et al., ).

First Algorithm: Episodic REINFORCE
The first implemented algorithm is the classic action policy iteration algorithm, episodic REINFORCE, proposed by Williams ( ) and adapted to the manipulator positioning problem according to the pseudocode below. REINFORCE consists of parameterizing the action policy function π(s) as π_θ(s) using any function approximation method, such as neural networks or high-degree polynomials, and training by successive updates of the parameters θ in order to maximize the performance function J(π_θ), which represents the quality of policy π_θ. Let τ = {s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} be a trajectory generated by a generic policy π_θ; the performance function J(π_θ) can be defined as the expected value of discounted rewards over the trajectory (Eq. ( )).
where r(τ) = Σ_{t=0}^{T} γ^t r_t = V_π(s_0) is equivalent to the value of the initial state V_π(s_0) under policy π_θ. Since knowledge of the environment and reward is gathered through environmental interaction, the gradient of the performance function ∇J(θ) must be approximated from a sufficient number N of sampled trajectories (Eq. ( )).
where γ is known as the discount factor and the returns G_t = Σ_{k=0}^{T-t-1} γ^k r_{t+k+1} = r_{t+1} + γ r_{t+2} + ... + γ^{T-t-1} r_T are defined as the sum of discounted rewards from instant t onward. Historically used to bound the sum of expected rewards in infinite-horizon models, the discount factor γ can be interpreted as an interest rate which prioritizes actions with higher immediate rewards while also taking future rewards into account (Kaelbling et al., ).
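The returns G_t can be computed from a recorded reward sequence with a single backward pass, as in this Python/NumPy sketch (illustrative; here `rewards[t]` stores r_{t+1} of the trajectory notation):

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """G_t = sum_k gamma^k * r_{t+k+1}, computed backwards in one pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # G_t = r_{t+1} + gamma * G_{t+1}
        G[t] = running
    return G
```

For a sparse trajectory such as `[0.0, 0.0, 1.0]` with γ = 0.9, the returns decay geometrically the further a step is from the rewarded transition.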

Algorithm: Episodic REINFORCE
· Initialize robot, setpoint, obstacle, initial state s_0 and action space A;
· Initialize hyperparameters (bonus and penalties, network size, number of timesteps, trajectories and epochs, discount factor γ and learning rate α);
· Initialize data structure to store epochs;
· Initialize parameterized action policy π_θ randomly;
for each epoch do
· Generate N trajectories {τ_n}_{n=1}^{N} from the current action policy;
· Apply the gradient ascent method on J(θ) to obtain the updated policy π_{θ_{ep+1}};
end

Second Algorithm: DQN
DQN (Deep Q-Network) can be seen as a generalization of the simple Q-Learning algorithm known as Q-tables. Rather than directly mapping each state-action pair to a value Q(s, a) and performing successive iterations on the resulting table, DQN parameterizes the state-action value function Q_θ(s, a) as a weighted neural network (Mnih et al., ). The network is initialized with random weights θ, which are updated successively as state transitions (s, a, r, s') are observed by the agent. The DQN network is trained to satisfy the Bellman equation (Eq. ( ) - (Bellman, )):

Q(s, a) = r(s, a) + γ max_{a' ∈ A} Q(s', a'),

which associates the value of a state-action pair with the maximum value of subsequent state-action pairs. At each training step, the expected output q = Q(s, a) and the target y are stored, and gradient descent is applied to minimize the cost function between them. DQN can thus be seen as a supervised learning algorithm in which the target is non-stationary, as it depends on the Q_θ(s, a) function itself, except at terminal states, where the target is simply the reward r(s_t, a_t). This is one of the major difficulties associated with the method, and it makes convergence depend on sufficient exploration of different actions across the entire state space. Given sufficient exploration, however, the algorithm is proven to converge to the optimal state-action value function Q*(s, a).
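The non-stationary target and the cost it feeds can be sketched as follows (Python/NumPy for illustration only; the batch shapes and discount value are assumptions, and the real network forward pass is abstracted into the `q_next` array):

```python
import numpy as np

def dqn_targets(q_next, rewards, terminal, gamma=0.99):
    """Bellman targets y = r + gamma * max_a' Q(s', a'); y = r at terminal states.
    q_next: (batch, n_actions) array of Q-values for the successor states."""
    y = rewards + gamma * q_next.max(axis=1)
    return np.where(terminal, rewards, y)

def mse_loss(q_pred, y):
    """Cost minimized by gradient descent on the network weights theta."""
    return np.mean((q_pred - y) ** 2)
```

Note that `y` is recomputed every step from the current (or a lagged) network, which is exactly what makes the regression target non-stationary.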
In order to improve numerical conditioning and allow for faster convergence, a technique known as Prioritized Experience Replay (Schaul et al., ) is implemented, in which state transitions are stored in an experience buffer and sampled randomly, while terminal transitions are always sampled. State transition tuples are defined as (s, a, r, s', bool_term), where s is the system's current state, a is the action taken by the agent, r is the reward obtained and s' is the subsequent state; in addition, the Boolean variable bool_term indicates whether the state is terminal.
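A minimal sketch of such a buffer, simplified so that "prioritization" reduces to always including terminal transitions in each sample (a deliberate simplification of Schaul et al.'s scheme, in Python for illustration):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience buffer for (s, a, r, s_next, bool_term) tuples; terminal
    transitions are always included in a sample (simplified sketch)."""
    def __init__(self, capacity=10_000):
        self.regular = deque(maxlen=capacity)
        self.terminal = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, is_terminal):
        buf = self.terminal if is_terminal else self.regular
        buf.append((s, a, r, s_next, is_terminal))

    def sample(self, batch_size):
        # All terminal transitions first, then a uniform draw of regular ones.
        n_regular = max(batch_size - len(self.terminal), 0)
        picked = random.sample(list(self.regular), min(n_regular, len(self.regular)))
        return list(self.terminal) + picked
```

Sampling from the buffer decorrelates consecutive transitions, which is the property that stabilizes the DQN regression.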

Reward Functions
Reward function engineering is critical in reinforcement learning applications. The reward function determines the quality of actions taken by the agent and influences not only the policies it is capable of learning but also the algorithm's convergence. As a result, there is an increasing effort in recent research to optimize reward functions for different tasks.
Over the test projects, three different reward functions are implemented. In the first two test projects, agents trained with one reward function showed significantly better performance than the others. As a result, the Kuka projects focused on the implementation of this reward function. The following sections detail their mathematical implementations, key insights and intuition.

First Reward Function: Absolute Distances
The first reward function considered (Eq. ( )) is inspired by the potential field method for path planning of mobile robots. The reward depends on the Euclidean distances between the end effector, the obstacle and the goal, similarly to the reward function used by Sangiovanni et al. ( ), who also apply a distance-based reward to the task of training a robotic manipulator for positioning while avoiding obstacles. In the equation, the end effector position is evaluated after action a is taken, and p_sp and p_obs are the setpoint and obstacle positions, respectively. The remaining terms represent bonuses or penalties given to the agent based on the desired behavior: B_goal is a bonus given when the desired object is reached, P_collision is a penalty given when either the table or the red obstacle is hit, and P_joint_boundary is a penalty given when one of the robot's joint limits is reached.
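A sketch of such a distance-based reward follows. The weights and bonus/penalty magnitudes are assumptions for illustration (the section does not list the paper's exact values), and the sketch is in Python/NumPy although the project is in MATLAB.

```python
import numpy as np

def reward_1(p_ef, p_sp, p_obs, reached=False, collided=False, joint_limit=False,
             w_sp=1.0, w_obs=0.5, B_goal=100.0, P_collision=-100.0, P_joint=-50.0):
    """Potential-field style reward: attracted to the goal, repelled by the obstacle.
    Weights w_sp, w_obs and the discrete bonuses/penalties are hypothetical values."""
    r = -w_sp * np.linalg.norm(p_ef - p_sp) + w_obs * np.linalg.norm(p_ef - p_obs)
    if reached:
        r += B_goal
    if collided:
        r += P_collision
    if joint_limit:
        r += P_joint
    return r
```

The continuous terms give dense feedback at every step, while the discrete terms fire only at the corresponding events.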

Second Reward Function: Discrete under Approximation or Distancing
In order to correct problems observed with the first function, such as its high magnitude and non-zero average value over all possible actions at a given state, a second function was tested. The second reward function (Eq. ( )) depends on the relative approximation or distancing between the end effector, the obstacle and the goal. The discrete penalties and bonuses given in case of collision are the same as defined for r_1(s, a) in Eq. ( ).
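The approach/distancing idea can be sketched as follows: the agent is rewarded for reducing its distance to the goal and penalized for reducing its distance to the obstacle. The discrete ±1 increments are assumptions for illustration, and the discrete bonuses mirror those of the first function (Python/NumPy sketch).

```python
import numpy as np

def reward_2(p_ef, p_ef_next, p_sp, p_obs, reached=False, collided=False,
             B_goal=100.0, P_collision=-100.0):
    """Discrete approach/distancing reward (illustrative +-1 increments)."""
    # +1 when the step reduces the distance to the setpoint, -1 when it increases it.
    r_setpoint = np.sign(np.linalg.norm(p_ef - p_sp) - np.linalg.norm(p_ef_next - p_sp))
    # +1 when the step increases the distance to the obstacle, -1 when it decreases it.
    r_obstacle = np.sign(np.linalg.norm(p_ef_next - p_obs) - np.linalg.norm(p_ef - p_obs))
    r = r_setpoint + r_obstacle
    if reached:
        r += B_goal
    if collided:
        r += P_collision
    return r
```

Because each step contributes a bounded value, the per-action reward magnitude stays small and roughly zero-mean over the action set, addressing the two problems cited above.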

Third Reward Function: Projection of Displacement Vector
Finally, the third reward function (Eq. ( )) is similar to the second, but the terms r_setpoint and r_obstacle are no longer limited to discrete values; instead, they are given by the projections of the displacement vector n_{ef→ef'} onto the directions pointing to the goal, n_{ef→setpoint}, and to the obstacle, n_{ef→obstacle}. The reward is then given by the corresponding projections.
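The projection terms can be sketched directly from the definitions above; combining the goal and obstacle projections by simple subtraction is an assumption of this Python/NumPy illustration, not necessarily the paper's exact composition.

```python
import numpy as np

def reward_3(p_ef, p_ef_next, p_sp, p_obs):
    """Projection-based reward: displacement projected onto goal/obstacle directions."""
    disp = p_ef_next - p_ef                                # n_ef->ef'
    n_sp = (p_sp - p_ef) / np.linalg.norm(p_sp - p_ef)     # unit n_ef->setpoint
    n_obs = (p_obs - p_ef) / np.linalg.norm(p_obs - p_ef)  # unit n_ef->obstacle
    r_setpoint = np.dot(disp, n_sp)     # signed progress toward the goal
    r_obstacle = np.dot(disp, n_obs)    # signed progress toward the obstacle
    return r_setpoint - r_obstacle      # assumed composition of the two terms
```

Unlike the discrete version, these terms grade each step continuously, so a move that is mostly toward the goal but slightly toward the obstacle still earns proportional credit.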
Overall, agents trained with reward function r_3 showed better performance than those trained with the other functions. The REINFORCE agent trained with r_3 presented significantly better performance in terms of convergence time and stability (Fig. a), while the Q-Learning agent showed more frequent drops in its learning curve during epochs in which a collision with the obstacle occurred (Fig. b), possibly due to r_3's prioritization of direct paths to the goal combined with goal-obstacle configurations in which the direct path is obstructed.

Results
From the partial results obtained during training of the 1- and 2-DOF robots, agents trained with the third reward function presented significantly better performance. As a result, only r_3 is implemented in the last two projects, which focus on a comparative analysis between the two classes of algorithms. In this section, the training frameworks and detailed results obtained from applying both classes of algorithms to the Kuka test projects are presented. The results are followed by a brief comparative analysis and summarized at the end of the section.
A side-by-side comparison of both algorithms on increasingly sophisticated applications is valuable because it allows us to focus on the algorithms' foundations, eliminating sources of instability or non-convergence and comparing both in identical settings. Another advantage is the possibility of exploiting a modular, object-oriented program design, since most functions are shared by the various test projects and can be easily adapted to other applications. As shown in Fig. , the test projects are characterized by the simplifications considered: the first two implement 1- and 2-DOF robots, while the last two implement the 6-DOF KUKA KR16 robot, first with a fixed and then with generic configurations of goal and obstacle.

Test Project: KUKA KR16 - Fixed Configuration
After comparing both the reward functions and the algorithms on robots with a reduced number of degrees of freedom, a three-dimensional simulation and visualization environment for the KUKA KR16 robot was implemented in MATLAB by loading a RigidBodyTree object representation of the robot from its URDF file and STL meshes.
In this project, the agent's task is to control the robot's first five degrees of freedom to position its end effector on the fixed goal position (green) while avoiding collision with a known obstacle (red) and the table, which is unknown and only detectable through interaction. Due to the overall better performance observed with the third reward function (Eq. ( )) under both algorithms, the following projects implement only r_3 as the reward function and focus on a comparative analysis between algorithms, as well as on techniques to overcome the dimensionality issue in real robotics applications.
The State Space is now given by Eq. ( ). Similarly to the previous test projects, the State Space is the combination of the possible angular positions of each controllable rotating joint S_i and all possible Cartesian positions for the goal and the obstacle in three-dimensional space R³. Table indicates the Kuka KR16's joint limits and the limits implemented given the table workspace. The Action Space is given by Eq. ( ).

Episodic REINFORCE
The algorithm's generic formulation allowed for a relatively simple adaptation to the new project. The complexity of the policy network π_θ(s, a) was increased in order to allow for the abstraction of more complex policies: a three-layer feedforward network with trainable parameters was implemented. Similarly to previous test projects, the direct approximation and training of the policy function π_θ(a|s) yielded a smooth, monotonically increasing average reward curve (Fig. ). This contrasts with value iteration algorithms, which search for the optimal state-action value function Q_θ(s, a) and derive the optimal policy by taking the action of highest value at each state.
In order to study the agent's increasing preference for optimal actions during training, the probability distribution of actions in A given the initial state was plotted at several epochs (Fig. ).
DQN
Due to the exponential growth of the state-action space as the number of degrees of freedom increases, a Q-table algorithm becomes impracticable as a result of memory and computation limitations. To overcome this dimensionality issue, the state-action value function Q_θ(s, a) was represented as a Multi-Layer Perceptron (MLP). The algorithm's formulation, detailed in Section , consists of applying gradient descent to minimize the mean squared error between the network's current output Q_θ(s, a) and the target r(s, a) + γ max_{a' ∈ A} Q_{θ_ep}(s', a'), where s' denotes the state reached after action a is executed in state s. Table summarizes the algorithm-specific hyperparameters implemented.

To evaluate the agent when a direct path to the goal is blocked by an unknown object, a wall was placed between the end effector's initial position and the goal. Similarly to table collision, wall collision is incorporated into the state transition function and terminates a trajectory if the robot's end effector is sufficiently close to the wall. A negative reward of P_collision is given on collision, and the wall's position can only be learned through environmental interaction. Fig. illustrates the optimal trajectory found by the agent and the corresponding learning curve during training. As expected, an increased number of epochs was necessary for the abstraction of the more complex behavior, but the DQN agent was able to dodge the wall correctly with no algorithmic changes.

Test Project: KUKA KR16 - Generic Configurations
The main advantage of adaptive learning applied to the control of industrial robots is the flexibility in unexpected scenarios, the scalability provided by training over time and the abstraction of complex and often non-intuitive policies with minimal human intervention. In order to investigate both agents' capability of learning an efficient positioning task for objects randomly located in the workspace, both the goal's (green) and the known obstacle's (red) positions were changed randomly during training. A subspace W ⊂ R³ of the robot's work volume, defined by fixed bounds on the x, y and z coordinates, was chosen for the possible goal and obstacle positions. During testing, planar goal-obstacle configurations often did not require the agent to avoid the obstacle, as a direct path to the goal was frequently available. In order to test the RL agent on more challenging scenarios, a three-dimensional volume of possible goal-obstacle configurations was chosen, giving the impression that some objects are floating.

Episodic REINFORCE
A REINFORCE agent with a π_θ(a|s) policy network architecture identical to that of the previous test project was implemented and trained. However, there was no noticeable increase in performance or convergence to the optimal policy, as shown in Fig. c. The agent's performance presented an undesirably high sensitivity to the goal-obstacle configuration, performing the positioning task correctly for specific configurations (Fig. a) and incorrectly for others (Fig. b). Moreover, proximity between the goal and the obstacle resulted in poor performance and an increased chance of table collision.

Comparative Analysis of Algorithms
In order to compare the two classes of algorithms, three main criteria were analyzed: convergence rate over the test projects, execution time and smoothness of the average reward increase. The policy-iteration algorithm REINFORCE outperformed DQN on the last criterion, while the value-iteration algorithm showed better execution time and convergence results.

Conclusion
The application of newly developed methods, especially in consolidated industries where sensitive operations require that safety conditions be met, is subject to extensive research in simulated settings and controlled environments. Reinforcement Learning is a relatively new field with promising results in control and game theory.
The main contributions of this work are the evaluation of two classes of RL algorithms applied to a typical industrial robotics task and the development of a modular simulation architecture that allows for simplicity in further investigation of similar problems. We also present a new reward function formulation based on the projection of the end effector's displacement, which significantly improved the agent's performance on both algorithms.
A comparative analysis of both classes of algorithms on increasingly complex environments also highlighted their main limitations and points to improvements for future research: sensitivity to the reward function, exponential growth of the state-action space, low sample efficiency and, consequently, long training times. Reward function engineering is where human expert analysis is fundamental, and the dimensionality issue is often overcome by algorithmic changes, such as the replacement of Q-tables with DQN, or by modeling the action space A as continuous and having the policy network π_θ(a|s) output a mapping onto continuous actions, as is commonly done in algorithms such as REINFORCE, DDPG and Actor-Critic (Sutton and Barto, ). The non-convergent behavior of the REINFORCE agent in the last project can be explained by limitations common to policy iteration algorithms in general, such as high sensitivity to the learning rate and exploratory variance (Kormushev et al., ). DQN's overall better performance over shorter training periods is possibly due to higher-frequency network updates and the implementation of an experience replay from which state transitions are randomly sampled (Lin, ).