Supervisors: Elmar Rueckert, Prof. Dr. Jan Peters
For controlling robots with many actuators, most stochastic optimal control (SOC) algorithms use approximations of the system dynamics and of the cost function (e.g., linearizations and Taylor expansions). These approximations are typically only locally valid, which can cause instabilities in the greedy policy updates, lead to oscillations, or make the algorithms diverge. To overcome these drawbacks, we add a regularization term to the cost function that punishes large policy update steps in the trajectory optimization procedure. In the first part of this thesis, we applied this concept to the Approximate Inference Control method (AICO). The resulting algorithm guarantees convergence for uninformative initial solutions without complex hand-tuning of learning rates. We evaluated the new algorithm on two simulated robotic platforms. A robot arm with five joints was used to reach multiple targets while keeping the roll angle constant. On the humanoid robot Nao, we showed how complex skills like reaching and balancing can be inferred from desired center-of-gravity or end-effector coordinates.
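The effect of punishing large update steps can be illustrated on a one-dimensional toy problem. The sketch below is not AICO. It is a damped (proximal) Newton iteration on the hypothetical cost f(x) = sqrt(1 + x^2), whose curvature vanishes away from the minimum, so the local quadratic model is only trustworthy near the current iterate. The cost, the helper names, and the weight `alpha` are illustrative assumptions.

```python
import math

def grad(x):
    # First derivative of the toy cost f(x) = sqrt(1 + x^2)
    return x / math.sqrt(1.0 + x * x)

def hess(x):
    # Second derivative of f; it shrinks quickly for large |x|,
    # so the local quadratic (Taylor) model overshoots far from x = 0
    return (1.0 + x * x) ** -1.5

def newton_step(x):
    # Greedy update: jump straight to the minimum of the local quadratic model
    return x - grad(x) / hess(x)

def regularized_step(x, alpha):
    # Minimize the local quadratic model plus (alpha / 2) * d**2,
    # a term that punishes large update steps d
    return x - grad(x) / (hess(x) + alpha)

def run(step, x0, iters):
    xs = [x0]
    for _ in range(iters):
        xs.append(step(xs[-1]))
    return xs

newton = run(newton_step, 2.0, 4)
regularized = run(lambda x: regularized_step(x, alpha=1.0), 2.0, 60)
print("greedy Newton:", newton)          # magnitude explodes (roughly 2, -8, 512, ...)
print("regularized:", regularized[-1])   # close to the minimum at x = 0
```

Setting `alpha = 0` recovers the greedy Newton update; a larger `alpha` yields smaller, more cautious steps, trading convergence speed for stability.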
In these tasks, we assumed a known forward dynamics model. Typically, inaccurate model predictions have catastrophic effects on the numerical stability of SOC methods. In particular, if the model predictions are poor, the SOC method should not explore further but should instead collect more data around the current trajectory. Therefore, in the second part of this thesis, we investigated how to learn such a forward dynamics model with Gaussian processes in parallel to movement planning. The trade-off between exploration and exploitation can be regularized with the model uncertainty, which we introduced as an additional objective in AICO. We evaluated the simultaneous model learning and movement planning approach on a simple pendulum toy task. We achieved safe planning and stable convergence even with inaccurate learned models. During the learning process, the planned trajectories, torques, and end-effector positions converged to locally optimal solutions, and the prediction error of the learned forward dynamics model converged to zero.
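How a Gaussian process forward model supplies an uncertainty term can be sketched with textbook GP regression. This is a minimal illustration under stated assumptions, not the formulation used in AICO: it assumes a squared-exponential kernel with fixed, hand-picked hyperparameters, a hypothetical pendulum with gravity-to-length ratio 9.81, training data only for small angles, and a hypothetical weight `beta` that adds the predictive variance to a task cost.

```python
import numpy as np

G_OVER_L = 9.81  # hypothetical pendulum: gravity / length (unit length assumed)

def true_accel(theta):
    # Ground-truth pendulum dynamics, used here only to generate training data
    return -G_OVER_L * np.sin(theta)

def rbf(a, b, length=0.5, signal=1.0):
    # Squared-exponential kernel between two sets of 1-D inputs
    d = a[:, None] - b[None, :]
    return signal**2 * np.exp(-0.5 * (d / length) ** 2)

class GPForwardModel:
    """GP regression from pendulum angle to angular acceleration (illustrative)."""

    def __init__(self, x, y, noise=1e-2):
        self.x = x
        K = rbf(x, x) + noise**2 * np.eye(len(x))
        self.L = np.linalg.cholesky(K)
        # alpha = K^{-1} y, computed via the Cholesky factor K = L L^T
        self.alpha = np.linalg.solve(self.L.T, np.linalg.solve(self.L, y))

    def predict(self, xs):
        Ks = rbf(xs, self.x)
        mean = Ks @ self.alpha
        v = np.linalg.solve(self.L, Ks.T)
        # Posterior variance: prior variance minus the part explained by the data
        var = np.diag(rbf(xs, xs)) - np.sum(v**2, axis=0)
        return mean, np.maximum(var, 0.0)

def uncertainty_penalized_cost(task_cost, var, beta=10.0):
    # Hypothetical objective: punish states where the learned model is unsure
    return task_cost + beta * var

theta_data = np.linspace(-0.5, 0.5, 11)
model = GPForwardModel(theta_data, true_accel(theta_data))

queries = np.array([0.25, 3.0])  # one angle inside the data, one far outside
mean, var = model.predict(queries)
cost = uncertainty_penalized_cost(np.zeros_like(queries), var)
print("mean:", mean, "variance:", var, "penalized cost:", cost)
```

With equal task cost, the state far from the collected data receives the larger penalty, so a planner minimizing this objective is pushed to stay near the current data until the model becomes more certain there.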