TechTalks from event: ICML 2011
Robotics and Reinforcement Learning
Conjugate Markov Decision Processes
Many open problems involve the search for a mapping that is used by an algorithm solving an MDP. Useful mappings are often from the state set to some other set. Examples include representation discovery (a mapping to a feature space) and skill discovery (a mapping to skill termination probabilities). Different mappings result in algorithms achieving varying expected returns. In this paper we present a novel approach to searching, among the mappings that can be used by any algorithm attempting to solve an MDP, for the one that results in maximum expected return.
Approximate Dynamic Programming for Storage Problems
Storage problems are an important subclass of stochastic control problems. This paper presents a new method, approximate dynamic programming for storage (ADPS), to solve storage problems with continuous, convex decision sets. Unlike other solution procedures, ADPS allows math programming to be used to make decisions each time period, even in the presence of large state variables. We test ADPS on the day-ahead wind commitment problem with storage.
Apprenticeship Learning About Multiple Intentions
In this paper, we apply tools from inverse reinforcement learning (IRL) to the problem of learning from (unlabeled) demonstration trajectories of behavior generated by varying "intentions" or objectives. We derive an EM approach that clusters observed trajectories by inferring the objectives for each cluster using any of several possible IRL methods, and then uses the constructed clusters to quickly identify the intent of a trajectory. We show that a natural approach to IRL, a gradient ascent method that modifies reward parameters to maximize the likelihood of the observed trajectories, is successful at quickly identifying unknown reward functions. We demonstrate these ideas in the context of apprenticeship learning by acquiring the preferences of a human driver in a simple highway car simulator.
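The clustering idea in the abstract above can be illustrated with a generic EM soft-assignment step. This is only a minimal sketch, not the paper's algorithm: the function names `e_step` and `m_step_priors` are hypothetical, the per-cluster trajectory log-likelihoods are assumed to be supplied by whatever IRL method is in use, and the update of each cluster's reward parameters (the IRL step itself) is omitted.

```python
import math


def e_step(log_liks, priors):
    """Soft-assign trajectories to clusters.

    log_liks[i][j] is assumed to hold log P(trajectory i | cluster j),
    computed elsewhere by an IRL method (not shown here).
    priors[j] is the current mixing weight of cluster j.
    Returns resp, where resp[i][j] is the posterior probability that
    trajectory i belongs to cluster j.
    """
    resp = []
    for row in log_liks:
        scores = [math.log(p) + ll for p, ll in zip(priors, row)]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        resp.append([w / total for w in weights])
    return resp


def m_step_priors(resp):
    """Re-estimate mixing weights as the average responsibility per cluster."""
    n = len(resp)
    k = len(resp[0])
    return [sum(row[j] for row in resp) / n for j in range(k)]
```

In a full EM loop, the M-step would also refit each cluster's reward parameters using the responsibilities as trajectory weights; that part depends on the chosen IRL method and is left out of this sketch.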
Classification-based Policy Iteration with a Critic
In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
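The truncated-rollout idea described above can be sketched as follows: accumulate discounted rewards for a fixed number of steps, then add the critic's discounted value estimate at the state where the rollout stops. This is a minimal illustration under stated assumptions, not the DPI-Critic implementation: the function name `truncated_rollout_return` and the `step`, `policy`, and `critic` callables are hypothetical interfaces chosen for this example.

```python
def truncated_rollout_return(step, policy, critic, state, action, horizon, gamma):
    """Estimate Q(state, action) from one truncated rollout.

    Assumed (hypothetical) interfaces:
      step(s, a)  -> (next_state, reward), one environment transition
      policy(s)   -> action taken by the current policy
      critic(s)   -> approximate value of state s
    The first action is the one being evaluated; later actions follow the
    policy. After `horizon` steps the remaining return is replaced by the
    critic's estimate, discounted by gamma ** horizon.
    """
    total = 0.0
    discount = 1.0
    s, a = state, action
    for _ in range(horizon):
        s, r = step(s, a)
        total += discount * r
        discount *= gamma
        a = policy(s)
    return total + discount * critic(s)
```

A shorter horizon leans more on the critic (lower variance, more bias from critic error), while a longer horizon leans more on sampled rewards; in practice several such rollouts would be averaged per state-action pair.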