Skip to main content
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:14:28
10 Jun 2021

Value function approximation is a crucial module for policy evaluation in reinforcement learning when the state space is large or continuous. The present paper revisits policy evaluation via temporal-difference (TD) learning from the Gaussian process (GP) perspective. Leveraging random features to approximate the GP prior, an online scalable (OS) approach, termed {OS-GPTD}, is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs. To benchmark the performance of OS-GPTD even in the adversarial setting, where the modeling assumptions are violated, complementary worst-case analyses are performed. The cumulative Bellman error, as well as the long-term reward prediction error, are upper bounded relative to their counterparts from a fixed value function estimator with the entire state-reward trajectory in hindsight. Performance of the novel OS-GPTD is evaluated on two benchmark problems.

Chairs:
Seung-Jun Kim

Value-Added Bundle(s) Including this Product

More Like This

  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00