# Introduction to Reinforcement Learning

Notes from the lecture series Introduction to Reinforcement Learning by David Silver.

## Lecture 1: Introduction to Reinforcement Learning

### RL vs. the rest of ML

• There is no supervisor, only a reward.
• Feedback is delayed.
• Data is sequential, so samples across time steps are non-i.i.d.
• Actions affect the subsequent observations the agent receives.

### Reward hypothesis

All goals can be described by the maximization of expected cumulative reward.

### Formulation

At each time step $$t$$:

• The agent executes action $$A_t$$.
• The environment emits observation $$O_t$$.
• The environment emits reward $$R_t$$.

The history is the sequence of actions, observations, and rewards: $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$

The state is a function of the history: $S_t = f(H_t)$
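
A minimal sketch of this interaction loop; the `env` and `agent` objects here are hypothetical stand-ins, not any particular library's API:

```python
# Hypothetical agent-environment loop; `env` and `agent` are stand-ins.
def run_episode(env, agent, max_steps=1000):
    history = []                    # H_t: the stream of observations, rewards, actions
    obs, reward = env.reset(), 0.0
    for t in range(max_steps):
        action = agent.act(obs)                # agent executes A_t
        history.append((obs, reward, action))
        obs, reward, done = env.step(action)   # environment emits O_{t+1}, R_{t+1}
        if done:
            break
    return history
```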

### Environment vs. agent states

• The environment state $$S^e_t$$ is the environment's private representation.
• e.g. the internal memory of the game.
• It may contain information irrelevant to the agent.
• The agent state $$S^a_t$$ is the agent's internal representation.
• Derived from the history: $$S^a_t = f(H_t)$$.

### Fully vs. partially observable environments

Either the environment state is fully observable, $$O_t = S^a_t = S^e_t$$, or only partially observable, in which case $$S^a_t \neq S^e_t$$.

This is the difference between a Markov Decision Process (MDP) and a Partially Observable Markov Decision Process (POMDP).

### RL agent components

• The policy defines the agent’s behavior.
• Deterministically maps state to action: $$a_t = \pi(s_t)$$.
• Or stochastically: $$\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]$$.
• The value function predicts the future reward.
• $$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]$$
• Can also be conditioned on an action to be taken at $$S_t$$ (the action-value function).
• The model of the environment.
• Predicts what the environment will do next.
• $$\mathcal{P}$$ predicts next state, $$\mathcal{R}$$ predicts next reward.
• $$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]$$
• $$\mathcal{R}^a_{s} = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]$$

Not all agents will explicitly model all components, for example:

• Value-based agents infer their policy from their value function.
• Policy-based agents learn the policy directly, without maintaining a value function.
• Actor-critic agents learn both a value function and a policy.

#### Model-free vs. model-based

• Model-free agents have a policy and/or a value function, but no model.
• Model-based agents have a model (and may also have a policy and/or a value function).

## Lecture 2: Markov Decision Processes

### Markov property

A state $$S_t$$ is Markov iff $\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ... S_t]$

i.e. the future is independent of the past given the present.

i.e. the state is a sufficient statistic of the future.

### State transition matrix

• Given a current state $$s$$, get the probability of the next state $$s'$$.
• $$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]$$
• You can represent $$\mathcal{P}$$ as a $$(n \times n)$$ matrix where $$n$$ is the number of states, and each row sums to 1.
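
As a small illustration (the numbers are made up), a three-state transition matrix in numpy:

```python
import numpy as np

# P[s, s'] = P[S_{t+1} = s' | S_t = s]; each row is a distribution, so rows sum to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # an absorbing state
])
assert np.allclose(P.sum(axis=1), 1.0)
```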

### Markov Process/Markov Chain

A Markov Process is a tuple $$\langle\mathcal{S}, \mathcal{P}\rangle$$:

• $$\mathcal{S}$$ is a finite set of states.
• $$\mathcal{P}$$ is a state transition probability matrix.
• $$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]$$

### Markov Reward Process

A Markov Reward Process is a tuple $$\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$$:

• $$\mathcal{S}$$ is a finite set of states.
• $$\mathcal{P}$$ is a state transition probability matrix.
• $$\mathcal{R}$$ is a reward function
• $$\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]$$
• $$\gamma$$ is a discount factor.
• $$\gamma \in [0, 1]$$
• $$\gamma = 0$$ leads to “myopic” evaluation.
• $$\gamma = 1$$ leads to “far-sighted” evaluation.

#### Return

The return $$G_t$$ is the total discounted reward from time step $$t$$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
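
A quick sketch of computing a sampled return (the reward sequence is made up):

```python
# G_t for a sampled reward sequence R_{t+1}, R_{t+2}, ...
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81 * 2 = 2.62
```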

#### Why discount factors

• Mathematically convenient, since it keeps returns finite.
• Avoids infinite returns in cyclic Markov processes.

#### Value function

The state value function $$v(s)$$ is the expected return starting from $$s$$: $v(s) = \mathbb{E}[G_t | S_t = s]$

#### Bellman Equation for MRPs

• $$v(s) = \mathbb{E}[G_t | S_t = s]$$
• $$v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t + 1} | S_t = s]$$
• $$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s]$$
• $$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}v(s')$$

If we regard $$v$$ and $$\mathcal{R}$$ as column vectors with one entry per state, and $$\mathcal{P}$$ as a matrix, we can write: $v = \mathcal{R} + \gamma \mathcal{P} v$

#### Solving the Bellman Equation for small MRPs

• $$v = \mathcal{R} + \gamma \mathcal{P} v$$
• $$v - \gamma \mathcal{P} v = \mathcal{R}$$
• $$(I - \gamma \mathcal{P}) v = \mathcal{R}$$
• $$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$
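
The direct solve is $$O(n^3)$$, hence "small MRPs" only. A sketch with made-up values for $$\mathcal{P}$$ and $$\mathcal{R}$$:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])            # transition matrix (rows sum to 1)
R = np.array([1.0, -1.0])             # R_s = E[R_{t+1} | S_t = s]
# v = (I - gamma P)^{-1} R, via a linear solve rather than an explicit inverse
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)
```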

### Markov Decision Process

A Markov Decision Process is a tuple $$\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$$:

• $$\mathcal{S}$$ is a finite set of states.
• $$\mathcal{A}$$ is a finite set of actions.
• $$\mathcal{P}$$ is a state transition probability matrix.
• $$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]$$
• $$\mathcal{R}$$ is a reward function
• $$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]$$
• $$\gamma$$ is a discount factor.

#### Policy

A policy $$\pi$$ is a distribution over actions given states: $\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]$

#### Relationship to MPs and MRPs

Given an MDP and a policy:

• The state sequence $$S_1, S_2, ...$$ is a Markov Process.
• The state and reward sequence $$S_1, R_2, S_2, R_3, ...$$ is a Markov Reward Process.
• Where the transition matrix and reward function are averaged over the policy: $$\mathcal{P}^{\pi}_{ss'} = \sum_a \pi(a|s) \mathcal{P}^a_{ss'}$$ and $$\mathcal{R}^{\pi}_s = \sum_a \pi(a|s) \mathcal{R}^a_s$$.

#### State-value and action-value functions

• The state-value function:
• $$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$$
• The action-value function:
• $$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]$$

#### Bellman Equation for MDPs

We can define $$v$$ and $$q$$ in terms of each other:

• $$v(s) = \sum_{a} \pi(a | s) q(s, a)$$
• $$q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s')$$

And then we can expand so that they don’t depend on each other:

• $$v(s) = \sum_{a} \pi(a | s) (\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s'))$$
• $$q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \sum_{a'} \pi(a' | s') q(s', a')$$

#### Optimal value functions

We define the optimal value functions as:

• $$v_*(s) = \max_{\pi} v_{\pi}(s)$$
• $$q_*(s, a) = \max_{\pi} q_{\pi}(s, a)$$

We define an ordering over policies:

• $$\pi \geq \pi'$$ if $$v_{\pi}(s) \geq v_{\pi'}(s), \forall s$$
##### Optimal policy theorem
• There exists an optimal policy $$\pi_*$$ that is at least as good as all other policies: $$\pi_* \geq \pi, \forall \pi$$.
• All optimal policies achieve the optimal state-value function $$v_*$$.
• All optimal policies achieve the optimal action-value function $$q_*$$.
• We can derive an optimal policy from $$v_*$$ or $$q_*$$.
##### Bellman optimality equation for optimal value functions
• $$v_*(s) = \max_a q_*(s, a)$$
• $$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s')$$
• $$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s') \right)$$
• $$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$
##### Solving the Bellman optimality equation
• Non-linear, so no closed-form solution in general.
• We can apply iterative solutions: value iteration, policy iteration, Q-learning, SARSA…

#### Partially Observable Markov Decision Processes (POMDPs)

A POMDP is a tuple $$\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle$$:

• Equivalent to MDP for $$\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma$$.
• $$\mathcal{O}$$ is a finite set of observations.
• $$\mathcal{Z}$$ is an observation function.
• $$\mathcal{Z}^a_{s',o} = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]$$
##### Belief states
• The history is $$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$, as before.
• The belief state $$b(h)$$ is a probability distribution over states conditioned on $$h$$:
• $$b(h) = \mathbb{P}[S_t | h]$$

## Lecture 3: Planning by Dynamic Programming

### Iterative policy evaluation

• Problem: Evaluate a policy $$\pi$$.
• Solution: Iteratively apply Bellman equation.
• Method:
• At each iteration $$k + 1$$.
• For all states $$s \in \mathcal{S}$$.
• Update $$v_{k + 1}(s)$$ using $$v_k(s')$$, where $$s'$$ ranges over the successor states of $$s$$.
• This uses synchronous backups.
• This is proven to converge.

#### Equation

$v_{k + 1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_k$
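
A sketch of this, assuming the policy has already been averaged into a transition matrix `P_pi` and reward vector `R_pi` (the names are mine):

```python
import numpy as np

# Iterative policy evaluation with synchronous backups:
# v_{k+1} = R_pi + gamma * P_pi v_k, iterated until convergence.
def evaluate_policy(P_pi, R_pi, gamma=0.9, tol=1e-8):
    v = np.zeros(len(R_pi))
    while True:
        v_next = R_pi + gamma * P_pi @ v   # back up all states at once
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
```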

### Policy iteration

Given a policy $$\pi$$:

• Evaluate the policy using iterative policy evaluation.
• Improve the policy by acting greedily:
• $$\pi' = \text{greedy}(v_{\pi})$$
• $$\pi'(s) = \arg\max_a q_{\pi}(s, a)$$
• $$\pi'(s) = \arg\max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_{\pi}(s') \right)$$

### Value iteration

Like policy iteration, but the value function (and thus the implicit policy) is updated every iteration, and no explicit policy is maintained: $v_{k+1} = \max_a \left( \mathcal{R}^a + \gamma \mathcal{P}^a v_k \right)$
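
A sketch, assuming array layouts of `(actions, states, states)` for $$\mathcal{P}$$ and `(states, actions)` for $$\mathcal{R}$$ (the layout is my choice for illustration):

```python
import numpy as np

# Value iteration: v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ].
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] for all states and actions in one synchronous backup
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_next = q.max(axis=1)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next, q.argmax(axis=1)   # optimal values and a greedy policy
        v = v_next
```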

### Asynchronous dynamic programming

#### Synchronous vs. asynchronous

• Synchronous: All states are backed up in parallel.
• Asynchronous: States are backed up individually.
• This reduces computation.
• This is still guaranteed to converge if all states continue to be selected.

#### In-place

In-place dynamic programming updates each state one-by-one, while synchronous dynamic programming updates all states at once.

Practically, this means the $$v(s')$$ values used in a backup may have been updated different numbers of times, but they are always the most up-to-date values available.

#### Prioritized sweeping

Select the state to update by the magnitude of the Bellman error.

Practically, this requires knowing the predecessor states.

#### Real-time

Guide the updates by what the agent is currently experiencing.

• After each time step $$S_t, A_t, R_{t + 1}$$.
• $$v(S_t) \leftarrow \max_{a} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s'} \mathcal{P}^a_{S_ts'} v(s') \right)$$
• Note that we’re still iterating over all actions and successor states.

## Lecture 4: Model-Free Prediction

### Model-free

No access to the underlying MDP: we don’t know state transitions, rewards, etc.

This is opposed to dynamic programming, which solves a known MDP.

### Monte-Carlo Policy Evaluation (MCPE)

• Learn $$v_{\pi}$$ from episodes of experience under $$\pi$$.
• $$S_1, A_1, R_2, ... S_k \sim \pi$$
• Estimate the value as the empirical mean return, rather than computing the expected return.
• $$V(s)$$ is the mean of the returns sampled from $$s$$.
• No bootstrapping: Learns from complete episodes, so all episodes must terminate.

#### First-visit

• The first time $$s$$ is visited in an episode:
• Increment counter $$N(s) \leftarrow N(s) + 1$$.
• Update sum $$S(s) \leftarrow S(s) + G_t$$.
• (where $$G_t$$ is the observed cumulative discounted reward for the rest of the episode)
• Value is $$V(s) = S(s) / N(s)$$
• $$V(s) \to v_{\pi}(s)$$ as $$N(s) \to \infty$$

#### Every-visit

As first-visit, but $$N$$ and $$S$$ are updated every time a state is seen.

• Instead of storing $$S$$ and $$N$$ functions, we can store $$V(s)$$ directly:
• i.e. rewrite the running mean incrementally in terms of $$V(s)$$:
• $$N(s) \leftarrow N(s) + 1$$
• $$V(s) \leftarrow V(s) + \frac{1}{N(s)}(G_t - V(s))$$
• In non-stationary problems, we can tweak this to track a running mean and forget old episodes:
• $$V(s) \leftarrow V(s) + \alpha(G_t - V(s))$$
• Where $$\alpha$$ is a hyperparameter.
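
A sketch of the incremental every-visit update over one finished episode; the pairing of states with rewards and the dict-based tables are my assumptions:

```python
from collections import defaultdict

# Incremental every-visit MC evaluation for one finished episode.
# `episode` is a list of (S_t, R_{t+1}) pairs; V and N are defaultdicts.
def mc_update(V, N, episode, gamma=0.9, alpha=None):
    G = 0.0
    # Walk backwards so G accumulates G_t = R_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])   # running mean, or constant-alpha

V, N = defaultdict(float), defaultdict(int)
mc_update(V, N, [("s0", 0.0), ("s1", 1.0)], gamma=0.9)
```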

### Temporal-Difference Policy Evaluation (TDPE)

• Like MCPE.
• But learns from incomplete episodes by bootstrapping.
• Updates a guess towards a guess.

#### TD(0)

• $$R_{t+1} + \gamma V(S_{t+1})$$ is the TD target.
• $$\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)$$ is the TD error.
• $$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$
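
A minimal sketch of the TD(0) update, with `V` as a dict-like table (my naming):

```python
# One TD(0) update from a single observed transition (s, r, s_next).
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    td_target = r + (0.0 if done else gamma * V[s_next])
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error
```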

#### TD($$\lambda$$)

##### N-step TD
• Define an $$n$$-step return:
• $$G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
• $$n$$-step TD learning:
• $$V(S_t) \leftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_t))$$
##### Lambda return
• Average over all $$n$$-step returns $$G^{(n)}_t$$.
• Use weighting $$(1 - \lambda)\lambda^{n-1}$$
• The sum of the weights is one.
• $$G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$$
##### Forward-view
• $$V(S_t) \leftarrow V(S_t) + \alpha (G^{\lambda}_t - V(S_t))$$
• Looks to the future to compute $$G^{\lambda}_t$$
• Like MC, requires complete episodes.
##### Backward-view
• Same theory as forward-view, but actually practical to implement.
• Update online, every step, from incomplete episodes.
• Keep an eligibility trace for every state $$s$$.
• $$E_0(s) = 0$$
• $$E_t(s) = \gamma \lambda E_{t-1}(s) + \textbf{1}(S_t = s)$$
• Intuitively, this represents both the frequency and recency of the state for credit assignment.
• Update $$V(s)$$ in proportion to $$E(s)$$.
• $$\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)$$
• $$V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
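
A sketch of one backward-view step, with `V` and `E` as defaultdicts (my naming):

```python
from collections import defaultdict

# One online backward-view TD(lambda) step; V and E are defaultdict(float).
def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8, done=False):
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
    for state in list(E):
        E[state] *= gamma * lam       # decay all traces
    E[s] += 1.0                       # bump the trace for the current state
    for state, trace in E.items():
        V[state] += alpha * delta * trace
    return delta
```
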
##### TD(0) and TD(1)
• $$\lambda = 0$$ is equivalent to TD(0).
• $$\lambda = 1$$ is roughly equivalent to every-visit MC (exactly so for offline updates).

### MC vs. TD

• TD can learn before knowing the final outcome.
• TD does not need complete episodes.
• The MC target (the return) is an unbiased estimate of $$v_{\pi}(S_t)$$; the TD target is biased.
• TD targets have much lower variance than MC targets: the return depends on many random actions, transitions, and rewards, while the TD target depends on only one of each.
• TD is usually more efficient than MC.
• TD is sensitive to the initial value.
• TD exploits the Markov Property.

## Lecture 5: Model-Free Control

### On-policy vs. off-policy

• On-policy, “learn on the job”:
• Learn about $$\pi$$ from experience sampled from $$\pi$$.
• Off-policy, “look over someone’s shoulder”:
• Learn about target policy $$\pi$$ from experience sampled from a different behavior policy $$\mu$$.

### Epsilon-greedy policy improvement

• Idea for continuing exploration.
• Used when the full MDP is not known.
• With probability $$1 - \epsilon$$ choose the greedy action.
• With probability $$\epsilon$$ choose a random action.
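
A minimal sketch, with `Q` as a table keyed by `(state, action)` pairs (an assumed layout):

```python
import random

# epsilon-greedy selection over a Q table keyed by (state, action).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # act greedily
```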

### Greedy in the Limit with Infinite Exploration (GLIE)

• Explore all state-action pairs infinitely many times.
• Converge onto a greedy policy.
• E.g. $$\epsilon$$-greedy is GLIE if $$\epsilon$$ decays to zero at a suitable rate, e.g. $$\epsilon_k = \frac{1}{k}$$.

### Monte-Carlo control

• Sample an episode $$S_1, A_1, R_2, ..., S_T \sim \pi$$.
• For each time step $$t$$:
• $$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
• $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))$$
• Improve the policy:
• $$\epsilon \leftarrow \frac{1}{k}$$
• $$\pi \leftarrow \epsilon \text{-greedy}(Q)$$
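
A sketch of the update for one finished episode of `(state, action, reward)` triples (an assumed layout; every-visit variant):

```python
from collections import defaultdict

# One GLIE MC control update from a finished episode of (S_t, A_t, R_{t+1})
# triples; k is the episode counter used to decay epsilon.
def glie_mc_update(Q, N, episode, k, gamma=0.9):
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                         # return from time t
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return 1.0 / k                                # epsilon for the next episode

Q, N = defaultdict(float), defaultdict(int)
```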

### Off-policy learning

#### Importance sampling

• Use returns sampled from behavior policy $$\mu$$ to evaluate target policy $$\pi$$.
• Weight samples by how similar the two policies are.
• For example, for MC:
• $$G^{\pi / \mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{\mu(A_{t+1} | S_{t+1})} ... G_t$$
• For example, for TD:
• $$\delta^{\pi/\mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \delta_t$$

#### Q-learning

• Target policy $$\pi$$ is different from behavior policy $$\mu$$.
• But, we don’t use importance sampling.
• Target policy $$\pi$$ is greedy wrt. $$Q(s, a)$$.
• $$\pi(s) = argmax_a Q(s, a)$$
• Behavior policy $$\mu$$ is $$\epsilon$$-greedy wrt. $$Q(s, a)$$.
• So, Q-learning target is simplified:
• $$R_{t+1} + \gamma Q(S_{t+1}, A')$$
• $$R_{t+1} + \gamma Q(S_{t+1}, \arg \max_{a'} Q(S_{t+1}, a'))$$
• $$R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$$
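
A sketch of one Q-learning update under these choices (the table layout is an assumption):

```python
# One Q-learning update: behave epsilon-greedily, learn about the greedy policy.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    max_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * max_next           # R_{t+1} + gamma * max_a' Q(S_{t+1}, a')
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```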

## Lecture 6: Value Function Approximation

### Value function approximation

• Tabular methods don’t scale to large numbers of states/actions.
• We use function approximators instead for e.g. $$V(s), Q(s, a)$$.
• We use batch-updates for sample efficiency.
• Otherwise, everything else pretty much stays the same.
• Some algorithms lose their convergence guarantees under function approximation, but in practice they usually still converge.

### Experience replay

• Gain experience consisting of $$\langle state, value \rangle$$ pairs:
• $$\mathcal{D} = \{(s_1, v^{\pi}_1), (s_2, v^{\pi}_2), ...\}$$
• Sample batches from $$\mathcal{D}$$ and apply gradient descent.

### Deep Q-Networks

Uses experience replay and fixed Q-targets.

• Take action $$a_t$$ according to $$\epsilon$$-greedy policy.
• Store transition $$s_t, a_t, r_{t+1}, s_{t+1}$$ in $$\mathcal{D}$$.
• Sample random mini-batch of transitions $$(s, a, r, s')$$ from $$\mathcal{D}$$.
• Optimize the Q-network weights $$w_i$$ against frozen target-network weights $$w^-_i$$:

$\mathcal{L}(w_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}} [ (r + \gamma \max_{a'} Q(s', a' ; w^-_i) - Q(s, a; w_i))^2 ]$
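
A sketch of the fixed-target computation for a sampled mini-batch; `target_net` is a hypothetical callable mapping a batch of states to per-action values, not any particular library's API:

```python
# Hypothetical sketch: `target_net` returns Q(s', . ; w^-) for a batch of
# states; r and done are arrays over the mini-batch.
def dqn_targets(target_net, r, s_next, done, gamma=0.99):
    q_next = target_net(s_next)                 # shape (batch, n_actions)
    max_next = q_next.max(axis=1)               # max_a' Q(s', a'; w^-)
    return r + gamma * max_next * (1.0 - done)  # fixed Q-target

# The loss L(w_i) is the mean squared error between these targets and
# Q(s, a; w_i) from the online network, minimized w.r.t. w_i only;
# the target weights w^- are periodically copied from w.
```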

## Lecture 7: Policy Gradient Methods

### Policy objective functions

How do we score policies?

• Start value:
• $$J(\theta) = V_\theta(s_1)$$
• Average value:
• $$J(\theta) = \sum_s d(s) V_\theta(s)$$
• where $$d(s)$$ is the stationary distribution of the Markov chain for $$\pi_\theta$$
• Average reward per time step:
• $$J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s$$

### Likelihood ratio trick

• $$\nabla_\theta \pi_\theta(s, a)$$
• $$= \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$$
• $$= \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$$
• The score function is $$\nabla_\theta \log \pi_\theta(s, a)$$

#### Policy gradient theorem

• For any differentiable policy $$\pi_\theta(s, a)$$.
• For any of the start value, average value, or average reward per time step objectives.
• The policy gradient is: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) Q^{\pi_\theta}(s, a) ]$

#### One-step MDP (illustration)

• Starting in a state $$s \sim d(s)$$.
• Terminating after one time step.
• Given reward $$r = \mathcal{R}^a_s$$.
• $$J(\theta) = \mathbb{E}_{\pi_\theta}[r]$$
• $$J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s$$
• $$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s$$
• $$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d, a \sim \pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s ]$$

#### REINFORCE (Monte-Carlo policy gradient)

• Update the policy parameters by stochastic gradient ascent on the policy gradient.
• The target $$v_t$$ is the sampled cumulative discounted reward (the return).
• For each episode, for each time step $$t$$ in the episode:
• $$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t$$
• where $$\theta$$ are the policy parameters and $$v_t$$ is the return from time step $$t$$.
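
A sketch of REINFORCE with a tabular softmax policy, for which the score function has the closed form $$\text{one-hot}(a) - \pi(\cdot|s)$$; the tabular parameterization is my choice for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# REINFORCE with a tabular softmax policy: theta has shape (n_states, n_actions)
# and grad_theta[s] log pi(a|s) = one_hot(a) - pi(.|s).
def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    # episode: list of (S_t, A_t, R_{t+1}) triples with integer states/actions
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), v_t in zip(episode, returns):
        pi = softmax(theta[s])
        score = -pi
        score[a] += 1.0                  # score function for the chosen action
        theta[s] += alpha * score * v_t  # theta <- theta + alpha * score * v_t
```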

#### Criticism of REINFORCE

The Monte-Carlo estimates are high variance, which makes learning a policy harder.

### Actor-Critic methods

• A critic estimates the Q-values.
• An actor updates its policy in the direction of the Q-values.
• Reduces variance compared to REINFORCE due to bootstrapping.
• We can use actors & critics at different time scales (MC, TD(0), TD($$\lambda$$)).

#### Compatible Function Approximation theorem

The policy gradient is exact in Actor-Critic methods if:

• The value function approximator is compatible with the policy:
• $$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)$$
• Value function parameters minimize the mean-squared error.

#### Advantage function

• $$Q(s, a)$$ varies a lot across states, and this state-dependent magnitude inflates the variance of the gradient estimate.
• Subtracting a state-dependent baseline gives the advantage function $$A(s, a) = Q(s, a) - V(s)$$, which reduces variance without changing the expected gradient.
• We don't need to learn both $$Q$$ and $$V$$: the TD error $$\delta$$ is an unbiased sample of the advantage:
• $$\mathbb{E}[\delta | s, a] = \mathbb{E}[r + \gamma V(s') | s, a] - V(s) = Q(s, a) - V(s)$$
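
A sketch of one actor-critic step using the TD error as the advantage estimate, with a tabular softmax actor and a tabular critic (the parameterization is my choice):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# One actor-critic step: the critic is a tabular V (array indexed by state),
# the actor a tabular softmax policy; the TD error delta samples A(s, a).
def actor_critic_step(theta, V, s, a, r, s_next, alpha_actor=0.01,
                      alpha_critic=0.1, gamma=0.99, done=False):
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
    V[s] += alpha_critic * delta                             # critic update
    pi = softmax(theta[s])
    score = -pi
    score[a] += 1.0                                          # score function
    theta[s] += alpha_actor * score * delta                  # actor update
    return delta
```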