Introduction to Reinforcement Learning
Going through lecture series: Introduction to Reinforcement Learning by David Silver
Lecture 1: Introduction to Reinforcement Learning
RL vs. the rest of ML
- There is no supervisor, only a reward.
- Feedback is delayed.
- Data is sequential, not i.i.d.: successive time steps are correlated.
- The agent's actions affect the subsequent observations it receives.
Reward hypothesis
All goals can be described by the maximization of expected cumulative reward.
Formulation
At each time step \(t\):
- The agent executes action \(A_t\).
- The environment emits observation \(O_t\).
- The environment emits reward \(R_t\).
The history is the sequence of actions, observations, and rewards:
\[ H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t \]
The state is a function of the history:
\[ S_t = f(H_t) \]
Environment vs. agent states
- The environment state \(S^e_t\) is the environment's private representation.
- e.g. the internal memory of the game.
- It may contain irrelevant information.
- The agent state \(S^a_t\) is the agent's internal representation.
- Derived from the history, \(S^a_t = f(H_t)\).
Fully vs. partially observable environments
The environment is fully observable when the agent directly observes the environment state, \(O_t = S^a_t = S^e_t\); otherwise it is only partially observable, \(S^a_t \neq S^e_t\).
This is the difference between a Markov Decision Process (MDP) and a Partially Observable Markov Decision Process (POMDP).
RL agent components
- The policy defines the agent's behavior.
- Maps state to action: \(a = \pi(s)\).
- Or stochastically: \(\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]\).
- The value function predicts the future reward.
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... \mid S_t = s]\)
- Can also be conditioned on an action to be taken at \(S_t\).
- The model of the environment.
- Predicts what the environment will do next.
- \(\mathcal{P}\) predicts next state, \(\mathcal{R}\) predicts next reward.
- \(\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]\)
- \(\mathcal{R}^a_{s} = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]\)
Not all agents will explicitly model all components, for example:
- Value-based agents derive their policy implicitly from their value function.
- Policy-based agents learn the policy directly and do not learn a value function.
- Actor-critic agents will learn both the value function and the policy.
Model-free vs. model-based
- Model-free agents have a policy and/or a value function, but no model.
- Model-based agents have a model (and possibly a policy and value function too).
Lecture 2: Markov Decision Processes
Markov property
A state \(S_t\) is Markov iff \[ \mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ..., S_t] \]
i.e. the future is independent of the past given the present.
i.e. the state is a sufficient statistic of the future.
State transition matrix
- Given a current state \(s\), get the probability of the next state \(s'\).
- \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]\)
- You can represent \(\mathcal{P}\) as a \((n \times n)\) matrix where \(n\) is the number of states, and each row sums to 1.
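For instance, a toy two-state chain (hypothetical numbers) can be written directly as such a matrix:

```python
import numpy as np

# Hypothetical two-state Markov chain: row = current state, column = next state.
P = np.array([
    [0.9, 0.1],   # from state 0: stay with prob. 0.9, move to state 1 with prob. 0.1
    [0.5, 0.5],   # from state 1: move to either state with equal probability
])

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
```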
Markov Process/Markov Chain
A Markov Process is a tuple \(\langle\mathcal{S}, \mathcal{P}\rangle\):
- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]\)
Markov Reward Process
A Markov Reward Process is a tuple \(\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle\):
- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{R}\) is a reward function.
- \(\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]\)
- \(\gamma\) is a discount factor.
- \(\gamma \in [0, 1]\)
- \(\gamma = 0\) leads to “myopic” evaluation.
- \(\gamma = 1\) leads to “far-sighted” evaluation.
Return
The return \(G_t\) is the total discounted reward from time step \(t\):
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t + 3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} \]
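A quick sanity check of the definition, using a hypothetical constant reward stream to see the geometric sum converge:

```python
def discounted_return(rewards, gamma):
    """G_t given the rewards R_{t+1}, R_{t+2}, ... observed after time step t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A reward of 1 at every step with gamma = 0.5 gives a return close to 1 / (1 - 0.5) = 2.
print(discounted_return([1.0] * 50, gamma=0.5))
```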
Why discount factors
- Mathematically convenient (keeps returns finite and well-behaved).
- Avoids infinite returns in cyclic Markov processes.
- Reflects uncertainty about the future.
Value function
The state value function \(v(s)\) is the expected return starting from \(s\):
\[ v(s) = \mathbb{E}[G_t | S_t = s] \]
Bellman Equation for MRPs
- \(v(s) = \mathbb{E}[G_t | S_t = s]\)
- \(v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t + 1} | S_t = s]\)
- \(v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s]\)
- \(v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}v(s')\)
If we stack \(v\) and \(\mathcal{R}\) into column vectors and regard \(\mathcal{P}\) as a matrix, we can write:
\[ v = \mathcal{R} + \gamma \mathcal{P} v \]
Solving the Bellman Equation for small MRPs
- \(v = \mathcal{R} + \gamma \mathcal{P} v\)
- \(v - \gamma \mathcal{P} v = \mathcal{R}\)
- \((I - \gamma \mathcal{P}) v = \mathcal{R}\)
- \(v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}\)
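A minimal sketch of this direct solution in numpy, assuming the MRP is given as a reward vector and a transition matrix (hypothetical two-state example):

```python
import numpy as np

def solve_mrp(R, P, gamma):
    """Exact MRP value function: solve (I - gamma * P) v = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical two-state MRP: state 1 is absorbing and gives no reward.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
print(solve_mrp(R, P, gamma=0.9))   # v(0) ~ 5.26, v(1) = 0
```

Using `np.linalg.solve` avoids forming the inverse explicitly; the roughly cubic cost in the number of states is why this direct solution only works for small MRPs.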
Markov Decision Process
A Markov Decision Process is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\):
- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{A}\) is a finite set of actions.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]\)
- \(\mathcal{R}\) is a reward function.
- \(\mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]\)
- \(\gamma\) is a discount factor.
Policy
A policy \(\pi\) is a distribution over actions given states:
\[ \pi(a | s) = \mathbb{P}[A_t = a | S_t = s] \]
Relationship to MPs and MRPs
Given an MDP and a policy:
- The state sequence \(S_1, S_2, ...\) is a Markov Process.
- The state and reward sequence \(S_1, R_2, S_2, R_3, ...\) is a Markov Reward Process.
- Where the reward function and state transition matrix are averaged over the policy's action distribution.
State-value and action-value functions
- The state-value function:
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]\)
- The action-value function:
- \(q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]\)
Bellman Equation for MDPs
We can define \(v\) and \(q\) in terms of each other:
- \(v(s) = \sum_{a} \pi(a | s) q(s, a)\)
- \(q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s')\)
And then we can expand so that they don’t depend on each other:
- \(v(s) = \sum_{a} \pi(a | s) (\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s'))\)
- \(q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \sum_{a'} \pi(a' | s') q(s', a')\)
Optimal value functions
We define the optimal value functions as:
- \(v_*(s) = \max_{\pi} v_{\pi}(s)\)
- \(q_*(s, a) = \max_{\pi} q_{\pi}(s, a)\)
We define an ordering over policies:
- \(\pi \geq \pi'\) if \(v_{\pi}(s) \geq v_{\pi'}(s), \forall s\)
Optimal policy theorem
- There exists an optimal policy \(\pi_*\) that is better than all other policies \(\pi_* \geq \pi, \forall \pi\)
- All optimal policies achieve the optimal state-value function \(v_*\).
- All optimal policies achieve the optimal action-value function \(q_*\).
- We can derive an optimal policy from \(v_*\) or \(q_*\).
Bellman optimality equation for optimal value functions
- \(v_*(s) = \max_a q_*(s, a)\)
- \(q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s')\)
- \(v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s') \right)\)
- \(q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')\)
Solving the Bellman optimality equation
- Non-linear, so no closed-form solution in general.
- We can apply iterative solutions: value iteration, policy iteration, Q-learning, SARSA…
Partially Observable Markov Decision Processes (POMDPs)
A POMDP is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle\):
- Equivalent to MDP for \(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\).
- \(\mathcal{O}\) is a finite set of observations.
- \(\mathcal{Z}\) is an observation function.
- \(\mathcal{Z}^a_{s',o} = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]\)
Belief states
- The history is \(H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t\).
- The belief state \(b(h)\) is a probability distribution over states conditioned on the history \(h\):
- \(b(h) = (\mathbb{P}[S_t = s^1 | H_t = h], ..., \mathbb{P}[S_t = s^n | H_t = h])\)
Lecture 3: Planning by Dynamic Programming
Iterative policy evaluation
- Problem: Evaluate a policy \(\pi\).
- Solution: Iteratively apply Bellman equation.
- Method:
- At each iteration \(k + 1\).
- For all states \(s \in \mathcal{S}\).
- Update \(v_{k + 1}(s)\) from \(v_k(s')\), where \(s'\) ranges over the successor states of \(s\).
- This uses synchronous backups.
- This is proven to converge.
Equation
\[ v_{k + 1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_k \]
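A minimal sketch of the synchronous backup, assuming the policy has already been folded into a reward vector \(\mathcal{R}^{\pi}\) and transition matrix \(\mathcal{P}^{\pi}\):

```python
import numpy as np

def iterative_policy_evaluation(R_pi, P_pi, gamma, tol=1e-8):
    """Repeat the synchronous backup v_{k+1} = R^pi + gamma * P^pi v_k until it converges."""
    v = np.zeros(len(R_pi))
    while True:
        v_new = R_pi + gamma * P_pi @ v   # every state updated from the previous iterate
        if np.max(np.abs(v_new - v)) < tol:   # assumes gamma < 1 so the backup contracts
            return v_new
        v = v_new
```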
Policy iteration
Given a policy \(\pi\):
- Evaluate the policy using iterative policy evaluation.
- Improve the policy by acting greedily:
- \(\pi' = \text{greedy}(v_{\pi})\)
- \(\pi'(s) = \arg\max_a q_{\pi}(s, a)\)
- \(\pi'(s) = \arg\max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_{\pi}(s') \right)\)
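A sketch of the greedy improvement step, assuming (hypothetically) that rewards are stored as an (actions × states) array `R` and transitions as an (actions × states × states) array `P`:

```python
import numpy as np

def greedy_improvement(v, R, P, gamma):
    """pi'(s) = argmax_a ( R[a, s] + gamma * sum_s' P[a, s, s'] * v[s'] )."""
    q = R + gamma * P @ v      # action-values, shape (n_actions, n_states)
    return q.argmax(axis=0)    # index of the greedy action for every state
```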
Value iteration
Like policy iteration, but we update the value function (and thus the policy) every iteration. There is no explicit policy.
\[ v_{k+1} = \max_a \left( \mathcal{R}^a + \gamma \mathcal{P}^a v_k \right) \]
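Value iteration is then a small extension of the same sketch (same assumed array layout as above):

```python
import numpy as np

def value_iteration(R, P, gamma, n_iters=1000):
    """Apply the Bellman optimality backup repeatedly; no explicit policy is stored."""
    v = np.zeros(R.shape[1])
    for _ in range(n_iters):
        q = R + gamma * P @ v    # action-values, shape (n_actions, n_states)
        v = q.max(axis=0)        # max over actions, per state
    return v
```

A greedy policy can be read off at the end with `q.argmax(axis=0)`, as in the policy-improvement sketch above.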
Asynchronous dynamic programming
Synchronous vs. asynchronous
- Synchronous: All states are backed up in parallel.
- Asynchronous: States are backed up individually.
- This reduces computation.
- This is still guaranteed to converge if all states continue to be selected.
In-place
In-place dynamic programming keeps a single copy of the value function and uses the latest values immediately, while synchronous dynamic programming computes every new value from the previous sweep's copy.
Practically, this means the \(v(s')\) values used in a backup may have been updated a different number of times, but they are more up-to-date.
Prioritized sweeping
Select the state to update by the magnitude of the Bellman error.
Practically, this requires knowing the predecessor states.
Real-time
Guide the updates by what the agent is currently experiencing.
- After each time step \(S_t, A_t, R_{t + 1}\).
- \(v(S_t) \leftarrow \max_{a} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s'} \mathcal{P}^a_{S_t s'} v(s') \right)\)
- Note that we’re still iterating over all actions and successor states.
Lecture 4: Model-Free Prediction
Model-free
No access to the underlying MDP: we don’t know state transitions, rewards, etc.
This is opposed to dynamic programming, which solves a known MDP.
Monte-Carlo Policy Evaluation (MCPE)
- Learn \(v_{\pi}\)
from episodes of experience under \(\pi\).
- \(S_1, A_1, R_2, ... S_k \sim \pi\)
- Estimate the empirical mean return rather than the expected return.
- \(V(s)\) is the mean of the sampled returns observed from \(s\).
- No bootstrapping: Learns from complete episodes, so all episodes must terminate.
First-visit
- The first time \(s\) is visited in an episode:
- Increment counter \(N(s) \leftarrow N(s) + 1\).
- Update sum \(S(s) \leftarrow S(s) + G_t\).
- (where \(G_t\) is the observed cumulative discounted reward for the rest of the episode)
- Value is \(V(s) = S(s) / N(s)\)
- \(V(s) \to v_{\pi}(s)\) as \(N(s) \to \infty\)
Every-visit
As first-visit, but \(N\) and \(S\) are updated every time a state is seen.
Incremental updates
- Instead of storing the sum \(S(s)\), we can maintain \(V(s)\) directly as a running mean:
- \(N(s) \leftarrow N(s) + 1\)
- \(V(s) \leftarrow V(s) + \frac{1}{N(s)}(G_t - V(s))\)
- In non-stationary problems, we can tweak this to track a running mean and forget old episodes:
- \(V(s) \leftarrow V(s) + \alpha(G_t - V(s))\)
- Where \(\alpha\) is a hyperparameter.
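A minimal sketch of both incremental updates, assuming \(V\) and \(N\) are plain dicts keyed by state:

```python
def mc_update(V, N, state, G, alpha=None):
    """Incremental Monte-Carlo update for one observed return G from `state`.

    alpha=None gives the exact running mean (step size 1/N(s)); a fixed alpha
    gives the forgetful running mean used for non-stationary problems.
    """
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]
    V[state] = V.get(state, 0.0) + step * (G - V.get(state, 0.0))
```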
Temporal-Difference Policy Evaluation (TDPE)
- Like MCPE.
- But learns from incomplete episodes by bootstrapping.
- Updates a guess towards a guess.
TD(0)
- \(R_{t+1} + \gamma V(S_{t+1})\) is the TD target.
- \(\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\) is the TD error.
- \(V(S_t) \leftarrow V(S_t) + \alpha \delta_t\)
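A minimal sketch of the TD(0) backup, assuming \(V\) is a dict of state values (missing states default to 0):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Move V(s) towards the TD target r + gamma * V(s')."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```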
TD(lambda)
N-step TD
- Define an \(n\)-step return:
- \(G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\)
- \(n\)-step TD learning:
- \(V(S_t) \leftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_t))\)
Lambda return
- Average over all \(n\)-step returns \(G^{(n)}_t\).
- Use weighting \((1 - \lambda)\lambda^{n-1}\).
- The sum of the weights is one.
- \(G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t\)
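A quick check that the weights really do sum to one (truncated geometric series):

```python
lam = 0.8
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 500)]
print(sum(weights))   # -> ~1.0, so the lambda-return is a proper weighted average
```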
Forward-view
- \(V(S_t) \leftarrow V(S_t) + \alpha (G^{\lambda}_t - V(S_t))\)
- Looks to the future to compute \(G^{\lambda}_t\)
- Like MC, requires complete episodes.
Backward-view
- Same theory as forward-view, but actually practical to implement.
- Update online, every step, from incomplete episodes.
- Keep an eligibility trace for every state \(s\):
- \(E_0(s) = 0\)
- \(E_t(s) = \gamma \lambda E_{t-1}(s) + \textbf{1}(S_t = s)\)
- Intuitively, this represents both the frequency and recency of the state for credit assignment.
- Update \(V(s)\) in proportion to \(E_t(s)\).
- \(\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\)
- \(V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)\)
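A minimal sketch of one backward-view step, assuming \(V\) and \(E\) are dicts (missing entries default to 0):

```python
def td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam):
    """Online backward-view TD(lambda): one step updates every state with a trace."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)     # TD error for this step
    for state in list(E):
        E[state] *= gamma * lam                                 # decay all traces
    E[s] = E.get(s, 0.0) + 1.0                                  # bump the visited state
    for state, trace in E.items():
        V[state] = V.get(state, 0.0) + alpha * delta * trace    # credit by frequency/recency
    return delta
```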
TD(0) and TD(1)
- \(\lambda = 0\) gives exactly the TD(0) update.
- \(\lambda = 1\) gives (with offline updates) the same total update as every-visit MC.
MC vs. TD
- TD can learn before knowing the final outcome.
- TD does not need complete episodes.
- MC has an unbiased estimate of the total return, TD does not.
- TD targets have lower variance than MC, as the total return can depend on many random actions.
- TD is usually more efficient than MC.
- TD is sensitive to the initial value.
- TD exploits the Markov Property.
Lecture 5: Model-Free Control
On-policy vs. off-policy
- On-policy, “learn on the job”:
- Learn about \(\pi\) from experience sampled from \(\pi\).
- Off-policy, “look over someone’s shoulder”:
- Learn about the target policy \(\pi\) from experience sampled from a behavior policy \(\mu\).
Epsilon-greedy policy improvement
- Idea for continuing exploration.
- Used when the full MDP is not known.
- With probability \(1 - \epsilon\) choose the greedy action.
- With probability \(\epsilon\) choose a random action.
Greedy in the Limit with Infinite Exploration (GLIE)
- Explore all state-action pairs infinitely many times.
- Converge onto a greedy policy.
- E.g. \(\epsilon\)-greedy is GLIE if \(\epsilon\) decays to zero, e.g. \(\epsilon_k = \frac{1}{k}\).
Monte-Carlo control
- Sample an episode \(S_1, A_1, R_2, ..., S_T \sim \pi\).
- For each time step \(t\):
- \(N(S_t, A_t) \leftarrow N(S_t, A_t) + 1\)
- \(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))\)
- Improve the policy:
- \(\epsilon \leftarrow \frac{1}{k}\)
- \(\pi \leftarrow \epsilon \text{-greedy}(Q)\)
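A minimal sketch of one GLIE Monte-Carlo control iteration, assuming tabular \(Q\) and \(N\) stored as defaultdicts and every-visit updates:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates Q(s, a)
N = defaultdict(int)     # visit counts N(s, a)

def mc_control_episode(episode, gamma, k):
    """Update Q from one finished episode of (state, action, reward) triples.

    The reward in each triple is the one received after taking the action;
    k is the episode index, so that epsilon = 1/k decays over time (GLIE).
    """
    G = 0.0
    for state, action, reward in reversed(episode):   # accumulate returns backwards
        G = reward + gamma * G
        N[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return 1.0 / k   # new epsilon for the behavior policy

def epsilon_greedy(state, actions, epsilon):
    """Random action with probability epsilon, otherwise the greedy one wrt Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```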
Sarsa(lambda)
- Like Monte-Carlo control, but the \(Q(s, a)\) values are updated with TD(\(\lambda\)) targets and eligibility traces over state-action pairs, instead of full-episode returns.
Off-policy learning
Importance sampling
- Use returns generated under the behavior policy \(\mu\) to evaluate the target policy \(\pi\).
- Weight samples by how similar the two policies are.
- For example, for MC:
- \(G^{\pi / \mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{\mu(A_{t+1} | S_{t+1})} ... G_t\)
- For example, for TD:
- \(\delta^{\pi/\mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \delta_t\)
Q-learning
- Target policy \(\pi\) is different from behavior policy \(\mu\).
- But, we don’t use importance sampling.
- Target policy \(\pi\) is greedy wrt. \(Q(s, a)\).
- \(\pi(s) = \arg\max_a Q(s, a)\)
- Behavior policy \(\mu\) is \(\epsilon\)-greedy wrt. \(Q(s, a)\).
- So, Q-learning target is simplified:
- \(R_{t+1} + \gamma Q(S_{t+1}, A')\)
- \(R_{t+1} + \gamma Q(S_{t+1}, \arg \max_{a'} Q(S_{t+1}, a'))\)
- \(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')\)
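A minimal sketch of the resulting update rule, assuming a tabular \(Q\) stored as a defaultdict:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup: bootstrap from the greedy action in the next state."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```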
Lecture 6: Value Function Approximation
Value function approximation
- Tabular methods don’t scale to large numbers of states/actions.
- We use function approximators instead for e.g. \(V(s), Q(s, a)\).
- We use batch-updates for sample efficiency.
- Otherwise, everything else pretty much stays the same.
- Some algorithms aren’t guaranteed to converge with function approximation, but in practice they often do.
Experience replay
- Gain experience consisting of \(\langle \text{state}, \text{value} \rangle\) pairs:
- \(\mathcal{D} = \{(s_1, v^{\pi}_1), (s_2, v^{\pi}_2), ...\}\)
- Sample batches from \(\mathcal{D}\) and apply gradient descent.
Deep Q-Networks (DQN)
Uses experience replay and fixed Q-targets.
- Take action \(a_t\) according to an \(\epsilon\)-greedy policy.
- Store transition \((s_t, a_t, r_{t+1}, s_{t+1})\) in replay memory \(\mathcal{D}\).
- Sample random mini-batch of transitions \((s, a, r, s')\) from \(\mathcal{D}\).
- Optimize the mean-squared error between the Q-network (weights \(w_i\)) and the Q-learning targets computed with frozen target-network weights \(w^-_i\):
\[ \mathcal{L}(w_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}} [ (r + \gamma \max_{a'} Q(s', a' ; w^-_i) - Q(s, a; w_i))^2 ] \]
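A minimal sketch of that loss over a sampled batch, with `q_fn` and `q_target_fn` as placeholder callables standing in for the Q-network and the frozen target network, and a `done` flag added for terminal transitions:

```python
import numpy as np

def dqn_loss(q_fn, q_target_fn, batch, gamma):
    """Mean squared TD error over a replay batch.

    q_fn(s) / q_target_fn(s) return a vector of action-values for state s;
    batch is a list of (s, a, r, s_next, done) transitions from replay memory.
    """
    errors = []
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * np.max(q_target_fn(s_next))
        errors.append((target - q_fn(s)[a]) ** 2)
    return np.mean(errors)
```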
Lecture 7: Policy Gradient
Policy objective functions
How do we score policies?
- Start value:
- \(J(\theta) = V_\theta(s_1)\)
- Average value:
- \(J(\theta) = \sum_s d(s) V_\theta(s)\)
- where \(d(s)\) is the stationary distribution of the Markov chain for \(\pi_\theta\)
- Average reward per time step:
- \(J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s\)
Likelihood ratio trick
- \(\nabla_\theta \pi_\theta(s, a)\)
- \(= \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}\)
- \(= \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)\)
- The score function is \(\nabla_\theta \log \pi_\theta(s, a)\)
Policy gradient theorem
For any differentiable policy \(\pi_\theta(s, a)\).
For the start-value, average-value, and average-reward-per-time-step objective functions.
The policy gradient is:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) Q^{\pi_\theta}(s, a) ] \]
One-step MDP (illustration)
- Starting in a state \(s \sim d(s)\).
- Terminating after one time step.
- Given reward \(r = \mathcal{R}^a_s\).
- Compute policy gradient:
- \(J(\theta) = \mathbb{E}_{\pi_\theta}[r]\)
- \(J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s\)
- \(\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s\)
- \(\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s]\)
Monte-Carlo Policy gradient (REINFORCE)
- Update policy using policy gradient.
- Target is the cumulative discounted reward.
- For each episode, for each time step \(t\) in the episode:
- \(\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t\)
- where \(\theta\) is the policy parameters, and \(v_t\) is the cumulative discounted reward.
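A minimal sketch of this update for a (hypothetical) linear softmax policy over discrete actions, where `features[a]` is the feature vector \(\phi(s, a)\) and `G_t` plays the role of \(v_t\):

```python
import numpy as np

def softmax_policy(theta, features):
    """pi_theta(a | s) for a linear softmax policy; features has shape (n_actions, n_params)."""
    logits = features @ theta
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def reinforce_step(theta, features, action, G_t, alpha):
    """theta <- theta + alpha * grad log pi(a|s) * G_t, for one time step of one episode."""
    probs = softmax_policy(theta, features)
    # Score function of a linear softmax policy: phi(s, a) - E_{a' ~ pi}[phi(s, a')].
    grad_log_pi = features[action] - probs @ features
    return theta + alpha * grad_log_pi * G_t
```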
Criticism of REINFORCE
The Monte-Carlo estimates are high variance, which makes learning a policy harder.
Actor-Critic methods
- A critic estimates the Q-values.
- An actor updates its policy in the direction of the Q-values.
- Reduces variance compared to REINFORCE due to bootstrapping.
- We can use actors & critics at different time scales (MC, TD(0), TD(\(\lambda\))).
Compatible Function Approximation theorem
The policy gradient is exact in Actor-Critic methods if:
- The value function approximator is compatible with the policy:
- \(\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)\)
- Value function parameters minimize the mean-squared error.
Advantage function
- Much of the variation in \(Q(s, a)\) comes from the state rather than the action, which adds variance to the policy gradient.
- So instead we use an advantage function, \(A(s, a) = Q(s, a) - V(s)\).
- We don’t need to learn both \(Q\) and \(V\), as we can just use the TD error:
- \(Q(s, a) - V(s)\)
- \(= \mathbb{E}[r + \gamma V(s') | s, a] - V(s)\)
- \(= \delta\)