# Introduction to Reinforcement Learning

Going through lecture series: Introduction to Reinforcement Learning by David Silver

## Lecture 1: Introduction to Reinforcement learning

### RL vs. the rest of ML

- There is no supervisor, only a reward.
- Feedback is delayed.
- Time steps are non-IDD.
- Observations and actions are non-IDD (actions affect observations).

### Reward hypothesis

*All* goals can be described by the
maximization of expected cumulative reward.

### Formulation

At each time step \(t\):

- The agent executes
**action**\(A_t\). - The environment emits
**observation**\(O_t\). - The environment emits
**reward**\(R_t\).

The **history** is the sequence of
actions, observations, and rewards: \[
H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
\]

The **state** is a function of the
history: \[
S_t = f(H_t)
\]

### Environment vs. agent states

- The environment state \(S^e_t\) is the private
representation.
- e.g. the internal memory of the game.
- It may contain irrelevant information.

- The agent state \(S^a_t\) is the agent’s
internal representation.
- Derived from the observations, \(S^a_t = f(H_t)\)

### Fully vs. partially observable environments

Either the environment state is fully observable, \(O_t = S^a_t\), or not \(O_t \neq S^a_t\).

This is the difference between a **Markov
Decision Process** (MDP) and a **Partially
Observable Markov Decision Process** (POMDP).

### RL agent components

- The
**policy**defines the agent’s behavior.- Maps state to action \(a_{t+1} = \pi(s_t)\).
- Or stochastically: \(\pi(a_{t+1} | s_t) = \mathbb{P}[A_{t} = a | S_t = s]\).

- The
**value function**predicts the future reward.- \(v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t + 1} + \gamma^2 R_{t + 1} + ....] | S_t = s\)
- Can also be conditioned on an action to be taken at \(S_t\).

- The
**model**of the environment.- Predicts what the environment will do next.
- \(\mathcal{P}\) predicts next state, \(\mathcal{R}\) predicts next reward.
- \(\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]\)
- \(\mathcal{R}^a_{s} = \mathbb{P}[R_{t+1} = r | S_t = s, A_t = a]\)

Not all agents will explicitly model all components, for example:

- Value based agents infer their policy from their value function.
- Policy based agents will learn the policy directly, and not implement a value function.
- Actor-critic agents will learn both the value function and the policy.

#### Model free vs. model based

**Model free**agents have a policy and/or a value function.**Model based**agents have a model (and could have policy and value function too).

## Lecture 2: Markov Decision Processes

### Markov property

A state \(S_t\) is Markov iff \[ \mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ... S_t] \]

i.e. the future is independent of the past given the present.

i.e. the state is a sufficient statistic of the future.

### State transition matrix

- Given a current state \(s\), get the probability of the next state \(s'\).
- \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]\)
- You can represent \(\mathcal{P}\) as a \((n \times n)\) matrix where \(n\) is the number of states, and each row sums to 1.

### Markov Process/Markov Chain

A Markov Process is a tuple \(\langle\mathcal{S}, \mathcal{P}\rangle\):

- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{P}\) is
a state transition probability matrix.
- \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]\)

### Markov Reward Process

A Markov Reward Process is a a tuple \(\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle\):

- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{R}\) is
a reward function
- \(\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]\)

- \(\gamma\) is a
discount factor.
- \(\gamma \in [0, 1]\)
- \(\gamma = 0\) leads to “myopic” evaluation.
- \(\gamma = 1\) leads to “far-sighted” evaluation.

#### Return

The return \(G_t\) is the total discounted reward from time step \(t\): \[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t + 3} + ... = \sum_{k=0...\infty} \gamma^k R_{t + k + 1} \]

#### Why discount factors

- Mathematically convenient discount functions.
- Avoids infinite cyclic returns.
- Adds uncertainty about the future.

#### Value function

The state value function \(v(s)\) is the expected return starting from \(s\): \[ v(s) = \mathbb{E}[G_t | S_t = s] \]

#### Bellman Equation for MRPs

- \(v(s) = \mathbb{E}[G_t | S_t = s]\)
- \(v(s) = \mathbb{E}[R_t + \gamma G_{t + 1} | S_t = s]\)
- \(v(s) = \mathbb{E}[R_t + \gamma v(S_{t+1}) | S_t = s]\)
- \(v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}v(s')\)

If we regard \(v, \mathcal{P}, \mathcal{R}\) as matrices, we can write: \[ v = \mathcal{R} + \gamma \mathcal{P} v \]

#### Solving the Bellman Equation for small MRPs

- \(v = \mathcal{R} + \gamma \mathcal{P} v\)
- \(v - \gamma \mathcal{P} v = \mathcal{R}\)
- \((I - \gamma \mathcal{P}) v = \mathcal{R}\)
- \(v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}\)

### Markov Decision Process

A Markov Decision Process is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\):

- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{A}\) is a finite set of actions.
- \(\mathcal{P}\) is
a state transition probability matrix.
- \(\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]\)

- \(\mathcal{R}\) is
a reward function
- \(\mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]\)

- \(\gamma\) is a discount factor.

#### Policy

A policy \(\pi\) is a distribution over actions given states: \[ \pi(a | s) = \mathbb{P}[A_t = a | S_t = s] \]

#### Relationship to MPs and MRPs

Given an MDP and a policy:

- The state sequence \(S_1, S_2, ...\) is a Markov Process.
- The state and reward sequence \(S_1, R_1, S_2, R_2, ...\) is a Markov Reward Process.
- Where the reward distribution & state transition distribution are averaged over the policy’s distribution.

#### State-value and action-value functions

- The state-value function:
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]\)

- The action-value function:
- \(q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]\)

#### Bellman Equation for MDPs

We can define \(v\) and \(q\) in terms of each other:

- \(v(s) = \sum_{a} \pi(a | s) q(s, a)\)
- \(q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s')\)

And then we can expand so that they don’t depend on each other:

- \(v(s) = \sum_{a} \pi(a | s) (\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s'))\)
- \(q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \sum_{a} \pi(a | s) q(s, a)\)

#### Optimal value functions

We define optimal policies as:

- \(v_*(s) = \max_{\pi} v_{\pi}(s)\)
- \(q_*(s, a) = \max_{\pi} q_{\pi}(s, a)\)

We define an ordering over policies:

- \(\pi \geq \pi'\) if \(v_{\pi}(s) \geq v_{\pi'}(s), \forall s\)

##### Optimal policy theorem

- There exists an optimal policy \(\pi_*\) that is better than all other policies \(\pi_* \geq \pi, \forall \pi\)
- All optimal policies achieve the optimal state-value function \(v_*\).
- All optimal policies achieve the optimal action-value function \(q_*\).
- We can derive an optimal policy from \(v_*\) or \(q_*\).

##### Bellman optimality equation for optimal value functions

- \(v_*(s) = \max_a q_*(s, a)\)
- \(q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s')\)
- \(v_*(s) = \max_a \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s')\)
- \(q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_a q_*(s', a)\)

##### Solving the Bellman optimality equation

- Non-linear, so no closed-form solution in general.
- We can apply iterative solutions: value iteration, policy iteration, Q-learning, SARSA…

#### Partially Observable Markov Decision Processes (POMDPs)

A POMDP is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle\):

- Equivalent to MDP for \(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\).
- \(\mathcal{O}\) is a finite set of observations.
- \(\mathcal{Z}\) is
an observation function.
- \(\mathcal{Z}^a_{s',o} = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]\)

##### Belief states

- The history is \(H_t = O_1, R_1, A_1, O_2, ..., R_t\).
- The belief state \(b(h)\) is a probability
distribution over states conditioned on \(h\):
- \(b(h) = \mathbb{P}[S_t | h]\)

## Lecture 3: Planning by Dynamic Programming

### Iterative policy evaluation

- Problem: Evaluate a policy \(\pi\).
- Solution: Iteratively apply Bellman equation.
- Method:
- At each iteration \(k + 1\).
- For all states \(s \in \mathcal{S}\).
- Update \(v_{k + 1}(s)\) using \(v_k(s')\), where \(s'\) is the successor state of \(s\).

- This uses
*synchronous*backups. - This is proven to converge.

#### Equation

\[ v_{k + 1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_k \]

### Policy iteration

Given a policy \(\pi\):

- Evaluate the policy using iterative policy evaluation.
- Improve the policy by acting greedily:
- \(\pi' = greedy(v_{\pi})\)
- \(\pi'(s) = argmax_a q_{\pi}(s, a)\)
- \(\pi'(s) = argmax_a \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_{\pi}(s')\)

### Value iteration

Like policy iteration, but we update the value function (and thus the policy) every iteration. There is no explicit policy. \[ v_{k+1} = \max_a \mathcal{R}^a + \gamma \mathcal{P}^a v_k \]

### Asynchronous dynamic programming

#### Synchronous vs. asynchronous

- Synchronous: All states are backed up in parallel.
- Asynchronous: States are backed up individually.
- This reduces computation.
- This is still guaranteed to converge if all states continue to be selected.

#### In-place

In-place dynamic programming updates each state one-by-one, while synchronous dynamic programming updates all states at once.

Practically, this means that your \(v(s')\) values might have been updated a different number of times - but they are more up-to-date.

#### Prioritized sweeping

Select the state to update by the magnitude of the Bellman error.

Practically, this requires knowing the predecessor states.

#### Real-time

Guide the updates by what the agent is currently experiencing.

- After each time step \(S_t, A_t, R_{t + 1}\).
- \(v(S_t) = \max_{a} \mathcal{R}^a_{S_t} + \gamma \sum_{s'} \mathcal{P}^a_{S_ts'} v(s')\)
- Note that we’re still iterating over all actions and successor states.

## Lecture 4: Model-Free Prediction

### Model-free

No access to the underlying MDP: we don’t know state transitions, rewards, etc.

This is opposed to dynamic programming, which solves a known MDP.

### Monte-Carlo Policy Evaluation (MCPE)

- Learn \(v_{\pi}\)
from episodes of experience under \(\pi\).
- \(S_1, A_1, R_2, ... S_k \sim \pi\)

- Estimate the
*empirical*mean return, rather than the*expected*mean return.- \(v(s) = \frac{1}{n} \sum_n \text{samples of values for } s\)

- No bootstrapping: Learns from complete episodes, so all episodes must terminate.

#### First-visit

- The first time \(s\) is visited
**in an episode**:- Increment counter \(N(s) \leftarrow N(s) + 1\).
- Update sum \(S(s)
\leftarrow S(s) + G_t\).
- (where \(G_t\) is the observed cumulative discounted reward for the rest of the episode)

- Value is \(V(s) = S(s) / N(s)\)
- \(V(s) \to v_{\pi}(s)\) as \(N(s) \to \infty\)

#### Every-visit

As first-visit, but \(N\) and \(S\) are updated every time a state is seen.

#### Incremental updates

- Instead of storing \(S\) and \(N\) functions, we can store \(V(s)\) directly:
- We can reframe \(S(s)\) as \(V(s)\)
- \(N(s) \leftarrow N(s) + 1\)
- \(V(s) \leftarrow V(s) + \frac{1}{N(s)}(G_t - V(s))\)

- In non-stationary problems, we can tweak this so
track a running mean & forget old episodes:
- \(V(s) \leftarrow V(s) + \alpha(G_t - V(s))\)
- Where \(\alpha\) is a hyperparameter.

### Temporal-Difference Policy Evaluation (TDPE)

- Like MCPE.
- But learns from
*incomplete episodes*by bootstrapping. - Updates a guess towards a guess.

#### TD(0)

- \(R_{t+1} + \gamma V(S_{t+1})\) is the TD target.
- \(\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\) is the TD error.
- \(V(S_t) \leftarrow V(S_t) + \alpha \delta_t\)

#### TD(lambda)

##### N-step TD

- Define an \(n\)-step return:
- \(G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\)

- \(n\)-step TD
learning:
- \(V(S_t) \leftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_{t + n}))\)

##### Lambda return

- Average over all \(n\)-step returns \(G^{(n)}_t\).
- Use weighting \((1 -
\lambda)\lambda^{n-1}\)
- The sum of the weights is one.

- \(G^{\lambda}_t = (1 - \lambda) \sum_n \lambda^{n-1} G^{(n)}_t\)

##### Forward-view

- \(V(S_t) \leftarrow V(S_t) + \alpha (G^{\lambda}_t - V(S_t))\)
- Looks to the future to compute \(G^{\lambda}_t\)
- Like MC, requires complete episodes.

##### Backward-view

- Same theory as forward-view, but actually practical to implement.
- Update online, every step, from incomplete episodes.
- Keep an
**eligibility trace**for every state \(s\).- \(E_0(s) = 0\)
- \(E_t(s) = \gamma \lambda E_{t-1}(s) + \textbf{1}(S_t = s)\)
- Intuitively, this represents both the
*frequency*and*recency*of the state for credit assignment.

- Update \(V(s)\) in
proportion to \(E(s)\).
- \(\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\)
- \(V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)\)

##### TD(0) and TD(1)

- \(\lambda = 0\) is the same as TD.
- \(\lambda = 1\) is the same as MC every-visit.

### MC vs. TD

- TD can learn before knowing the final outcome.
- TD does not need complete episodes.
- MC has an unbiased estimate of the total return, TD does not.
- TD targets have lower variance than MC, as the total return can depend on many random actions.
- TD is usually more efficient thant MC.
- TD is sensitive to the initial value.
- TD exploits the Markov Property.

## Lecture 5: Model-Free Control

### On-policy vs. off-policy

- On-policy, “learn on the job”:
- Learn about \(\pi\) from experience sampled from \(\pi\).

- Off-policy, “look over someone’s shoulder”:
- Learn about \(\pi\) from experience sampled from \(\pi'\).

### Epsilon-greedy policy improvement

- Idea for continuing exploration.
- Used when the full MDP is not known.
- With probability \(1 - \epsilon\) choose the greedy action.
- With probability \(\epsilon\) choose a random action.

### Greedy in the Limit with Infinite Exploration (GLIE)

- Explore all state-action pairs infinitely many times.
- Converge onto a greedy policy.
- E.g. \(\epsilon\)-greedy is GLIE if \(\epsilon\) converges to zero with \(\epsilon_k = \frac{1}{k}\).

### Monte-Carlo control

- Sample an episode \(S_1, A_1, R_2, ..., S_t \sim \pi\).
- For each time step \(t\):
- \(N(S_t, A_t) \leftarrow N(S_t, A_t) + 1\)
- \(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))\)

- Improve the policy:
- \(\epsilon \leftarrow \frac{1}{k}\)
- \(\pi \leftarrow \epsilon \text{-greedy}(Q)\)

### Sarsa(lambda)

- Like Monte-Carlo control, but using TD(\(\lambda\)).

### Off-policy learning

#### Importance sampling

- Use returns from \(\mu\) to evaluate \(\pi\).
- Weight samples by how similar the two policies are.
- For example, for MC:
- \(G^{\pi / \mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{\mu(A_{t+1} | S_{t+1})} ... G_t\)

- For example, for TD:
- \(\delta^{\pi/\mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \delta_t\)

#### Q-learning

- Target policy \(\pi\) is different from behavior policy \(\mu\).
- But, we don’t use importance sampling.
- Target policy \(\pi\) is greedy wrt. \(Q(s, a)\).
- \(\pi(s) = argmax_a Q(s, a)\)

- Behavior policy \(\mu\) is \(\epsilon\)-greedy wrt. \(Q(s, a)\).
- So, Q-learning target is simplified:
- \(R_{t+1} + \gamma Q(S_{t+1}, A')\)
- \(R_{t+1} + \gamma Q(S_{t+1}, \arg \max_{a'} Q(S_{t+1}, a'))\)
- \(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')\)

## Lecture 6: Value Function Approximation

### Value function approximation

- Tabular methods don’t scale to large numbers of states/actions.
- We use function approximators instead for e.g. \(V(s), Q(s, a)\).
- We use batch-updates for sample efficiency.
- Otherwise, everything else pretty much stays the same.
- Some algorithms aren’t guaranteed to converge, but in practice they do.

### Experience replay

- Gain experience consisting of \(\langle state, value
\rangle\) pairs:
- \(\mathcal{D} = {(s_1, v^{\pi}_1), (s_2, v^{\pi}_2), ...}\)

- Sample batches from \(\mathcal{D}\) and apply gradient descent.

### Deep Q-Networks.

Uses experience replay and fixed Q-targets.

- Take action \(a_t\) according to \(\epsilon\)-greedy policy.
- Store transition \(s_t, a_t, r_{t+1}, s_{t+1}\) in \(\mathcal{D}\).
- Sample random mini-batch of transitions \((s, a, r, s')\) from \(\mathcal{D}\).
- Optimize behavior policy weights \(w_i\) using target policy weights \(w^-_i\):

\[ \mathcal{L}(w_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}} [ (r + \gamma \max_{a'} Q(s', a' ; w^-_i) - Q(s, a; w_i))^2 ] \]

## Lecture 7: Policy Gradient

### Policy objective functions

How do we score policies?

- Start value:
- \(J(\theta) = V_\theta(s_1)\)

- Average value:
- \(J(\theta) = \sum_s d(s) V_\theta(s_1)\)
- where \(d(s)\) is the stationary distribution of the Markov chain for \(\pi_\theta\)

- Average reward per time step:
- \(J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s\)

### Likelihood ratio trick

- \(\nabla_\theta \pi_\theta(s, a)\)
- \(= \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}\)
- \(= \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)\)
- The
**score function**is \(\nabla_\theta \log \pi_\theta(s, a)\)

### Policy gradient theorem

- For any differentiable policy \(\pi_\theta(s, a)\).
- For start value, average value, average reward per time step cost functions.
- The policy gradient is: \[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) Q^{\pi_\theta}(s, a) ] \]

#### One-step MDP (illustration)

- Starting in a state \(s \sim d(s)\).
- Terminating after one time step.
- Given reward \(r = \mathcal{R}^a_s\).
- Compute policy gradient:
- \(J(\theta) = \mathbb{E}_{\pi_\theta}[r]\)
- \(J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s\)
- \(\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s\)
- \(\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d, a \sim \pi_\theta} \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s\)

### Monte-Carlo Policy gradient (REINFORCE)

- Update policy using policy gradient.
- Target is the cumulative discounted reward.
- For each episode, for each time step \(t\) in the episode:
- \(\theta \leftarrow \theta \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t\)
- where \(\theta\) is the policy parameters, and \(v_t\) is the cumulative discounted reward.

#### Criticism of REINFORCE

The Monte-Carlo estimates are high variance, which makes learning a policy harder.

### Actor-Critic methods

- A
**critic**estimates the Q-values. - An
**actor**updates its policy in the direction of the Q-values. - Reduces variance compared to REINFORCE due to bootstrapping.
- We can use actors & critics at different time scales (MC, TD(0), TD(\(\lambda\))).

#### Compatible Function Approximation theorem

The policy gradient is **exact** in
Actor-Critic methods if:

- Value function approximator is compatible to the
policy:
- \(\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)\)

- Value function parameters minimize the mean-squared error.

#### Advantage function

- \(Q(s, a)\) has a lot of variance across different states, but not actions.
- So instead we use an advantage function, \(A(s, a) = Q(s, a) - V(s)\).
- We don’t need to learn both \(Q\) and \(V\), as we can just use the
TD error:
- \(Q(s, a) - V(s)\)
- \(= \mathbb{E}[r + \gamma V(s') | s, a] - V(s)\)
- \(= \delta\)