# Introduction to Reinforcement Learning

Notes from the lecture series Introduction to Reinforcement Learning by David Silver.

## Lecture 1: Introduction to Reinforcement Learning

### RL vs. the rest of ML

• There is no supervisor, only a reward.
• Feedback is delayed.
• Data is sequential, so samples across time steps are non-i.i.d.
• Actions affect the subsequent observations the agent receives.

### Reward hypothesis

All goals can be described by the maximization of expected cumulative reward.

### Formulation

At each time step $$t$$:

• The agent executes action $$A_t$$.
• The environment emits observation $$O_t$$.
• The environment emits reward $$R_t$$.

The history is the sequence of actions, observations, and rewards: $H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t$

The state is a function of the history: $S_t = f(H_t)$
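
A minimal sketch of this interaction loop; the `env` and `agent` objects here are hypothetical stand-ins, not any particular library's API:

```python
# Hypothetical agent-environment loop; `env` and `agent` are stand-ins.
def run_episode(env, agent, max_steps=1000):
    history = []                    # H_t: the stream of observations, rewards, actions
    obs, reward = env.reset(), 0.0
    for t in range(max_steps):
        action = agent.act(obs)                # agent executes A_t
        history.append((obs, reward, action))
        obs, reward, done = env.step(action)   # environment emits O_{t+1}, R_{t+1}
        if done:
            break
    return history
```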

### Environment vs. agent states

• The environment state $$S^e_t$$ is the environment's private representation.
• e.g. the internal memory of the game.
• It may contain information irrelevant to the agent.
• The agent state $$S^a_t$$ is the agent's internal representation.
• Derived from the history: $$S^a_t = f(H_t)$$.

### Fully vs. partially observable environments

Either the environment state is fully observable, $$O_t = S^a_t = S^e_t$$, or only partially observable, in which case $$S^a_t \neq S^e_t$$.

This is the difference between a Markov Decision Process (MDP) and a Partially Observable Markov Decision Process (POMDP).

### RL agent components

• The policy defines the agent’s behavior.
• Deterministically maps state to action: $$a_t = \pi(s_t)$$.
• Or stochastically: $$\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]$$.
• The value function predicts the future reward.
• $$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]$$
• Can also be conditioned on an action to be taken at $$S_t$$ (the action-value function).
• The model of the environment.
• Predicts what the environment will do next.
• $$\mathcal{P}$$ predicts next state, $$\mathcal{R}$$ predicts next reward.
• $$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]$$
• $$\mathcal{R}^a_{s} = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]$$

Not all agents will explicitly model all components, for example:

• Value-based agents infer their policy from their value function.
• Policy-based agents learn the policy directly, without maintaining a value function.
• Actor-critic agents learn both a value function and a policy.

#### Model-free vs. model-based

• Model-free agents have a policy and/or a value function, but no model.
• Model-based agents have a model (and may also have a policy and/or a value function).

## Lecture 2: Markov Decision Processes

### Markov property

A state $$S_t$$ is Markov iff $\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ... S_t]$

i.e. the future is independent of the past given the present.

i.e. the state is a sufficient statistic of the future.

### State transition matrix

• Given a current state $$s$$, get the probability of the next state $$s'$$.
• $$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]$$
• You can represent $$\mathcal{P}$$ as a $$(n \times n)$$ matrix where $$n$$ is the number of states, and each row sums to 1.
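
As a small illustration (the numbers are made up), a three-state transition matrix in numpy:

```python
import numpy as np

# P[s, s'] = P[S_{t+1} = s' | S_t = s]; each row is a distribution, so rows sum to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # an absorbing state
])
assert np.allclose(P.sum(axis=1), 1.0)
```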

### Markov Process/Markov Chain

A Markov Process is a tuple $$\langle\mathcal{S}, \mathcal{P}\rangle$$:

• $$\mathcal{S}$$ is a finite set of states.
• $$\mathcal{P}$$ is a state transition probability matrix.
• $$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]$$

### Markov Reward Process

A Markov Reward Process is a tuple $$\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$$:

• $$\mathcal{S}$$ is a finite set of states.
• $$\mathcal{P}$$ is a state transition probability matrix.
• $$\mathcal{R}$$ is a reward function
• $$\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]$$
• $$\gamma$$ is a discount factor.
• $$\gamma \in [0, 1]$$
• $$\gamma = 0$$ leads to “myopic” evaluation.
• $$\gamma = 1$$ leads to “far-sighted” evaluation.

#### Return

The return $$G_t$$ is the total discounted reward from time step $$t$$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
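
A quick sketch of computing a sampled return (the reward sequence is made up):

```python
# G_t for a sampled reward sequence R_{t+1}, R_{t+2}, ...
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81 * 2 = 2.62
```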

#### Why discount factors

• Mathematically convenient, since it keeps returns finite.
• Avoids infinite returns in cyclic Markov processes.

#### Value function

The state value function $$v(s)$$ is the expected return starting from $$s$$: $v(s) = \mathbb{E}[G_t | S_t = s]$

#### Bellman Equation for MRPs

• $$v(s) = \mathbb{E}[G_t | S_t = s]$$
• $$v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t + 1} | S_t = s]$$
• $$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s]$$
• $$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}v(s')$$

If we regard $$v$$ and $$\mathcal{R}$$ as column vectors with one entry per state, and $$\mathcal{P}$$ as a matrix, we can write: $v = \mathcal{R} + \gamma \mathcal{P} v$

#### Solving the Bellman Equation for small MRPs

• $$v = \mathcal{R} + \gamma \mathcal{P} v$$
• $$v - \gamma \mathcal{P} v = \mathcal{R}$$
• $$(I - \gamma \mathcal{P}) v = \mathcal{R}$$
• $$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$
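
The direct solve is $$O(n^3)$$, hence "small MRPs" only. A sketch with made-up values for $$\mathcal{P}$$ and $$\mathcal{R}$$:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])            # transition matrix (rows sum to 1)
R = np.array([1.0, -1.0])             # R_s = E[R_{t+1} | S_t = s]
# v = (I - gamma P)^{-1} R, via a linear solve rather than an explicit inverse
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)
```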

### Markov Decision Process

A Markov Decision Process is a tuple $$\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$$:

• $$\mathcal{S}$$ is a finite set of states.
• $$\mathcal{A}$$ is a finite set of actions.
• $$\mathcal{P}$$ is a state transition probability matrix.
• $$\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]$$
• $$\mathcal{R}$$ is a reward function
• $$\mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]$$
• $$\gamma$$ is a discount factor.

#### Policy

A policy $$\pi$$ is a distribution over actions given states: $\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]$

#### Relationship to MPs and MRPs

Given an MDP and a policy:

• The state sequence $$S_1, S_2, ...$$ is a Markov Process.
• The state and reward sequence $$S_1, R_2, S_2, R_3, ...$$ is a Markov Reward Process.
• Where the transition matrix and reward function are averaged over the policy: $$\mathcal{P}^{\pi}_{ss'} = \sum_a \pi(a|s) \mathcal{P}^a_{ss'}$$ and $$\mathcal{R}^{\pi}_s = \sum_a \pi(a|s) \mathcal{R}^a_s$$.

#### State-value and action-value functions

• The state-value function:
• $$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]$$
• The action-value function:
• $$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]$$

#### Bellman Equation for MDPs

We can define $$v$$ and $$q$$ in terms of each other:

• $$v(s) = \sum_{a} \pi(a | s) q(s, a)$$
• $$q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s')$$

And then we can expand so that they don’t depend on each other:

• $$v(s) = \sum_{a} \pi(a | s) (\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s'))$$
• $$q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \sum_{a'} \pi(a' | s') q(s', a')$$

#### Optimal value functions

We define the optimal value functions as:

• $$v_*(s) = \max_{\pi} v_{\pi}(s)$$
• $$q_*(s, a) = \max_{\pi} q_{\pi}(s, a)$$

We define an ordering over policies:

• $$\pi \geq \pi'$$ if $$v_{\pi}(s) \geq v_{\pi'}(s), \forall s$$
##### Optimal policy theorem
• There exists an optimal policy $$\pi_*$$ that is at least as good as all other policies: $$\pi_* \geq \pi, \forall \pi$$.
• All optimal policies achieve the optimal state-value function $$v_*$$.
• All optimal policies achieve the optimal action-value function $$q_*$$.
• We can derive an optimal policy from $$v_*$$ or $$q_*$$.
##### Bellman optimality equation for optimal value functions
• $$v_*(s) = \max_a q_*(s, a)$$
• $$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s')$$
• $$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s') \right)$$
• $$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')$$
##### Solving the Bellman optimality equation
• Non-linear, so no closed-form solution in general.
• We can apply iterative solutions: value iteration, policy iteration, Q-learning, SARSA…

#### Partially Observable Markov Decision Processes (POMDPs)

A POMDP is a tuple $$\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle$$:

• Equivalent to MDP for $$\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma$$.
• $$\mathcal{O}$$ is a finite set of observations.
• $$\mathcal{Z}$$ is an observation function.
• $$\mathcal{Z}^a_{s',o} = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]$$
##### Belief states
• The history is $$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$, as before.
• The belief state $$b(h)$$ is a probability distribution over states conditioned on $$h$$:
• $$b(h) = \mathbb{P}[S_t | h]$$

## Lecture 3: Planning by Dynamic Programming

### Iterative policy evaluation

• Problem: Evaluate a policy $$\pi$$.
• Solution: Iteratively apply Bellman equation.
• Method:
• At each iteration $$k + 1$$.
• For all states $$s \in \mathcal{S}$$.
• Update $$v_{k + 1}(s)$$ using $$v_k(s')$$, where $$s'$$ ranges over the successor states of $$s$$.
• This uses synchronous backups.
• This is proven to converge.

#### Equation

$v_{k + 1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_k$
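
A sketch of this, assuming the policy has already been averaged into a transition matrix `P_pi` and reward vector `R_pi` (the names are mine):

```python
import numpy as np

# Iterative policy evaluation with synchronous backups:
# v_{k+1} = R_pi + gamma * P_pi v_k, iterated until convergence.
def evaluate_policy(P_pi, R_pi, gamma=0.9, tol=1e-8):
    v = np.zeros(len(R_pi))
    while True:
        v_next = R_pi + gamma * P_pi @ v   # back up all states at once
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next
```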

### Policy iteration

Given a policy $$\pi$$:

• Evaluate the policy using iterative policy evaluation.
• Improve the policy by acting greedily:
• $$\pi' = \text{greedy}(v_{\pi})$$
• $$\pi'(s) = \arg\max_a q_{\pi}(s, a)$$
• $$\pi'(s) = \arg\max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_{\pi}(s') \right)$$

### Value iteration

Like policy iteration, but the value function (and thus the implicit policy) is updated every iteration, and no explicit policy is maintained: $v_{k+1} = \max_a \left( \mathcal{R}^a + \gamma \mathcal{P}^a v_k \right)$
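
A sketch, assuming array layouts of `(actions, states, states)` for $$\mathcal{P}$$ and `(states, actions)` for $$\mathcal{R}$$ (the layout is my choice for illustration):

```python
import numpy as np

# Value iteration: v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ].
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] for all states and actions in one synchronous backup
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        v_next = q.max(axis=1)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next, q.argmax(axis=1)   # optimal values and a greedy policy
        v = v_next
```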

### Asynchronous dynamic programming

#### Synchronous vs. asynchronous

• Synchronous: All states are backed up in parallel.
• Asynchronous: States are backed up individually.
• This reduces computation.
• This is still guaranteed to converge if all states continue to be selected.

#### In-place

In-place dynamic programming updates each state one-by-one, while synchronous dynamic programming updates all states at once.

Practically, this means the $$v(s')$$ values used in a backup may have been updated different numbers of times, but they are always the most up-to-date values available.

#### Prioritized sweeping

Select the state to update by the magnitude of the Bellman error.

Practically, this requires knowing the predecessor states.

#### Real-time

Guide the updates by what the agent is currently experiencing.

• After each time step $$S_t, A_t, R_{t + 1}$$.
• $$v(S_t) \leftarrow \max_{a} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s'} \mathcal{P}^a_{S_ts'} v(s') \right)$$
• Note that we’re still iterating over all actions and successor states.

## Lecture 4: Model-Free Prediction

### Model-free

No access to the underlying MDP: we don’t know state transitions, rewards, etc.

This is opposed to dynamic programming, which solves a known MDP.

### Monte-Carlo Policy Evaluation (MCPE)

• Learn $$v_{\pi}$$ from episodes of experience under $$\pi$$.
• $$S_1, A_1, R_2, ... S_k \sim \pi$$
• Estimate the value as the empirical mean return, rather than computing the expected return.
• $$V(s)$$ is the mean of the returns sampled from $$s$$.
• No bootstrapping: Learns from complete episodes, so all episodes must terminate.

#### First-visit

• The first time $$s$$ is visited in an episode:
• Increment counter $$N(s) \leftarrow N(s) + 1$$.
• Update sum $$S(s) \leftarrow S(s) + G_t$$.
• (where $$G_t$$ is the observed cumulative discounted reward for the rest of the episode)
• Value is $$V(s) = S(s) / N(s)$$
• $$V(s) \to v_{\pi}(s)$$ as $$N(s) \to \infty$$

#### Every-visit

As first-visit, but $$N$$ and $$S$$ are updated every time a state is seen.

• Instead of storing $$S$$ and $$N$$ functions, we can store $$V(s)$$ directly:
• i.e. rewrite the running mean incrementally in terms of $$V(s)$$:
• $$N(s) \leftarrow N(s) + 1$$
• $$V(s) \leftarrow V(s) + \frac{1}{N(s)}(G_t - V(s))$$
• In non-stationary problems, we can tweak this to track a running mean and forget old episodes:
• $$V(s) \leftarrow V(s) + \alpha(G_t - V(s))$$
• Where $$\alpha$$ is a hyperparameter.
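
A sketch of the incremental every-visit update over one finished episode; the pairing of states with rewards and the dict-based tables are my assumptions:

```python
from collections import defaultdict

# Incremental every-visit MC evaluation for one finished episode.
# `episode` is a list of (S_t, R_{t+1}) pairs; V and N are defaultdicts.
def mc_update(V, N, episode, gamma=0.9, alpha=None):
    G = 0.0
    # Walk backwards so G accumulates G_t = R_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] += 1
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])   # running mean, or constant-alpha

V, N = defaultdict(float), defaultdict(int)
mc_update(V, N, [("s0", 0.0), ("s1", 1.0)], gamma=0.9)
```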

### Temporal-Difference Policy Evaluation (TDPE)

• Like MCPE.
• But learns from incomplete episodes by bootstrapping.
• Updates a guess towards a guess.

#### TD(0)

• $$R_{t+1} + \gamma V(S_{t+1})$$ is the TD target.
• $$\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)$$ is the TD error.
• $$V(S_t) \leftarrow V(S_t) + \alpha \delta_t$$
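
A minimal sketch of the TD(0) update, with `V` as a dict-like table (my naming):

```python
# One TD(0) update from a single observed transition (s, r, s_next).
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    td_target = r + (0.0 if done else gamma * V[s_next])
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error
```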

#### TD($$\lambda$$)

##### N-step TD
• Define an $$n$$-step return:
• $$G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$
• $$n$$-step TD learning:
• $$V(S_t) \leftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_t))$$
##### Lambda return
• Average over all $$n$$-step returns $$G^{(n)}_t$$.
• Use weighting $$(1 - \lambda)\lambda^{n-1}$$
• The sum of the weights is one.
• $$G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$$
##### Forward-view
• $$V(S_t) \leftarrow V(S_t) + \alpha (G^{\lambda}_t - V(S_t))$$
• Looks to the future to compute $$G^{\lambda}_t$$
• Like MC, requires complete episodes.
##### Backward-view
• Same theory as forward-view, but actually practical to implement.
• Update online, every step, from incomplete episodes.
• Keep an eligibility trace for every state $$s$$.
• $$E_0(s) = 0$$
• $$E_t(s) = \gamma \lambda E_{t-1}(s) + \textbf{1}(S_t = s)$$
• Intuitively, this represents both the frequency and recency of the state for credit assignment.
• Update $$V(s)$$ in proportion to $$E(s)$$.
• $$\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)$$
• $$V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$$
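
A sketch of one backward-view step, with `V` and `E` as defaultdicts (my naming):

```python
from collections import defaultdict

# One online backward-view TD(lambda) step; V and E are defaultdict(float).
def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8, done=False):
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
    for state in list(E):
        E[state] *= gamma * lam       # decay all traces
    E[s] += 1.0                       # bump the trace for the current state
    for state, trace in E.items():
        V[state] += alpha * delta * trace
    return delta
```
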
##### TD(0) and TD(1)
• $$\lambda = 0$$ is equivalent to TD(0).
• $$\lambda = 1$$ is roughly equivalent to every-visit MC (exactly so for offline updates).

### MC vs. TD

• TD can learn before knowing the final outcome.
• TD does not need complete episodes.
• The MC target (the return) is an unbiased estimate of $$v_{\pi}(S_t)$$; the TD target is biased.
• TD targets have much lower variance than MC targets: the return depends on many random actions, transitions, and rewards, while the TD target depends on only one of each.
• TD is usually more efficient than MC.
• TD is sensitive to the initial value.
• TD exploits the Markov Property.

## Lecture 5: Model-Free Control

### On-policy vs. off-policy

• On-policy, “learn on the job”:
• Learn about $$\pi$$ from experience sampled from $$\pi$$.
• Off-policy, “look over someone’s shoulder”:
• Learn about target policy $$\pi$$ from experience sampled from a different behavior policy $$\mu$$.

### Epsilon-greedy policy improvement

• Idea for continuing exploration.
• Used when the full MDP is not known.
• With probability $$1 - \epsilon$$ choose the greedy action.
• With probability $$\epsilon$$ choose a random action.
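
A minimal sketch, with `Q` as a table keyed by `(state, action)` pairs (an assumed layout):

```python
import random

# epsilon-greedy selection over a Q table keyed by (state, action).
def epsilon_greedy(Q, s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # act greedily
```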

### Greedy in the Limit with Infinite Exploration (GLIE)

• Explore all state-action pairs infinitely many times.
• Converge onto a greedy policy.
• E.g. $$\epsilon$$-greedy is GLIE if $$\epsilon$$ decays to zero at a suitable rate, e.g. $$\epsilon_k = \frac{1}{k}$$.

### Monte-Carlo control

• Sample an episode $$S_1, A_1, R_2, ..., S_T \sim \pi$$.
• For each time step $$t$$:
• $$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$$
• $$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))$$
• Improve the policy:
• $$\epsilon \leftarrow \frac{1}{k}$$
• $$\pi \leftarrow \epsilon \text{-greedy}(Q)$$
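
A sketch of the update for one finished episode of `(state, action, reward)` triples (an assumed layout; every-visit variant):

```python
from collections import defaultdict

# One GLIE MC control update from a finished episode of (S_t, A_t, R_{t+1})
# triples; k is the episode counter used to decay epsilon.
def glie_mc_update(Q, N, episode, k, gamma=0.9):
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                         # return from time t
        N[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return 1.0 / k                                # epsilon for the next episode

Q, N = defaultdict(float), defaultdict(int)
```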

### Off-policy learning

#### Importance sampling

• Use returns sampled from behavior policy $$\mu$$ to evaluate target policy $$\pi$$.
• Weight samples by how similar the two policies are.
• For example, for MC:
• $$G^{\pi / \mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{\mu(A_{t+1} | S_{t+1})} ... G_t$$
• For example, for TD:
• $$\delta^{\pi/\mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \delta_t$$

#### Q-learning

• Target policy $$\pi$$ is different from behavior policy $$\mu$$.
• But, we don’t use importance sampling.
• Target policy $$\pi$$ is greedy wrt. $$Q(s, a)$$.
• $$\pi(s) = argmax_a Q(s, a)$$
• Behavior policy $$\mu$$ is $$\epsilon$$-greedy wrt. $$Q(s, a)$$.
• So, Q-learning target is simplified:
• $$R_{t+1} + \gamma Q(S_{t+1}, A')$$
• $$R_{t+1} + \gamma Q(S_{t+1}, \arg \max_{a'} Q(S_{t+1}, a'))$$
• $$R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$$
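
A sketch of one Q-learning update under these choices (the table layout is an assumption):

```python
# One Q-learning update: behave epsilon-greedily, learn about the greedy policy.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    max_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * max_next           # R_{t+1} + gamma * max_a' Q(S_{t+1}, a')
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```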

## Lecture 6: Value Function Approximation

### Value function approximation

• Tabular methods don’t scale to large numbers of states/actions.
• We use function approximators instead for e.g. $$V(s), Q(s, a)$$.
• We use batch-updates for sample efficiency.
• Otherwise, everything else pretty much stays the same.
• Some algorithms lose their convergence guarantees under function approximation, but in practice they usually still converge.

### Experience replay

• Gain experience consisting of $$\langle state, value \rangle$$ pairs:
• $$\mathcal{D} = \{(s_1, v^{\pi}_1), (s_2, v^{\pi}_2), ...\}$$
• Sample batches from $$\mathcal{D}$$ and apply gradient descent.

### Deep Q-Networks

Uses experience replay and fixed Q-targets.

• Take action $$a_t$$ according to $$\epsilon$$-greedy policy.
• Store transition $$s_t, a_t, r_{t+1}, s_{t+1}$$ in $$\mathcal{D}$$.
• Sample random mini-batch of transitions $$(s, a, r, s')$$ from $$\mathcal{D}$$.
• Optimize the Q-network weights $$w_i$$ against frozen target-network weights $$w^-_i$$:

$\mathcal{L}(w_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}} [ (r + \gamma \max_{a'} Q(s', a' ; w^-_i) - Q(s, a; w_i))^2 ]$
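
A sketch of the fixed-target computation for a sampled mini-batch; `target_net` is a hypothetical callable mapping a batch of states to per-action values, not any particular library's API:

```python
# Hypothetical sketch: `target_net` returns Q(s', . ; w^-) for a batch of
# states; r and done are arrays over the mini-batch.
def dqn_targets(target_net, r, s_next, done, gamma=0.99):
    q_next = target_net(s_next)                 # shape (batch, n_actions)
    max_next = q_next.max(axis=1)               # max_a' Q(s', a'; w^-)
    return r + gamma * max_next * (1.0 - done)  # fixed Q-target

# The loss L(w_i) is the mean squared error between these targets and
# Q(s, a; w_i) from the online network, minimized w.r.t. w_i only;
# the target weights w^- are periodically copied from w.
```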

## Lecture 7: Policy Gradient Methods

### Policy objective functions

How do we score policies?

• Start value:
• $$J(\theta) = V_\theta(s_1)$$
• Average value:
• $$J(\theta) = \sum_s d(s) V_\theta(s)$$
• where $$d(s)$$ is the stationary distribution of the Markov chain for $$\pi_\theta$$
• Average reward per time step:
• $$J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s$$

### Likelihood ratio trick

• $$\nabla_\theta \pi_\theta(s, a)$$
• $$= \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$$
• $$= \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$$
• The score function is $$\nabla_\theta \log \pi_\theta(s, a)$$

#### Policy gradient theorem

• For any differentiable policy $$\pi_\theta(s, a)$$.
• For any of the start value, average value, or average reward per time step objectives.
• The policy gradient is: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) Q^{\pi_\theta}(s, a) ]$

#### One-step MDP (illustration)

• Starting in a state $$s \sim d(s)$$.
• Terminating after one time step.
• Given reward $$r = \mathcal{R}^a_s$$.
• $$J(\theta) = \mathbb{E}_{\pi_\theta}[r]$$
• $$J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s$$
• $$\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s$$
• $$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d, a \sim \pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s ]$$

#### REINFORCE (Monte-Carlo policy gradient)

• Update the policy parameters by stochastic gradient ascent on the policy gradient.
• The target $$v_t$$ is the sampled cumulative discounted reward (the return).
• For each episode, for each time step $$t$$ in the episode:
• $$\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t$$
• where $$\theta$$ are the policy parameters and $$v_t$$ is the return from time step $$t$$.
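
A sketch of REINFORCE with a tabular softmax policy, for which the score function has the closed form $$\text{one-hot}(a) - \pi(\cdot|s)$$; the tabular parameterization is my choice for illustration:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# REINFORCE with a tabular softmax policy: theta has shape (n_states, n_actions)
# and grad_theta[s] log pi(a|s) = one_hot(a) - pi(.|s).
def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    # episode: list of (S_t, A_t, R_{t+1}) triples with integer states/actions
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for (s, a, _), v_t in zip(episode, returns):
        pi = softmax(theta[s])
        score = -pi
        score[a] += 1.0                  # score function for the chosen action
        theta[s] += alpha * score * v_t  # theta <- theta + alpha * score * v_t
```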

#### Criticism of REINFORCE

The Monte-Carlo estimates are high variance, which makes learning a policy harder.

### Actor-Critic methods

• A critic estimates the Q-values.
• An actor updates its policy in the direction of the Q-values.
• Reduces variance compared to REINFORCE due to bootstrapping.
• We can use actors & critics at different time scales (MC, TD(0), TD($$\lambda$$)).

#### Compatible Function Approximation theorem

The policy gradient is exact in Actor-Critic methods if:

• The value function approximator is compatible with the policy:
• $$\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)$$
• Value function parameters minimize the mean-squared error.

#### Advantage function

• $$Q(s, a)$$ varies a lot across states, and this state-dependent magnitude inflates the variance of the gradient estimate.
• Subtracting a state-dependent baseline gives the advantage function $$A(s, a) = Q(s, a) - V(s)$$, which reduces variance without changing the expected gradient.
• We don't need to learn both $$Q$$ and $$V$$: the TD error $$\delta$$ is an unbiased sample of the advantage:
• $$\mathbb{E}[\delta | s, a] = \mathbb{E}[r + \gamma V(s') | s, a] - V(s) = Q(s, a) - V(s)$$
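
A sketch of one actor-critic step using the TD error as the advantage estimate, with a tabular softmax actor and a tabular critic (the parameterization is my choice):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# One actor-critic step: the critic is a tabular V (array indexed by state),
# the actor a tabular softmax policy; the TD error delta samples A(s, a).
def actor_critic_step(theta, V, s, a, r, s_next, alpha_actor=0.01,
                      alpha_critic=0.1, gamma=0.99, done=False):
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]  # TD error
    V[s] += alpha_critic * delta                             # critic update
    pi = softmax(theta[s])
    score = -pi
    score[a] += 1.0                                          # score function
    theta[s] += alpha_actor * score * delta                  # actor update
    return delta
```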