Introduction to Reinforcement Learning
Going through lecture series: Introduction to Reinforcement Learning by David Silver
Lecture 1: Introduction to Reinforcement Learning
RL vs. the rest of ML
- There is no supervisor, only a reward.
- Feedback is delayed.
- Data is sequential, not i.i.d.: successive time steps are correlated.
- The agent's actions affect the subsequent observations it receives.
Reward hypothesis
All goals can be described by the maximization of expected cumulative reward.
Formulation
At each time step \(t\):
- The agent executes action \(A_t\).
- The environment emits observation \(O_t\).
- The environment emits reward \(R_t\).
The history is the sequence of actions, observations, and rewards:
\[ H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t \]
The state is a function of the history:
\[ S_t = f(H_t) \]
Environment vs. agent states
- The environment state \(S^e_t\) is the environment's private representation.
- e.g. the internal memory of the game.
- It may contain irrelevant information.
- The agent state \(S^a_t\) is the agent's internal representation.
- Derived from the history, \(S^a_t = f(H_t)\).
Fully vs. partially observable environments
The environment is fully observable when the agent directly observes the environment state, \(O_t = S^a_t = S^e_t\); otherwise it is only partially observable, \(S^a_t \neq S^e_t\).
This is the difference between a Markov Decision Process (MDP) and a Partially Observable Markov Decision Process (POMDP).
RL agent components
- The policy defines the agent's behavior.
- Maps state to action: \(a = \pi(s)\).
- Or stochastically: \(\pi(a | s) = \mathbb{P}[A_t = a | S_t = s]\).
- The value function predicts the future reward.
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... \mid S_t = s]\)
- Can also be conditioned on an action to be taken at \(S_t\).
- The model of the environment.
- Predicts what the environment will do next.
- \(\mathcal{P}\) predicts next state, \(\mathcal{R}\) predicts next reward.
- \(\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]\)
- \(\mathcal{R}^a_{s} = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]\)
Not all agents will explicitly model all components, for example:
- Value-based agents derive their policy implicitly from their value function.
- Policy-based agents learn the policy directly and do not learn a value function.
- Actor-critic agents will learn both the value function and the policy.
Model-free vs. model-based
- Model-free agents have a policy and/or a value function, but no model.
- Model-based agents have a model (and possibly a policy and value function too).
Lecture 2: Markov Decision Processes
Markov property
A state \(S_t\) is Markov iff \[ \mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1, ..., S_t] \]
i.e. the future is independent of the past given the present.
i.e. the state is a sufficient statistic of the future.
State transition matrix
- Given a current state \(s\), get the probability of the next state \(s'\).
- \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]\)
- You can represent \(\mathcal{P}\) as a \((n \times n)\) matrix where \(n\) is the number of states, and each row sums to 1.
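For instance, a toy two-state chain (hypothetical numbers) can be written directly as such a matrix:

```python
import numpy as np

# Hypothetical two-state Markov chain: row = current state, column = next state.
P = np.array([
    [0.9, 0.1],   # from state 0: stay with prob. 0.9, move to state 1 with prob. 0.1
    [0.5, 0.5],   # from state 1: move to either state with equal probability
])

assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution
```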
Markov Process/Markov Chain
A Markov Process is a tuple \(\langle\mathcal{S}, \mathcal{P}\rangle\):
- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s]\)
Markov Reward Process
A Markov Reward Process is a tuple \(\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle\):
- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{R}\) is a reward function.
- \(\mathcal{R}_s = \mathbb{E}[R_{t+1} | S_t = s]\)
- \(\gamma\) is a discount factor.
- \(\gamma \in [0, 1]\)
- \(\gamma = 0\) leads to “myopic” evaluation.
- \(\gamma = 1\) leads to “far-sighted” evaluation.
Return
The return \(G_t\) is the total discounted reward from time step \(t\):
\[ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t + 3} + ... = \sum_{k=0}^{\infty} \gamma^k R_{t + k + 1} \]
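A quick sanity check of the definition, using a hypothetical constant reward stream to see the geometric sum converge:

```python
def discounted_return(rewards, gamma):
    """G_t given the rewards R_{t+1}, R_{t+2}, ... observed after time step t."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# A reward of 1 at every step with gamma = 0.5 gives a return close to 1 / (1 - 0.5) = 2.
print(discounted_return([1.0] * 50, gamma=0.5))
```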
Why discount factors
- Mathematically convenient (keeps returns finite and well-behaved).
- Avoids infinite returns in cyclic Markov processes.
- Reflects uncertainty about the future.
Value function
The state value function \(v(s)\) is the expected return starting from \(s\):
\[ v(s) = \mathbb{E}[G_t | S_t = s] \]
Bellman Equation for MRPs
- \(v(s) = \mathbb{E}[G_t | S_t = s]\)
- \(v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t + 1} | S_t = s]\)
- \(v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) | S_t = s]\)
- \(v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}v(s')\)
If we stack \(v\) and \(\mathcal{R}\) into column vectors and regard \(\mathcal{P}\) as a matrix, we can write:
\[ v = \mathcal{R} + \gamma \mathcal{P} v \]
Solving the Bellman Equation for small MRPs
- \(v = \mathcal{R} + \gamma \mathcal{P} v\)
- \(v - \gamma \mathcal{P} v = \mathcal{R}\)
- \((I - \gamma \mathcal{P}) v = \mathcal{R}\)
- \(v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}\)
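A minimal sketch of this direct solution in numpy, assuming the MRP is given as a reward vector and a transition matrix (hypothetical two-state example):

```python
import numpy as np

def solve_mrp(R, P, gamma):
    """Exact MRP value function: solve (I - gamma * P) v = R."""
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Hypothetical two-state MRP: state 1 is absorbing and gives no reward.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
print(solve_mrp(R, P, gamma=0.9))   # v(0) ~ 5.26, v(1) = 0
```

Using `np.linalg.solve` avoids forming the inverse explicitly; the roughly cubic cost in the number of states is why this direct solution only works for small MRPs.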
Markov Decision Process
A Markov Decision Process is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle\):
- \(\mathcal{S}\) is a finite set of states.
- \(\mathcal{A}\) is a finite set of actions.
- \(\mathcal{P}\) is a state transition probability matrix.
- \(\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t = a]\)
- \(\mathcal{R}\) is a reward function.
- \(\mathcal{R}^a_s = \mathbb{E}[R_{t+1} | S_t = s, A_t = a]\)
- \(\gamma\) is a discount factor.
Policy
A policy \(\pi\) is a distribution over actions given states:
\[ \pi(a | s) = \mathbb{P}[A_t = a | S_t = s] \]
Relationship to MPs and MRPs
Given an MDP and a policy:
- The state sequence \(S_1, S_2, ...\) is a Markov Process.
- The state and reward sequence \(S_1, R_2, S_2, R_3, ...\) is a Markov Reward Process.
- Where the reward function and state transition matrix are averaged over the policy's action distribution.
State-value and action-value functions
- The state-value function:
- \(v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s]\)
- The action-value function:
- \(q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]\)
Bellman Equation for MDPs
We can define \(v\) and \(q\) in terms of each other:
- \(v(s) = \sum_{a} \pi(a | s) q(s, a)\)
- \(q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s')\)
And then we can expand so that they don’t depend on each other:
- \(v(s) = \sum_{a} \pi(a | s) (\mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v(s'))\)
- \(q(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \sum_{a'} \pi(a' | s') q(s', a')\)
Optimal value functions
We define the optimal value functions as:
- \(v_*(s) = \max_{\pi} v_{\pi}(s)\)
- \(q_*(s, a) = \max_{\pi} q_{\pi}(s, a)\)
We define an ordering over policies:
- \(\pi \geq \pi'\) if \(v_{\pi}(s) \geq v_{\pi'}(s), \forall s\)
Optimal policy theorem
- There exists an optimal policy \(\pi_*\) that is better than all other policies \(\pi_* \geq \pi, \forall \pi\)
- All optimal policies achieve the optimal state-value function \(v_*\).
- All optimal policies achieve the optimal action-value function \(q_*\).
- We can derive an optimal policy from \(v_*\) or \(q_*\).
Bellman optimality equation for optimal value functions
- \(v_*(s) = \max_a q_*(s, a)\)
- \(q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s')\)
- \(v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_*(s') \right)\)
- \(q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} \max_{a'} q_*(s', a')\)
Solving the Bellman optimality equation
- Non-linear, so no closed-form solution in general.
- We can apply iterative solutions: value iteration, policy iteration, Q-learning, SARSA…
Partially Observable Markov Decision Processes (POMDPs)
A POMDP is a tuple \(\langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma \rangle\):
- Equivalent to MDP for \(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma\).
- \(\mathcal{O}\) is a finite set of observations.
- \(\mathcal{Z}\) is an observation function.
- \(\mathcal{Z}^a_{s',o} = \mathbb{P}[O_{t+1} = o | S_{t+1} = s', A_t = a]\)
Belief states
- The history is \(H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t\).
- The belief state \(b(h)\) is a probability distribution over states conditioned on the history \(h\):
- \(b(h) = (\mathbb{P}[S_t = s^1 | H_t = h], ..., \mathbb{P}[S_t = s^n | H_t = h])\)
Lecture 3: Planning by Dynamic Programming
Iterative policy evaluation
- Problem: Evaluate a policy \(\pi\).
- Solution: Iteratively apply Bellman equation.
- Method:
- At each iteration \(k + 1\).
- For all states \(s \in \mathcal{S}\).
- Update \(v_{k + 1}(s)\) from \(v_k(s')\), where \(s'\) ranges over the successor states of \(s\).
- This uses synchronous backups.
- This is proven to converge.
Equation
\[ v_{k + 1} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_k \]
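A minimal sketch of the synchronous backup, assuming the policy has already been folded into a reward vector \(\mathcal{R}^{\pi}\) and transition matrix \(\mathcal{P}^{\pi}\):

```python
import numpy as np

def iterative_policy_evaluation(R_pi, P_pi, gamma, tol=1e-8):
    """Repeat the synchronous backup v_{k+1} = R^pi + gamma * P^pi v_k until it converges."""
    v = np.zeros(len(R_pi))
    while True:
        v_new = R_pi + gamma * P_pi @ v   # every state updated from the previous iterate
        if np.max(np.abs(v_new - v)) < tol:   # assumes gamma < 1 so the backup contracts
            return v_new
        v = v_new
```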
Policy iteration
Given a policy \(\pi\):
- Evaluate the policy using iterative policy evaluation.
- Improve the policy by acting greedily:
- \(\pi' = \text{greedy}(v_{\pi})\)
- \(\pi'(s) = \arg\max_a q_{\pi}(s, a)\)
- \(\pi'(s) = \arg\max_a \left( \mathcal{R}^a_s + \gamma \sum_{s'} \mathcal{P}^a_{ss'} v_{\pi}(s') \right)\)
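A sketch of the greedy improvement step, assuming (hypothetically) that rewards are stored as an (actions × states) array `R` and transitions as an (actions × states × states) array `P`:

```python
import numpy as np

def greedy_improvement(v, R, P, gamma):
    """pi'(s) = argmax_a ( R[a, s] + gamma * sum_s' P[a, s, s'] * v[s'] )."""
    q = R + gamma * P @ v      # action-values, shape (n_actions, n_states)
    return q.argmax(axis=0)    # index of the greedy action for every state
```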
Value iteration
Like policy iteration, but we update the value function (and thus the policy) every iteration. There is no explicit policy.
\[ v_{k+1} = \max_a \left( \mathcal{R}^a + \gamma \mathcal{P}^a v_k \right) \]
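Value iteration is then a small extension of the same sketch (same assumed array layout as above):

```python
import numpy as np

def value_iteration(R, P, gamma, n_iters=1000):
    """Apply the Bellman optimality backup repeatedly; no explicit policy is stored."""
    v = np.zeros(R.shape[1])
    for _ in range(n_iters):
        q = R + gamma * P @ v    # action-values, shape (n_actions, n_states)
        v = q.max(axis=0)        # max over actions, per state
    return v
```

A greedy policy can be read off at the end with `q.argmax(axis=0)`, as in the policy-improvement sketch above.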
Asynchronous dynamic programming
Synchronous vs. asynchronous
- Synchronous: All states are backed up in parallel.
- Asynchronous: States are backed up individually.
- This reduces computation.
- This is still guaranteed to converge if all states continue to be selected.
In-place
In-place dynamic programming keeps a single copy of the value function and uses the latest values immediately, while synchronous dynamic programming computes every new value from the previous sweep's copy.
Practically, this means the \(v(s')\) values used in a backup may have been updated a different number of times, but they are more up-to-date.
Prioritized sweeping
Select the state to update by the magnitude of the Bellman error.
Practically, this requires knowing the predecessor states.
Real-time
Guide the updates by what the agent is currently experiencing.
- After each time step \(S_t, A_t, R_{t + 1}\).
- \(v(S_t) \leftarrow \max_{a} \left( \mathcal{R}^a_{S_t} + \gamma \sum_{s'} \mathcal{P}^a_{S_t s'} v(s') \right)\)
- Note that we’re still iterating over all actions and successor states.
Lecture 4: Model-Free Prediction
Model-free
No access to the underlying MDP: we don’t know state transitions, rewards, etc.
This is opposed to dynamic programming, which solves a known MDP.
Monte-Carlo Policy Evaluation (MCPE)
- Learn \(v_{\pi}\)
from episodes of experience under \(\pi\).
- \(S_1, A_1, R_2, ... S_k \sim \pi\)
- Estimate the empirical mean return rather than the expected return.
- \(V(s)\) is the mean of the sampled returns observed from \(s\).
- No bootstrapping: Learns from complete episodes, so all episodes must terminate.
First-visit
- The first time \(s\) is visited in an episode:
- Increment counter \(N(s) \leftarrow N(s) + 1\).
- Update sum \(S(s) \leftarrow S(s) + G_t\).
- (where \(G_t\) is the observed cumulative discounted reward for the rest of the episode)
- Value is \(V(s) = S(s) / N(s)\)
- \(V(s) \to v_{\pi}(s)\) as \(N(s) \to \infty\)
Every-visit
As first-visit, but \(N\) and \(S\) are updated every time a state is seen.
Incremental updates
- Instead of storing the sum \(S(s)\), we can maintain \(V(s)\) directly as a running mean:
- \(N(s) \leftarrow N(s) + 1\)
- \(V(s) \leftarrow V(s) + \frac{1}{N(s)}(G_t - V(s))\)
- In non-stationary problems, we can tweak this to track a running mean and forget old episodes:
- \(V(s) \leftarrow V(s) + \alpha(G_t - V(s))\)
- Where \(\alpha\) is a hyperparameter.
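A minimal sketch of both incremental updates, assuming \(V\) and \(N\) are plain dicts keyed by state:

```python
def mc_update(V, N, state, G, alpha=None):
    """Incremental Monte-Carlo update for one observed return G from `state`.

    alpha=None gives the exact running mean (step size 1/N(s)); a fixed alpha
    gives the forgetful running mean used for non-stationary problems.
    """
    N[state] = N.get(state, 0) + 1
    step = alpha if alpha is not None else 1.0 / N[state]
    V[state] = V.get(state, 0.0) + step * (G - V.get(state, 0.0))
```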
Temporal-Difference Policy Evaluation (TDPE)
- Like MCPE.
- But learns from incomplete episodes by bootstrapping.
- Updates a guess towards a guess.
TD(0)
- \(R_{t+1} + \gamma V(S_{t+1})\) is the TD target.
- \(\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\) is the TD error.
- \(V(S_t) \leftarrow V(S_t) + \alpha \delta_t\)
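A minimal sketch of the TD(0) backup, assuming \(V\) is a dict of state values (missing states default to 0):

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Move V(s) towards the TD target r + gamma * V(s')."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```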
TD(lambda)
N-step TD
- Define an \(n\)-step return:
- \(G^{(n)}_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})\)
- \(n\)-step TD learning:
- \(V(S_t) \leftarrow V(S_t) + \alpha (G^{(n)}_t - V(S_t))\)
Lambda return
- Average over all \(n\)-step returns \(G^{(n)}_t\).
- Use weighting \((1 - \lambda)\lambda^{n-1}\).
- The sum of the weights is one.
- \(G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t\)
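A quick check that the weights really do sum to one (truncated geometric series):

```python
lam = 0.8
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 500)]
print(sum(weights))   # -> ~1.0, so the lambda-return is a proper weighted average
```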
Forward-view
- \(V(S_t) \leftarrow V(S_t) + \alpha (G^{\lambda}_t - V(S_t))\)
- Looks to the future to compute \(G^{\lambda}_t\)
- Like MC, requires complete episodes.
Backward-view
- Same theory as forward-view, but actually practical to implement.
- Update online, every step, from incomplete episodes.
- Keep an eligibility trace for every state \(s\):
- \(E_0(s) = 0\)
- \(E_t(s) = \gamma \lambda E_{t-1}(s) + \textbf{1}(S_t = s)\)
- Intuitively, this represents both the frequency and recency of the state for credit assignment.
- Update \(V(s)\) in proportion to \(E_t(s)\).
- \(\delta_t = (R_{t+1} + \gamma V(S_{t+1})) - V(S_t)\)
- \(V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)\)
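A minimal sketch of one backward-view step, assuming \(V\) and \(E\) are dicts (missing entries default to 0):

```python
def td_lambda_step(V, E, s, r, s_next, alpha, gamma, lam):
    """Online backward-view TD(lambda): one step updates every state with a trace."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)     # TD error for this step
    for state in list(E):
        E[state] *= gamma * lam                                 # decay all traces
    E[s] = E.get(s, 0.0) + 1.0                                  # bump the visited state
    for state, trace in E.items():
        V[state] = V.get(state, 0.0) + alpha * delta * trace    # credit by frequency/recency
    return delta
```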
TD(0) and TD(1)
- \(\lambda = 0\) gives exactly the TD(0) update.
- \(\lambda = 1\) gives (with offline updates) the same total update as every-visit MC.
MC vs. TD
- TD can learn before knowing the final outcome.
- TD does not need complete episodes.
- MC has an unbiased estimate of the total return, TD does not.
- TD targets have lower variance than MC, as the total return can depend on many random actions.
- TD is usually more efficient than MC.
- TD is sensitive to the initial value.
- TD exploits the Markov Property.
Lecture 5: Model-Free Control
On-policy vs. off-policy
- On-policy, “learn on the job”:
- Learn about \(\pi\) from experience sampled from \(\pi\).
- Off-policy, “look over someone’s shoulder”:
- Learn about the target policy \(\pi\) from experience sampled from a behavior policy \(\mu\).
Epsilon-greedy policy improvement
- Idea for continuing exploration.
- Used when the full MDP is not known.
- With probability \(1 - \epsilon\) choose the greedy action.
- With probability \(\epsilon\) choose a random action.
Greedy in the Limit with Infinite Exploration (GLIE)
- Explore all state-action pairs infinitely many times.
- Converge onto a greedy policy.
- E.g. \(\epsilon\)-greedy is GLIE if \(\epsilon\) decays to zero, e.g. \(\epsilon_k = \frac{1}{k}\).
Monte-Carlo control
- Sample an episode \(S_1, A_1, R_2, ..., S_T \sim \pi\).
- For each time step \(t\):
- \(N(S_t, A_t) \leftarrow N(S_t, A_t) + 1\)
- \(Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))\)
- Improve the policy:
- \(\epsilon \leftarrow \frac{1}{k}\)
- \(\pi \leftarrow \epsilon \text{-greedy}(Q)\)
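A minimal sketch of one GLIE Monte-Carlo control iteration, assuming tabular \(Q\) and \(N\) stored as defaultdicts and every-visit updates:

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # action-value estimates Q(s, a)
N = defaultdict(int)     # visit counts N(s, a)

def mc_control_episode(episode, gamma, k):
    """Update Q from one finished episode of (state, action, reward) triples.

    The reward in each triple is the one received after taking the action;
    k is the episode index, so that epsilon = 1/k decays over time (GLIE).
    """
    G = 0.0
    for state, action, reward in reversed(episode):   # accumulate returns backwards
        G = reward + gamma * G
        N[(state, action)] += 1
        Q[(state, action)] += (G - Q[(state, action)]) / N[(state, action)]
    return 1.0 / k   # new epsilon for the behavior policy

def epsilon_greedy(state, actions, epsilon):
    """Random action with probability epsilon, otherwise the greedy one wrt Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```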
Sarsa(lambda)
- Like Monte-Carlo control, but the \(Q(s, a)\) values are updated with TD(\(\lambda\)) targets and eligibility traces over state-action pairs, instead of full-episode returns.
Off-policy learning
Importance sampling
- Use returns generated under the behavior policy \(\mu\) to evaluate the target policy \(\pi\).
- Weight samples by how similar the two policies are.
- For example, for MC:
- \(G^{\pi / \mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \frac{\pi(A_{t+1} | S_{t+1})}{\mu(A_{t+1} | S_{t+1})} ... G_t\)
- For example, for TD:
- \(\delta^{\pi/\mu}_t = \frac{\pi(A_t | S_t)}{\mu(A_t | S_t)} \delta_t\)
Q-learning
- Target policy \(\pi\) is different from behavior policy \(\mu\).
- But, we don’t use importance sampling.
- Target policy \(\pi\) is greedy wrt. \(Q(s, a)\).
- \(\pi(s) = \arg\max_a Q(s, a)\)
- Behavior policy \(\mu\) is \(\epsilon\)-greedy wrt. \(Q(s, a)\).
- So, Q-learning target is simplified:
- \(R_{t+1} + \gamma Q(S_{t+1}, A')\)
- \(R_{t+1} + \gamma Q(S_{t+1}, \arg \max_{a'} Q(S_{t+1}, a'))\)
- \(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')\)
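A minimal sketch of the resulting update rule, assuming a tabular \(Q\) stored as a defaultdict:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning backup: bootstrap from the greedy action in the next state."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```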
Lecture 6: Value Function Approximation
Value function approximation
- Tabular methods don’t scale to large numbers of states/actions.
- We use function approximators instead for e.g. \(V(s), Q(s, a)\).
- We use batch-updates for sample efficiency.
- Otherwise, everything else pretty much stays the same.
- Some algorithms aren’t guaranteed to converge with function approximation, but in practice they often do.
Experience replay
- Gain experience consisting of \(\langle \text{state}, \text{value} \rangle\) pairs:
- \(\mathcal{D} = \{(s_1, v^{\pi}_1), (s_2, v^{\pi}_2), ...\}\)
- Sample batches from \(\mathcal{D}\) and apply gradient descent.
Deep Q-Networks (DQN)
Uses experience replay and fixed Q-targets.
- Take action \(a_t\) according to an \(\epsilon\)-greedy policy.
- Store transition \((s_t, a_t, r_{t+1}, s_{t+1})\) in replay memory \(\mathcal{D}\).
- Sample random mini-batch of transitions \((s, a, r, s')\) from \(\mathcal{D}\).
- Optimize the mean-squared error between the Q-network (weights \(w_i\)) and the Q-learning targets computed with frozen target-network weights \(w^-_i\):
\[ \mathcal{L}(w_i) = \mathbb{E}_{s,a,r,s' \sim \mathcal{D}} [ (r + \gamma \max_{a'} Q(s', a' ; w^-_i) - Q(s, a; w_i))^2 ] \]
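A minimal sketch of that loss over a sampled batch, with `q_fn` and `q_target_fn` as placeholder callables standing in for the Q-network and the frozen target network, and a `done` flag added for terminal transitions:

```python
import numpy as np

def dqn_loss(q_fn, q_target_fn, batch, gamma):
    """Mean squared TD error over a replay batch.

    q_fn(s) / q_target_fn(s) return a vector of action-values for state s;
    batch is a list of (s, a, r, s_next, done) transitions from replay memory.
    """
    errors = []
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * np.max(q_target_fn(s_next))
        errors.append((target - q_fn(s)[a]) ** 2)
    return np.mean(errors)
```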
Lecture 7: Policy Gradient
Policy objective functions
How do we score policies?
- Start value:
- \(J(\theta) = V_\theta(s_1)\)
- Average value:
- \(J(\theta) = \sum_s d(s) V_\theta(s)\)
- where \(d(s)\) is the stationary distribution of the Markov chain for \(\pi_\theta\)
- Average reward per time step:
- \(J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s\)
Likelihood ratio trick
- \(\nabla_\theta \pi_\theta(s, a)\)
- \(= \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}\)
- \(= \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)\)
- The score function is \(\nabla_\theta \log \pi_\theta(s, a)\)
Policy gradient theorem
For any differentiable policy \(\pi_\theta(s, a)\).
For the start-value, average-value, and average-reward-per-time-step objective functions.
The policy gradient is:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[ \nabla_\theta \log \pi_\theta(s, a) Q^{\pi_\theta}(s, a) ] \]
One-step MDP (illustration)
- Starting in a state \(s \sim d(s)\).
- Terminating after one time step.
- Given reward \(r = \mathcal{R}^a_s\).
- Compute policy gradient:
- \(J(\theta) = \mathbb{E}_{\pi_\theta}[r]\)
- \(J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \mathcal{R}^a_s\)
- \(\nabla_\theta J(\theta) = \sum_s d(s) \sum_a \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s\)
- \(\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d, a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \mathcal{R}^a_s]\)
Monte-Carlo Policy gradient (REINFORCE)
- Update policy using policy gradient.
- Target is the cumulative discounted reward.
- For each episode, for each time step \(t\) in the episode:
- \(\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t\)
- where \(\theta\) is the policy parameters, and \(v_t\) is the cumulative discounted reward.
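A minimal sketch of this update for a (hypothetical) linear softmax policy over discrete actions, where `features[a]` is the feature vector \(\phi(s, a)\) and `G_t` plays the role of \(v_t\):

```python
import numpy as np

def softmax_policy(theta, features):
    """pi_theta(a | s) for a linear softmax policy; features has shape (n_actions, n_params)."""
    logits = features @ theta
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def reinforce_step(theta, features, action, G_t, alpha):
    """theta <- theta + alpha * grad log pi(a|s) * G_t, for one time step of one episode."""
    probs = softmax_policy(theta, features)
    # Score function of a linear softmax policy: phi(s, a) - E_{a' ~ pi}[phi(s, a')].
    grad_log_pi = features[action] - probs @ features
    return theta + alpha * grad_log_pi * G_t
```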
Criticism of REINFORCE
The Monte-Carlo estimates are high variance, which makes learning a policy harder.
Actor-Critic methods
- A critic estimates the Q-values.
- An actor updates its policy in the direction of the Q-values.
- Reduces variance compared to REINFORCE due to bootstrapping.
- We can use actors & critics at different time scales (MC, TD(0), TD(\(\lambda\))).
Compatible Function Approximation theorem
The policy gradient is exact in Actor-Critic methods if:
- The value function approximator is compatible with the policy:
- \(\nabla_w Q_w(s, a) = \nabla_\theta \log \pi_\theta(s, a)\)
- Value function parameters minimize the mean-squared error.
Advantage function
- Much of the variation in \(Q(s, a)\) comes from the state rather than the action, which adds variance to the policy gradient.
- So instead we use an advantage function, \(A(s, a) = Q(s, a) - V(s)\).
- We don’t need to learn both \(Q\) and \(V\), as we can just use the TD error:
- \(Q(s, a) - V(s)\)
- \(= \mathbb{E}[r + \gamma V(s') | s, a] - V(s)\)
- \(= \delta\)