Reinforcement Learning

Problem Definition

Simple Scenario Conditioned Response

Reinforcement Learning

Goal: learning how to perform a sequence of actions to maximize a global reward made up of immediate and future rewards.

Architecture

  • Agent: an agent takes actions within an environment.
  • Action (A): A is the set of all possible moves (steps/operations) the agent can make.
  • Environment: the world through which the agent moves and which responds to the agent. For the environment, the agent's current state and action are inputs, and the outputs are the next state and the reward.
  • State (S): a state is a concrete situation the agent finds itself in.
  • Reward (R): a reward is the feedback by which we measure the success or failure of the agent's action in a particular state.
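
Putting these pieces together, the interaction can be summarized as a loop: the agent observes a state, chooses an action, and the environment returns the next state and a reward. Below is a minimal sketch of that loop in Python; the `Environment` and `Agent` interfaces are illustrative placeholders, not part of any particular library.

```python
# Minimal sketch of the agent-environment interaction loop.
# The Environment and Agent interfaces below are illustrative placeholders.

class Environment:
    def reset(self):
        """Return the initial state S."""
        raise NotImplementedError

    def step(self, action):
        """Apply the agent's action; return (next_state, reward, done)."""
        raise NotImplementedError


class Agent:
    def act(self, state):
        """Choose an action A given the current state S."""
        raise NotImplementedError


def run_episode(env, agent):
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                # agent picks an action A
        state, reward, done = env.step(action)   # environment returns S' and R
        total_reward += reward
    return total_reward
```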

Applications

  • Navigating a video game
    • Actions: running, jumping, standing still, crouching, etc.
    • States: game configurations
    • Rewards: gains/losses which are dependent on the game
  • Flying a drone
    • Actions: velocities and acceleration in a 3D space
    • States: drone position in space and position of the goal (e.g. for delivery)
    • Rewards: positive (getting to goal), negative (crashing)
  • Financial domain (stock market)
    • Actions: buying, selling, holding
    • States: price of diverse stocks in the market
    • Rewards: immediate gain/loss from buy/sell
  • Robotic manipulation
  • Self-driving cars

Strategy and Q-Value

A strategy, called a policy ($\pi$), tells the agent what to do.

Evaluation of a policy should be an average over many episodes, for example:

  • average reward
  • average number of failed attempts

Searching for an optimal strategy

The goal of reinforcement learning is to find a policy that will optimize the long-term gain.

The gain is not necessarily immediate. An action taken at time $t$ may have an impact on the gain at time $t+n$.

We would like to discover (learn) a strategy that tells the agent what action to take in each state. If this strategy leads to the maximum long-term gain, then we have an optimal strategy ($\pi^*$). This strategy will rely on the Q-value, the estimated long-term reward.

Defining Q-Value

The Q-value (Q) is the long-term expected reward, as opposed to R, which is the immediate reward.

$Q(s,a)$ refers to the expected long-term return of taking action $a$ in state $s$.

$Q^*(s,a)$ refers to the MAXIMUM expected long-term return of taking action $a$ in state $s$.

The best value of Q in one state depends on the best value of Q in the next state. The value of Q is expressed using a discount factor $\gamma$ on future rewards.

$$Q^*(s,a) = r_0 + \gamma \max_{a'} Q^*(s',a')$$
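
As a small illustration (not from the original notes), this recurrence can be iterated directly on a toy deterministic environment; the chain environment and its rewards below are assumptions made only for the example.

```python
import numpy as np

# Toy illustration of Q*(s,a) = r + gamma * max_a' Q*(s',a'):
# a 4-state chain 0 -> 1 -> 2 -> 3, where state 3 is the goal.
# Actions: 0 = move right, 1 = stay. Reaching the goal gives reward 1.
n_states, n_actions, gamma = 4, 2, 0.9

def transition(s, a):
    """Deterministic toy dynamics: return (next_state, reward)."""
    s_next = s + 1 if (a == 0 and s < 3) else s
    reward = 1.0 if (s_next == 3 and s != 3) else 0.0
    return s_next, reward

# Iterate the recurrence until the Q* values stabilize.
Q = np.zeros((n_states, n_actions))
for _ in range(100):
    for s in range(n_states):
        for a in range(n_actions):
            s_next, r = transition(s, a)
            Q[s, a] = r + gamma * Q[s_next].max()

print(Q)   # Q*(s, right) > Q*(s, stay) in every non-goal state
```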

Greedy Policy

In terms of the optimal policy ($\pi^*$), what action should the agent take?

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$
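
Continuing the toy example above, extracting the greedy policy from a $Q^*$-table is just a row-wise argmax:

```python
# pi*(s) = argmax_a Q*(s, a), using the Q array from the previous sketch
policy = Q.argmax(axis=1)
print(policy)   # action 0 ("move right") in every non-goal state
```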

Q-Learning

Q-learning is the idea of learning better and better approximations of the $Q^*$ values by playing again and again and again!

Q-learning comprises two components: a reward table and a Q-table. The reward table remains fixed, whereas the Q-table is constantly updated.

Algorithm

Initialization

  • Build the reward table
  • Initialize the Q-table with random values
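
For instance, with a tabular representation the two components could be initialized as below (the sizes are arbitrary, chosen only for the sketch):

```python
import numpy as np

n_states, n_actions = 6, 2

# Reward table: fixed, filled from the environment's rewards.
R = np.zeros((n_states, n_actions))

# Q-table: initialized at random, updated during learning.
Q = np.random.uniform(0, 1, size=(n_states, n_actions))
```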

Learning - Play multiple episodes

  • Start in a state S
  • Repeat
    • For all possible actions from the state (S), select the one (A) with the highest Q-value
    • Update the Q-table value Q(S, A)
    • Travel to the next state (S') as a result of that action (A)
    • If S' is the goal state, then end
    • Otherwise, set S to S' (the next state becomes the current state)

Updating the Q-table:

  • $Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') \right)$
  • $\alpha$ is the learning rate
  • $\gamma$ is the discount factor
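
Putting the loop and the update rule together, here is a compact sketch of tabular Q-learning on a toy chain environment. The environment, the hyperparameter values, and the use of an ε-greedy choice (discussed in the next section) are assumptions made for the example.

```python
import numpy as np

n_states, n_actions = 6, 2            # toy chain; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1
goal = n_states - 1

def transition(s, a):
    """Toy deterministic environment: reward 1 only when reaching the goal."""
    s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == goal else 0.0)

Q = np.random.uniform(0, 0.01, size=(n_states, n_actions))  # random initialization
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0                                             # start state
    while s != goal:
        # epsilon-greedy action choice (see Exploration below)
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next, r = transition(s, a)
        # Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_next].max())
        s = s_next

print(Q.argmax(axis=1))   # learned greedy policy: "right" in every non-goal state
```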

Exploration

During learning, our estimated optimal policy $\hat{\pi}^*$ is not yet at its best, since we are still in the process of learning it. The agent should therefore keep doing a bit of exploration, using an $\epsilon$-greedy strategy.

The exploration proportion ($\epsilon$) can be shifted toward more exploitation and less exploration as the Q-table comes to contain better approximations.
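
A common way to implement this is a decaying ε schedule: start almost fully exploratory and gradually shift toward exploitation. The schedule and values below are illustrative assumptions, not prescribed settings.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon explore (random action), otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # exploration
    return int(Q[state].argmax())              # exploitation

epsilon, epsilon_min, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    # ... play one episode, choosing actions with epsilon_greedy(Q, s, epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)   # explore less as Q improves
```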

Deep Q-Learning

Limitations of Q-learning arise when we factor in the many different constraints and states an agent could be in: the Q-table quickly becomes too large to build and update. For complex problems, tabular Q-learning is no longer practical.

With deep Q-learning, the idea is to learn the function $Q^*$ with a neural network (a DQN). The input to the network is the state, followed by hidden layers, with one output per possible action.
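
A minimal sketch of such a network, written here with PyTorch (the layer sizes and the choice of PyTorch are assumptions for the example; the notes do not prescribe a specific framework):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """State vector in, one Q-value estimate per possible action out."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output node per action
        )

    def forward(self, state):
        return self.net(state)

# Example: a state described by 4 numbers and 2 possible actions.
q_net = DQN(state_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))   # shape (1, 2): Q(s, a) for each action
```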

Building a Game Memory

  1. Initialization
    • Play with random actions for a few games (until the goal is reached, or a crash/loss occurs). Fill the memory M of size N with the transitions from these games (N being the number of games played).
    • Initialize two networks: the player network (R1) and the learning network (R2). R1 is a copy of R2 at this initial stage.
  2. Learning
    • Repeat until done:
    • Train
      • Use the game memory to train network (R2)
    • Play
      • Repeat for X games
        • Play using strategy $\hat{\pi}^*$, which will be $\epsilon$-greedy, using the maximum values $Q^*(s, a)$ as predicted by the R2 network.
    • Update
      • Fill the game memory with the new games. As the memory has a limited size, always replace the oldest games in memory with the new games.
      • Copy network R2 to network R1
  3. Train
    • For each example $i$ of the memory, $e_i = (s_i, a_i, s'_i, r_i)$:
      • Forward pass: predict $q_i = \hat{Q}(s_i, a_i)$ with network R2, using state $s_i$ as input. Note that only a single output node is of interest at this point: the node corresponding to action $a_i$. The error will be propagated from that node.
      • Obtain the target $t_i = r_i + \gamma \max_{a'} Q^*(s'_i, a')$ with a forward pass of network R1, using state $s'_i$ as input, and taking the maximum over the output predictions.
      • Calculate the L2 error between the target and the prediction, $E_i = \frac{1}{2}(t_i - q_i)^2$, and propagate this error to adjust the weights of the R2 network.
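
The sketch below ties the pieces together: a bounded game memory and one training step that uses R2 for the prediction and R1 for the target, following steps 2 and 3 above. It reuses the illustrative DQN class from the previous sketch; the batch size, the optimizer, and the extra `done` flag marking terminal transitions are assumptions added for the example.

```python
import random
from collections import deque

import torch

GAMMA = 0.99
memory = deque(maxlen=10_000)            # game memory M with a limited size N

def remember(s, a, s_next, r, done):
    """Store one transition e_i = (s_i, a_i, s'_i, r_i) plus a terminal flag."""
    memory.append((s, a, s_next, r, done))

def train_step(r2, r1, optimizer, batch_size=32):
    """One learning step: train R2 on a random batch from the memory."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)
    s, a, s_next, r, done = (torch.as_tensor(x, dtype=torch.float32)
                             for x in zip(*batch))
    a = a.long()

    # Forward pass with R2: keep only the output node of the action taken.
    q = r2(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target from R1: t_i = r_i + gamma * max_a' Q(s'_i, a'), no gradient needed.
    with torch.no_grad():
        t = r + GAMMA * r1(s_next).max(dim=1).values * (1 - done)

    # L2 error E_i = 1/2 (t_i - q_i)^2, backpropagated through R2 only.
    loss = 0.5 * (t - q).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Usage sketch (reusing the DQN class from the previous example):
#   r2 = DQN(state_dim=4, n_actions=2)
#   r1 = DQN(state_dim=4, n_actions=2)
#   r1.load_state_dict(r2.state_dict())          # R1 starts as a copy of R2
#   optimizer = torch.optim.Adam(r2.parameters(), lr=1e-3)
#   ... periodically: r1.load_state_dict(r2.state_dict())   # copy R2 into R1
```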

$\pi$ (deep Q-learning): for each $s$, take ($\epsilon$-greedily) the action corresponding to the highest value of $Q^*(s,a)$ as predicted by the network when $s$ is given as its input.