Reinforcement Learning
Problem Definition
Simple Scenario: Conditioned Response
Reinforcement Learning
Goal: Learning how to perform a sequence of actions to maximize a global reward made of immediate and future rewards.
Architecture
- Agent: an agent takes actions within an environment
- Action(A): A is the set of all possible moves (steps/operations) the agent can make.
- Environment: The world through which the agent moves and which responds to the agent. The environment takes the agent's current state and action as inputs, and returns the next state and the reward as outputs.
- State(S): a state is a concrete situation the agent finds itself in.
- Reward(R): A reward is the feedback by which we measure success or failure of the agent's action at a particular state.
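As a minimal sketch of this agent/environment loop, here is a hypothetical toy environment (the TinyEnv class, its reward, and its goal position are made up for illustration; they are not part of the original notes):

```python
import random

class TinyEnv:
    """Hypothetical 1-D world: the agent starts at position 0 and must reach position 4."""
    GOAL = 4

    def reset(self):
        self.pos = 0
        return self.pos                          # initial state S

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))  # action 1: right, 0: left
        reward = 1.0 if self.pos == self.GOAL else 0.0            # reward R from the environment
        done = self.pos == self.GOAL
        return self.pos, reward, done            # next state S', reward R, end-of-episode flag

env = TinyEnv()
state, done = env.reset(), False
while not done:
    action = random.choice([0, 1])               # a (for now) random agent picking an action A
    state, reward, done = env.step(action)       # the environment answers with S' and R
```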
Applications
- Navigating a video game
- Actions: running, jumping, standing still, crouching, etc.
- States: game configurations
- Rewards: gains/losses which are dependent on the game
- Flying a drone
- Actions: velocities and acceleration in a 3D space
- States: drone position in space and position of the goal (e.g. for delivery)
- Rewards: positive (getting to goal), negative (crashing)
- Financial domain (stock market)
- Actions: buying, selling, holding
- States: price of diverse stocks in the market
- Rewards: immediate gain/loss from buy/sell
- Robotic manipulation
- Self-driving cars
Strategy and Q-Value
A strategy, called a policy (π), tells the agent what to do.
Evaluation of a policy should be an average over many episodes
- average reward
- average number of failed attempts
Searching for an optimal strategy
The goal of reinforcement learning is to find a policy that will optimize the long-term gain.
The gain is not necessarily immediate. An action taken at time t may have an impact on the gain at a later time t + k.
We would like to discover (learn) a strategy that tells the agent what action to take in each state. If this strategy leads to the maximum long-term gain, then we have an optimal strategy (π*). This strategy will rely on the Q-value, the estimated long-term reward.
Defining Q-Value
Q-value (Q) is the long-term expected reward, as opposed to R, which is the immediate reward.
Q(s, a) refers to the expected long-term return of taking action a in state s.
Q*(s, a) refers to the MAXIMUM expected long-term return of taking action a in state s.
The best value of Q in one state depends on the best value of Q in the next state. The value of Q is expressed using a discount factor γ applied to future rewards.
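As a sketch, this recursive relation can be written in the standard Bellman form (with r the immediate reward, γ the discount factor, and s' the next state), which is consistent with the update rule given later:

```latex
Q^*(s, a) \;=\; \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \,\right]
```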
Greedy Policy
In terms of the policy π, what action should the agent take? A greedy policy always selects, in the current state, the action with the highest Q-value.
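Written as a formula, a sketch of the greedy policy is:

```latex
\pi(s) \;=\; \arg\max_{a \in A} Q(s, a)
```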
Q-Learning
Q-learning is the idea of learning better and better approximations of the Q* values by playing again and again and again!
Q-learning comprises two components: a reward table and a Q-table. The reward table remains fixed, whereas the Q-table is constantly updated.
Algorithm
Initialization
- Build Reward table
- Initialize Q-Table with values at random
Learning - Play multiple episodes
- Start in a state S
- Repeat
- From the state (S), among all possible actions, select the one (A) with the highest Q-value
- Travel to the next state (S') as a result of that action A
- Update the Q-table value Q(S, A) using the reward and the best Q-value in S'
- If S' is the goal state, then end
- else set S to S' (the next state becomes the current state)
Updating the Q-table:
- Q(state, action) ← (1 − α) · Q(state, action) + α · (reward + γ · max Q*(next state, all actions))
- α is the learning factor
- γ is the discount factor
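A minimal sketch of this tabular update, assuming NumPy arrays Q and R indexed by (state, action) and hypothetical values for α, γ, and the table sizes:

```python
import numpy as np

alpha, gamma = 0.1, 0.9                    # learning rate and discount factor (assumed values)
n_states, n_actions = 6, 4                 # hypothetical problem size
R = np.zeros((n_states, n_actions))        # reward table: fixed, filled from the problem
Q = np.random.rand(n_states, n_actions)    # Q-table: initialized at random, updated during learning

def q_update(s, a, s_next):
    """Tabular Q-learning update after taking action a in state s and landing in state s_next."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (R[s, a] + gamma * np.max(Q[s_next]))

# Example: the agent in state 2 takes action 1 and ends up in state 3
q_update(2, 1, 3)
```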
Exploration
During learning, our policy is not yet at its best ... as we are in the process of learning it. The agent should keep doing a bit of exploration, using an ε-greedy strategy.
The proportion of exploration (ε) can be shifted toward more exploitation and less exploration as the Q-table contains better and better approximations.
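A minimal sketch of ε-greedy action selection over the Q-table (the decay schedule shown is just one possible choice, not one prescribed by these notes):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon explore (random action); otherwise exploit the best known action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))     # exploration: any action
    return int(np.argmax(Q[s]))                  # exploitation: action with the highest Q-value

# epsilon can be decayed over episodes to shift from exploration to exploitation
epsilon = 1.0
for episode in range(500):
    epsilon = max(0.05, epsilon * 0.99)          # keep a small amount of exploration
```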
Deep Q-Learning
Limitations of Q-learning arise when we factor in several different constraints, or when the number of states an agent could be in becomes very large. For such complex problems, a tabular Q-table is no longer practical.
With deep Q-learning (DQN), the idea is to learn the function Q* with a neural network. The inputs of the neural net are the state variables; after the hidden layers, there is one output per possible action.
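A minimal sketch of such a network in PyTorch (the state size, number of actions, and layer widths are hypothetical):

```python
import torch
import torch.nn as nn

state_dim, n_actions = 8, 4                  # hypothetical sizes

# The network maps a state vector to one estimated Q-value per action
q_network = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

state = torch.randn(1, state_dim)            # a dummy state as input
q_values = q_network(state)                  # one Q-value per possible action
greedy_action = int(q_values.argmax(dim=1))  # pick the action with the highest predicted Q-value
```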
Building a Game Memory
- Initialization
- Play with random actions for a few games (until the goal is reached, or a crash/loss occurs). Fill the memory M of size N with the transitions observed in these games; here N is the number of games kept in memory.
- Initialize two networks: the player network (R1) and the learning network (R2). R1 is a copy of R2 at this initial stage.
- Learning
- Repeat until done:
- Train
- Use the game memory to train network (R2)
- Play
- Repeat for X games
- Play using a strategy that is ε-greedy over the maximum values Q*(s, a) predicted by the R2 network
- Update
- Fill the game memory with the new games. As the memory has a limited size, always replace the oldest games in memory with the new ones.
- Copy network R2 to network R1
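A minimal sketch of such a game memory using a Python deque, plus the periodic copy of R2 into R1 (the capacity, batch size, and function names are assumptions for illustration):

```python
import copy
import random
from collections import deque

N = 10_000                         # assumed memory capacity
memory = deque(maxlen=N)           # the oldest entries are dropped automatically when full

def remember(transition):
    """Store one (state, action, reward, next_state, done) transition in the game memory."""
    memory.append(transition)

def sample_batch(batch_size=32):
    """Draw a random batch of past transitions to train the learning network R2."""
    return random.sample(memory, min(batch_size, len(memory)))

def refresh_player_network(r2):
    """R1 <- copy of R2: the player network becomes a snapshot of the learning network."""
    return copy.deepcopy(r2)
```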
- Train
- For each example (s, a, r, s') of the memory (a sketch follows this list):
- Forward pass: predict with network R2, using the state s as input. Note that a single output node is of interest at this point: the node corresponding to the action a. We will propagate the error from that node only.
- Obtain the target, r + γ · max over a' of Q(s', a'), by a forward-pass prediction with network R1 using the next state s' as input and taking the maximum over the output predictions.
- Calculate the L2 error between the target and the prediction, and propagate this error to adjust the weights of the R2 network.
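A minimal sketch of one such training step in PyTorch, with r2 as the learning network R2 and r1 as a frozen copy playing the role of R1; the layer sizes, optimizer, and terminal-state handling are assumptions, not specified in these notes:

```python
import copy
import torch
import torch.nn as nn

gamma = 0.9
r2 = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))  # learning network R2
r1 = copy.deepcopy(r2)                                             # player network R1 (kept frozen)
optimizer = torch.optim.Adam(r2.parameters(), lr=1e-3)
l2_loss = nn.MSELoss()                                             # L2 error

def train_step(batch):
    """One training pass of R2 over a batch of (s, a, r, s_next, done) transitions."""
    s, a, r, s_next, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    # Forward pass with R2: keep only the output node of the action actually taken
    pred = r2(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Target from R1: reward + discounted maximum of R1's predictions for the next state
    with torch.no_grad():
        target = r + gamma * (1 - done) * r1(s_next).max(dim=1).values
    loss = l2_loss(pred, target)      # L2 error between target and prediction
    optimizer.zero_grad()
    loss.backward()                   # propagate the error to adjust only R2's weights
    optimizer.step()
```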
Play
- For each state s, take (ε-greedy) the action corresponding to the highest value of Q(s, a) as predicted by the network R2 when we give it s as input.