$\epsilon$-greedy Exploration Function
Q-learning with an epsilon-greedy exploration strategy is a popular reinforcement learning algorithm. The objective of Q-learning is to find an optimal policy that maximizes the cumulative reward for an agent interacting with an environment. The epsilon-greedy strategy is a method used within Q-learning to balance exploration (trying new actions) and exploitation (choosing the best-known actions) in the action selection process.
1. Initialize Q-values: For each state-action pair, initialize a Q-value (often to zero). These values will be updated to represent the expected reward of taking a particular action in a given state.
2. Choose an Action: In each state, use the epsilon-greedy strategy to choose an action:
   - With probability $\epsilon$: Choose a random action (exploration).
   - With probability $1 - \epsilon$: Choose the action with the highest Q-value for that state (exploitation).
3. Take Action and Observe Reward: Perform the chosen action and observe the resulting reward and the next state.
4. Update Q-value: Update the Q-value for the state-action pair using the Q-learning update rule:

   $$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$$

   where:
   - $s$ and $a$ are the current state and action.
   - $r$ is the reward received after taking action $a$.
   - $s'$ is the next state after taking action $a$.
   - $a'$ ranges over the possible actions from $s'$.
   - $\alpha$ is the learning rate (controls how much new information overrides old information).
   - $\gamma$ is the discount factor (determines the importance of future rewards).
5. Repeat: Repeat steps 2-4 for a specified number of episodes or until the agent reaches a stopping criterion.
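The action-selection rule in step 2 can be sketched as a small Python helper. This is illustrative, not a library API: it assumes `Q` is a dict mapping `(state, action)` pairs to values, with unseen pairs defaulting to 0.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick an action for `state` using the epsilon-greedy rule.

    Q       -- dict mapping (state, action) pairs to Q-values
    actions -- list of actions available in `state`
    epsilon -- exploration probability in [0, 1]
    """
    if random.random() < epsilon:
        # Explore: uniformly random action.
        return random.choice(actions)
    # Exploit: action with the highest Q-value (ties broken by list order,
    # unseen pairs default to 0.0).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

With `epsilon = 0` this always exploits; with `epsilon = 1` it always explores, so the same function covers the whole exploration-exploitation spectrum.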
Here's the pseudocode for a gridworld game where the agent's objective is to reach a goal while avoiding obstacles and collecting rewards.
```
Initialize Q(s, a) arbitrarily for all states s and actions a
Set parameters epsilon, alpha, and gamma
For each episode:
    Initialize the starting state s
    While s is not a terminal state:
        Choose action a using epsilon-greedy policy on Q(s, a):
            With probability epsilon: choose a random action
            With probability 1 - epsilon: choose the action with max Q(s, a)
        Take action a, observe reward r and next state s'
        Update Q(s, a) using the Q-learning update rule:
            Q(s, a) = Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
        Set s = s'
    Reduce epsilon (optional, to encourage exploitation over time)
```
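The update line in the pseudocode is a one-liner in practice. Here is a minimal sketch with an illustrative (hypothetical) function name; the arithmetic is exactly the rule above:

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Example: from Q(s,a) = 0, receiving reward 10 with no future value,
# alpha = 0.5 and gamma = 0.9 move the estimate halfway to the target:
# q_update(0.0, 10.0, 0.0, 0.5, 0.9) == 5.0
```

Note this incremental form is algebraically identical to the $(1-\alpha)Q + \alpha(\cdot)$ form used earlier; both interpolate between the old estimate and the new target.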
- Initialization: Initialize the Q-values table for all state-action pairs, setting values to zero or small random values.
- Epsilon-Greedy Action Selection: During each episode, the agent decides between exploration (choosing a random action) and exploitation (choosing the best-known action based on current Q-values).
- Action Execution and Reward Observation: After taking an action, the agent receives a reward and transitions to the next state. In a gridworld, actions might involve moving up, down, left, or right.
- Q-value Update: Update the Q-value for the previous state and action using the reward obtained and the maximum future Q-value from the next state.
- Episode End: The episode ends when the agent reaches a terminal state (such as the goal state in a gridworld) or a set number of steps is completed.
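The optional "reduce epsilon" step in the pseudocode is commonly implemented as multiplicative decay with a floor, so the agent never stops exploring entirely. This is one common choice among several (linear decay also works); the values below are illustrative assumptions:

```python
def decay_epsilon(epsilon, decay=0.995, min_epsilon=0.01):
    """Shrink epsilon after each episode, but never below min_epsilon."""
    return max(min_epsilon, epsilon * decay)
```

Starting from `epsilon = 1.0` and calling this once per episode gradually shifts the agent from mostly exploring to mostly exploiting.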
Let's say we have a 4x4 grid, where:
- The agent starts in the top-left corner (0, 0).
- The goal is in the bottom-right corner (3, 3) with a reward of +10.
- There are obstacles with penalties (e.g., -5).
- Every step has a small negative reward to encourage reaching the goal quickly.
The agent will learn by exploring and updating its Q-values for each possible move in each grid cell.
Here's how it might look in action:
Gridworld:

```
S . . .
. X . .
. . X .
. . . G
```

Legend:
- S - Start
- G - Goal (+10 reward)
- X - Obstacle (-5 penalty)
- . - Empty space (-0.1 reward for each step)

Actions: Up, Down, Left, Right
After enough episodes, the agent will learn to navigate around obstacles and reach the goal while avoiding penalties.
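Putting it all together, here is a minimal self-contained sketch of tabular Q-learning on this 4x4 grid. The rewards come from the legend above (goal +10, obstacles -5, step cost -0.1, obstacles at the two X cells); the hyperparameters and helper names are illustrative assumptions, not a canonical implementation.

```python
import random

# 4x4 gridworld from the diagram: start at (0, 0), goal at (3, 3).
SIZE = 4
GOAL = (3, 3)
OBSTACLES = {(1, 1), (2, 2)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state
    if next_state == GOAL:
        return next_state, 10.0, True    # goal reward, episode ends
    if next_state in OBSTACLES:
        return next_state, -5.0, False   # obstacle penalty
    return next_state, -0.1, False       # small step cost

def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy action selection."""
    random.seed(seed)
    actions = list(ACTIONS)
    Q = {}  # (state, action) -> value; unseen pairs default to 0.0
    for _ in range(episodes):
        state, done, n_steps = (0, 0), False, 0
        while not done and n_steps < 100:
            if random.random() < epsilon:
                action = random.choice(actions)   # explore
            else:                                 # exploit
                action = max(actions, key=lambda a: Q.get((state, a), 0.0))
            next_state, reward, done = step(state, action)
            best_next = max(Q.get((next_state, a), 0.0) for a in actions)
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
            n_steps += 1
    return Q

def greedy_path(Q, max_steps=50):
    """Follow the learned greedy policy from the start cell."""
    state, path = (0, 0), [(0, 0)]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path
```

After training, following the greedy policy from the start typically traces a short path around the obstacles to the goal, which is the learned behavior the paragraph above describes.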