An application of reinforcement learning to a gridworld game.

zw3413/q_learning

  • $\epsilon$-greedy

$$ Q(s,a) = (1- \alpha) * Q(s,a) + \alpha * ( r + \gamma * \max \limits_{a'} Q(s', a')) $$

  • Exploration Function

$$ Q(s,a) = (1- \alpha) * Q(s,a) + \alpha * ( r + \gamma * \max \limits_{a'} f(Q(s', a'), N(s',a')) ) $$

$$ f(Q(s,a),N(s,a)) = Q(s,a) + \frac{k}{\sqrt{N(s,a)+1}} $$
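The exploration bonus $\frac{k}{\sqrt{N(s,a)+1}}$ can be computed directly; here is a minimal sketch (the function name and the default $k$ are illustrative assumptions, and the denominator uses a square root so the bonus decays gently with visit count):

```python
import math

def exploration_value(q, n, k=1.0):
    """Optimistic value f(Q, N): the Q-value plus a bonus that
    shrinks as the visit count N of the state-action pair grows."""
    return q + k / math.sqrt(n + 1)
```

An unvisited pair ($N = 0$) gets the full bonus $k$, so the agent is drawn toward actions it has tried rarely; as $N$ grows the bonus vanishes and $f$ approaches $Q$.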

Q-learning with an epsilon-greedy exploration strategy is a popular reinforcement learning algorithm. The objective of Q-learning is to find an optimal policy that maximizes the cumulative reward for an agent interacting with an environment. The epsilon-greedy strategy is a method used within Q-learning to balance exploration (trying new actions) and exploitation (choosing the best-known actions) in the action selection process.

Steps in Q-Learning with Epsilon-Greedy Exploration

  1. Initialize Q-values: For each state-action pair, initialize a Q-value (often to zero). These values will be updated to represent the expected reward of taking a particular action in a given state.

  2. Choose an Action: In each state, use the epsilon-greedy strategy to choose an action:

    • With probability $\epsilon$: Choose a random action (exploration).
    • With probability $1 - \epsilon$: Choose the action with the highest Q-value for that state (exploitation).
  3. Take Action and Observe Reward: Perform the chosen action, observe the resulting reward and the next state.

  4. Update Q-value: Update the Q-value for the state-action pair using the Q-learning update rule: $$ Q(s, a) = (1-\alpha)Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right) $$ where:

    • $s$ and $a$ are the current state and action.
    • $r$ is the reward received after taking action $a$.
    • $s'$ is the next state after taking action $a$.
    • $a'$ ranges over all possible actions from $s'$.
    • $\alpha$ is the learning rate (controls how much new information overrides old information).
    • $\gamma$ is the discount factor (determines the importance of future rewards).
  5. Repeat: Repeat steps 2-4 for a specified number of episodes or until the agent reaches a stopping criterion.
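Steps 2 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the action names, the `defaultdict` Q-table keyed by `(state, action)`, and the default hyperparameters are all assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)          # Q[(state, action)], defaults to 0.0

def choose_action(state, epsilon=0.1):
    """Step 2: epsilon-greedy selection — random action with
    probability epsilon, otherwise the greedy (highest-Q) action."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Step 4: one application of the Q-learning update rule."""
    target = reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * target
```

With all Q-values initialized to zero, a single update toward a reward of 1.0 moves `Q[(state, action)]` to `alpha * 1.0`, exactly as the convex-combination form of the rule predicts.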

Pseudocode for Q-Learning with Epsilon-Greedy in a Gridworld Environment

Here's pseudocode for a gridworld game in which the agent's objective is to reach a goal while avoiding obstacles and collecting rewards.

Initialize Q(s, a) arbitrarily for all states s and actions a
Set parameters epsilon, alpha, and gamma

For each episode:
    Initialize the starting state s
    
    While s is not a terminal state:
        Choose action a using epsilon-greedy policy on Q(s, a)
            With probability epsilon: choose a random action
            With probability 1 - epsilon: choose the action with max Q(s, a)
        
        Take action a, observe reward r and next state s'
        
        Update Q(s, a) using the Q-learning update rule:
            Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(s', a')) - Q(s, a))
        
        Set s = s'
    
    Reduce epsilon (optional, to encourage exploitation over time)

Explanation of the Pseudocode

  1. Initialization: Initialize the Q-values table for all state-action pairs, setting values to zero or small random values.

  2. Epsilon-Greedy Action Selection: During each episode, the agent decides between exploration (choosing a random action) and exploitation (choosing the best-known action based on current Q-values).

  3. Action Execution and Reward Observation: After taking an action, the agent receives a reward and transitions to the next state. In a gridworld, actions might involve moving up, down, left, or right.

  4. Q-value Update: Update the Q-value for the previous state and action using the reward obtained and the maximum future Q-value from the next state.

  5. Episode End: The episode ends when the agent reaches a terminal state (such as the goal state in a gridworld) or after a set number of steps has been completed.
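The last line of the pseudocode suggests reducing epsilon over time so the agent shifts from exploration to exploitation. One common sketch is exponential decay toward a floor; the schedule parameters below are illustrative assumptions, not values from this repository:

```python
def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exponentially shrink epsilon each episode, never below `end`,
    so early episodes explore heavily and later ones mostly exploit."""
    return max(end, start * decay ** episode)
```

Calling this once per episode gives `start` at episode 0 and approaches `end` asymptotically; a linear schedule works just as well in small gridworlds.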

Example of a Simple Gridworld Game

Let's say we have a 4x4 grid, where:

  • The agent starts in the top-left corner (0, 0).
  • The goal is in the bottom-right corner (3, 3) with a reward of +10.
  • There are obstacles with penalties (e.g., -5).
  • Every step has a small negative reward to encourage reaching the goal quickly.

The agent will learn by exploring and updating its Q-values for each possible move in each grid cell.

Here's how it might look in action:

Gridworld:
S . . .
. X . .
. . X .
. . . G

Legend:
S - Start
G - Goal (+10 reward)
X - Obstacle (-5 penalty)
. - Empty space (-0.1 reward for each step)

Actions: Up, Down, Left, Right

After enough episodes, the agent will learn to navigate around obstacles and reach the goal while avoiding penalties.
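The whole example can be sketched end to end. The grid layout, coordinates, and reward values below come from the legend above; everything else (function names, hyperparameters, the 200-step episode cap) is an assumed illustration, not the repository's implementation:

```python
import random

# 4x4 grid from the figure: start (0, 0), goal (3, 3) worth +10,
# obstacles at (1, 1) and (2, 2) worth -5, -0.1 step cost elsewhere.
SIZE, START, GOAL = 4, (0, 0), (3, 3)
OBSTACLES = {(1, 1), (2, 2)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply a move, clipping at the grid edges.
    Returns (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 10.0, True      # goal: +10, episode ends
    if nxt in OBSTACLES:
        return nxt, -5.0, False     # obstacle: -5 penalty
    return nxt, -0.1, False         # empty cell: small step cost

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    random.seed(seed)
    Q = {((r, c), a): 0.0
         for r in range(SIZE) for c in range(SIZE) for a in ACTIONS}
    for _ in range(episodes):
        state, done = START, False
        for _ in range(200):        # cap episode length
            if done:
                break
            if random.random() < epsilon:
                action = random.choice(list(ACTIONS))
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = max(Q[(nxt, a)] for a in ACTIONS)
            Q[(state, action)] = ((1 - alpha) * Q[(state, action)]
                                  + alpha * (reward + gamma * best_next))
            state = nxt
    return Q
```

Following the greedy policy from the trained table (picking `argmax_a Q[(state, a)]` at each cell) traces a path from S to G that detours around the penalized cells.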
