$\epsilon$-greedy Exploration Function
Q-learning with an epsilon-greedy exploration strategy is a popular reinforcement learning algorithm. The objective of Q-learning is to find an optimal policy that maximizes the cumulative reward for an agent interacting with an environment. The epsilon-greedy strategy is a method used within Q-learning to balance exploration (trying new actions) and exploitation (choosing the best-known actions) in the action selection process.
1. Initialize Q-values: For each state-action pair, initialize a Q-value (often to zero). These values will be updated to represent the expected reward of taking a particular action in a given state.
2. Choose an Action: In each state, use the epsilon-greedy strategy to choose an action:
   - With probability $\epsilon$: Choose a random action (exploration).
   - With probability $1 - \epsilon$: Choose the action with the highest Q-value for that state (exploitation).
3. Take Action and Observe Reward: Perform the chosen action and observe the resulting reward and the next state.
4. Update Q-value: Update the Q-value for the state-action pair using the Q-learning update rule:

   $$Q(s, a) \leftarrow (1-\alpha)\,Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') \right)$$

   where:
   - $s$ and $a$ are the current state and action.
   - $r$ is the reward received after taking action $a$.
   - $s'$ is the next state after taking action $a$.
   - $a'$ ranges over the possible actions from $s'$.
   - $\alpha$ is the learning rate (controls how much new information overrides old information).
   - $\gamma$ is the discount factor (determines the importance of future rewards).
5. Repeat: Repeat steps 2-4 for a specified number of episodes or until the agent reaches a stopping criterion.
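The action-selection rule in step 2 can be sketched as a small Python helper. This is illustrative, not a library API: it assumes `Q` is a dict mapping `(state, action)` pairs to values, with unseen pairs defaulting to 0.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Pick an action for `state` using the epsilon-greedy rule.

    Q       -- dict mapping (state, action) pairs to Q-values
    actions -- list of actions available in `state`
    epsilon -- exploration probability in [0, 1]
    """
    if random.random() < epsilon:
        # Explore: uniformly random action.
        return random.choice(actions)
    # Exploit: action with the highest Q-value (ties broken by list order,
    # unseen pairs default to 0.0).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

With `epsilon = 0` this always exploits; with `epsilon = 1` it always explores, so the same function covers the whole exploration-exploitation spectrum.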
Here's the pseudocode for a gridworld game where the agent's objective is to reach a goal while avoiding obstacles and collecting rewards.
```
Initialize Q(s, a) arbitrarily for all states s and actions a
Set parameters epsilon, alpha, and gamma
For each episode:
    Initialize the starting state s
    While s is not a terminal state:
        Choose action a using epsilon-greedy policy on Q(s, a):
            With probability epsilon: choose a random action
            With probability 1 - epsilon: choose the action with max Q(s, a)
        Take action a, observe reward r and next state s'
        Update Q(s, a) using the Q-learning update rule:
            Q(s, a) = Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
        Set s = s'
    Reduce epsilon (optional, to encourage exploitation over time)
```
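The update line in the pseudocode is a one-liner in practice. Here is a minimal sketch with an illustrative (hypothetical) function name; the arithmetic is exactly the rule above:

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """One Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# Example: from Q(s,a) = 0, receiving reward 10 with no future value,
# alpha = 0.5 and gamma = 0.9 move the estimate halfway to the target:
# q_update(0.0, 10.0, 0.0, 0.5, 0.9) == 5.0
```

Note this incremental form is algebraically identical to the $(1-\alpha)Q + \alpha(\cdot)$ form used earlier; both interpolate between the old estimate and the new target.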
- Initialization: Initialize the Q-values table for all state-action pairs, setting values to zero or small random values.
- Epsilon-Greedy Action Selection: During each episode, the agent decides between exploration (choosing a random action) and exploitation (choosing the best-known action based on current Q-values).
- Action Execution and Reward Observation: After taking an action, the agent receives a reward and transitions to the next state. In a gridworld, actions might involve moving up, down, left, or right.
- Q-value Update: Update the Q-value for the previous state and action using the reward obtained and the maximum future Q-value from the next state.
- Episode End: The episode ends when the agent reaches a terminal state (such as the goal state in a gridworld) or a set number of steps is completed.
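The optional "reduce epsilon" step in the pseudocode is commonly implemented as multiplicative decay with a floor, so the agent never stops exploring entirely. This is one common choice among several (linear decay also works); the values below are illustrative assumptions:

```python
def decay_epsilon(epsilon, decay=0.995, min_epsilon=0.01):
    """Shrink epsilon after each episode, but never below min_epsilon."""
    return max(min_epsilon, epsilon * decay)
```

Starting from `epsilon = 1.0` and calling this once per episode gradually shifts the agent from mostly exploring to mostly exploiting.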
Let's say we have a 4x4 grid, where:
- The agent starts in the top-left corner (0, 0).
- The goal is in the bottom-right corner (3, 3) with a reward of +10.
- There are obstacles with penalties (e.g., -5).
- Every step has a small negative reward to encourage reaching the goal quickly.
The agent will learn by exploring and updating its Q-values for each possible move in each grid cell.
Here's how it might look in action:
Gridworld:

```
S . . .
. X . .
. . X .
. . . G
```

Legend:
- S - Start
- G - Goal (+10 reward)
- X - Obstacle (-5 penalty)
- . - Empty space (-0.1 reward for each step)

Actions: Up, Down, Left, Right
After enough episodes, the agent will learn to navigate around obstacles and reach the goal while avoiding penalties.
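Putting it all together, here is a minimal self-contained sketch of tabular Q-learning on this 4x4 grid. The rewards come from the legend above (goal +10, obstacles -5, step cost -0.1, obstacles at the two X cells); the hyperparameters and helper names are illustrative assumptions, not a canonical implementation.

```python
import random

# 4x4 gridworld from the diagram: start at (0, 0), goal at (3, 3).
SIZE = 4
GOAL = (3, 3)
OBSTACLES = {(1, 1), (2, 2)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; bumping into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    next_state = (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state
    if next_state == GOAL:
        return next_state, 10.0, True    # goal reward, episode ends
    if next_state in OBSTACLES:
        return next_state, -5.0, False   # obstacle penalty
    return next_state, -0.1, False       # small step cost

def train(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy action selection."""
    random.seed(seed)
    actions = list(ACTIONS)
    Q = {}  # (state, action) -> value; unseen pairs default to 0.0
    for _ in range(episodes):
        state, done, n_steps = (0, 0), False, 0
        while not done and n_steps < 100:
            if random.random() < epsilon:
                action = random.choice(actions)   # explore
            else:                                 # exploit
                action = max(actions, key=lambda a: Q.get((state, a), 0.0))
            next_state, reward, done = step(state, action)
            best_next = max(Q.get((next_state, a), 0.0) for a in actions)
            old = Q.get((state, action), 0.0)
            Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
            state = next_state
            n_steps += 1
    return Q

def greedy_path(Q, max_steps=50):
    """Follow the learned greedy policy from the start cell."""
    state, path = (0, 0), [(0, 0)]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path
```

After training, following the greedy policy from the start typically traces a short path around the obstacles to the goal, which is the learned behavior the paragraph above describes.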