Q Learning Algorithm for Tic Tac Toe
Introduction
Q-learning is a reinforcement learning algorithm used to train agents to make optimal decisions in various environments. This article will explore how Q-learning can be applied to the classic game of Tic Tac Toe. We will discuss the implementation details, the Q-table, and the learning process.
Understanding the Game
Tic Tac Toe is a two-player game where the objective is to get three of your symbols (X or O) in a row, column, or diagonal. The game is simple but provides a suitable environment for applying reinforcement learning techniques.
The Q-Table
The core of Q-learning is the Q-table, which stores the expected future rewards for taking specific actions in different states. In Tic Tac Toe:
- **States:** Each possible board configuration is a state.
- **Actions:** The possible moves (placing an X or O) in each state.
- **Q-Values:** The table contains Q-values for each (state, action) pair, representing the estimated reward for taking that action in that state.
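As a concrete sketch (one possible design, not the only one), the table can be a dictionary keyed by (state, action) pairs, with the board stored as a 9-cell tuple so it is hashable. The names `q_table` and `available_actions` below are illustrative, not fixed by the article:

```python
from collections import defaultdict

# Q-table: maps a (state, action) pair to its estimated future reward.
# The state is the board as a tuple of 9 cells (' ', 'X', or 'O');
# an action is the index (0-8) of the cell the agent plays in.
q_table = defaultdict(float)   # unseen pairs default to a Q-value of 0.0

def available_actions(state):
    """Legal moves are the indices of the empty cells."""
    return [i for i, cell in enumerate(state) if cell == ' ']

# An empty board: nine legal moves, all still at the initial Q-value of 0.0.
start = tuple(' ' * 9)
print(available_actions(start))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(q_table[(start, 4)])        # 0.0
```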
Implementation Steps
- Initialize Q-table: Start with a table filled with zeros or random values.
- Play Games: Have the agent play multiple games against a random opponent or another agent.
- Update Q-values: After each move, update the Q-value for the chosen action based on the following formula:

  Q(s, a) = Q(s, a) + α [r + γ max(Q(s', a')) - Q(s, a)]

  Where:
  - s: Current state (board configuration)
  - a: Action taken (move)
  - r: Reward for taking the action (win: +1, lose: -1, draw: 0)
  - s': Next state after taking the action
  - a': Optimal action in the next state
  - α: Learning rate (controls how much to update Q-values)
  - γ: Discount factor (controls how much future rewards are valued)
- Choose Actions: The agent selects actions using an epsilon-greedy strategy:
  - With probability ε, choose a random action (exploration)
  - With probability 1-ε, choose the action with the highest Q-value for the current state (exploitation)
- Train the Agent: Repeat steps 2-4 for many games. As the agent learns, its Q-values improve, leading to better decision-making. A Python sketch of the full training loop follows this list.
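Putting the steps above together, one possible Python sketch is shown below. It is a minimal illustration rather than a reference implementation: the learner plays X against a purely random opponent, the helper names (`choose_action`, `update`, `play_episode`) are made up for this example, and the hyperparameter values are common defaults. It redefines the `q_table` and `available_actions` helpers from the earlier snippet so it can run on its own.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # learning rate, discount factor, exploration rate

q_table = defaultdict(float)               # (state, action) -> Q-value

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def available_actions(state):
    return [i for i, cell in enumerate(state) if cell == ' ']

def winner(state):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if state[a] != ' ' and state[a] == state[b] == state[c]:
            return state[a]
    return None

def choose_action(state, epsilon=EPSILON):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    actions = available_actions(state)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, terminal=False):
    """Q(s, a) += alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]."""
    future = 0.0 if terminal else max(
        (q_table[(next_state, a)] for a in available_actions(next_state)), default=0.0)
    q_table[(state, action)] += ALPHA * (reward + GAMMA * future - q_table[(state, action)])

def play_episode(epsilon=EPSILON):
    """One game: the learner plays X, a random opponent plays O. Returns 'X', 'O', or 'draw'."""
    state = tuple(' ' * 9)
    while True:
        action = choose_action(state, epsilon)
        board = list(state)
        board[action] = 'X'
        next_state = tuple(board)

        if winner(next_state) == 'X':                      # agent wins
            update(state, action, 1.0, next_state, terminal=True)
            return 'X'
        if not available_actions(next_state):              # board full: draw
            update(state, action, 0.0, next_state, terminal=True)
            return 'draw'

        # The random opponent replies; the agent's "next state" is the board after that reply.
        board = list(next_state)
        board[random.choice(available_actions(next_state))] = 'O'
        next_state = tuple(board)

        if winner(next_state) == 'O':                      # agent loses
            update(state, action, -1.0, next_state, terminal=True)
            return 'O'
        if not available_actions(next_state):              # board full: draw
            update(state, action, 0.0, next_state, terminal=True)
            return 'draw'

        update(state, action, 0.0, next_state)             # ordinary, non-terminal step
        state = next_state

# Train over many self-contained games.
for episode in range(50_000):
    play_episode()
```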
Example Q-Table (Partial)
Boards are written row by row, with `.` marking an empty cell.

| State (Board) | Action (Move) | Q-Value |
|---|---|---|
| `X . . / . . . / . O .` | Place X at (1, 1) | 0.75 |
| `X . . / . . . / . O .` | Place X at (1, 2) | -0.25 |
| `. O . / X . . / . . .` | Place O at (1, 1) | 0.80 |
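Reading such a table in code amounts to the exploitation half of the epsilon-greedy rule: take the legal move with the highest Q-value for the current board. A tiny sketch, reusing the `q_table` and `available_actions` helpers assumed above:

```python
def best_move(state):
    """Greedy play from a trained table: pick the legal move with the highest Q-value."""
    return max(available_actions(state), key=lambda a: q_table[(state, a)])

# Example: X in the top-left corner, O in the bottom row, X to move.
board = ('X', ' ', ' ',
         ' ', ' ', ' ',
         ' ', 'O', ' ')
print(best_move(board))   # whichever empty cell currently has the highest Q-value
```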
Output
With enough training, the agent learns a strong policy for Tic Tac Toe, consistently winning or drawing against a random opponent.
Game 1: X wins!
Game 2: Draw!
Game 3: O wins!
...
Game 100: X wins!
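One simple way to produce output like this is to replay a block of games after training and report each result. The snippet below assumes the `play_episode` helper from the training sketch, called with exploration switched off (it still updates the table while playing, which is harmless here):

```python
# Replay 100 games greedily (epsilon = 0) and report each result.
for game in range(1, 101):
    result = play_episode(epsilon=0.0)     # 'X', 'O', or 'draw'
    print(f"Game {game}: {'Draw' if result == 'draw' else result + ' wins'}!")
```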
Conclusion
Q-learning provides a powerful framework for teaching an agent to play Tic Tac Toe effectively. By training on many games and updating its Q-values, the agent learns to make optimal moves, maximizing its chances of winning. This approach can be adapted to other games and complex environments, demonstrating the versatility of reinforcement learning.