How to Implement Momentum and Decay Correctly – SGD

Introduction

Stochastic Gradient Descent (SGD) is a widely used optimization algorithm in machine learning. While simple and effective, SGD can sometimes struggle with oscillations and slow convergence. To address these issues, momentum and decay are commonly used techniques that enhance SGD’s performance.

Understanding Momentum

What is Momentum?

Momentum introduces a “memory” to the gradient updates: each step accumulates a fraction of the previous update and adds it to the current gradient. This helps the optimizer keep moving in a consistent direction, accelerating learning along that direction and smoothing out oscillations.

Implementing Momentum

Here’s how to implement momentum in SGD:

  1. Initialize the velocity (v) to 0.
  2. For each iteration:
    • Calculate the gradient (g).
    • Update the velocity: v = βv + g (where β is the momentum coefficient).
    • Update the parameters: θ = θ – αv (where α is the learning rate).

Typical values for the momentum coefficient (β) range from 0.5 to 0.99, with 0.9 being a common default. Higher values give the updates more “inertia”, which can speed up convergence but can also overshoot optimal points.
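To make the loop concrete, here is a minimal NumPy sketch of SGD with momentum (weight decay comes later). The toy quadratic loss f(θ) = 0.5·‖θ‖², whose gradient is simply θ, is made up purely for illustration, and the learning rate and momentum values are likewise arbitrary.

import numpy as np

# Toy problem: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([5.0, -3.0])   # parameters
v = np.zeros_like(theta)        # velocity, initialized to 0
alpha, beta = 0.1, 0.9          # learning rate and momentum coefficient

for step in range(100):
    g = theta                   # gradient of the toy loss at the current theta
    v = beta * v + g            # v = beta * v + g
    theta = theta - alpha * v   # theta = theta - alpha * v

print(theta)                    # approaches the minimum at [0, 0]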

Understanding Weight Decay

What is Weight Decay?

Weight decay (closely related to L2 regularization) penalizes large weights during training. This encourages the model to find solutions with smaller, more evenly distributed weights, reducing overfitting.

Implementing Weight Decay

Incorporating weight decay into SGD is straightforward:

  1. At each iteration, in addition to the usual gradient step, scale the weights by a factor of (1 – λ), where λ is the weight decay coefficient.

The weight decay coefficient (λ) is a small value, typically around 1e-4 or 1e-5. Larger values increase the penalty on large weights, potentially leading to smaller, simpler models.
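For a sense of scale, the snippet below applies one decay step to a hypothetical weight vector; the values are arbitrary and only illustrate how gently each step shrinks the weights toward zero. Note that some implementations fold the learning rate into this step and scale by (1 – αλ) instead; either convention applies the shrinkage at every update.

import numpy as np

# Hypothetical weights, used only to illustrate the (1 - lambda) scaling step.
w = np.array([0.5, -1.2, 3.0])
weight_decay = 1e-4

w = (1 - weight_decay) * w      # each weight shrinks slightly toward zero
print(w)                        # approximately [0.49995, -1.19988, 2.9997]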

Combining Momentum and Decay

Momentum and weight decay combine naturally in SGD. The combined algorithm updates the parameters as follows:

  1. Initialize the velocity (v) to 0.
  2. For each iteration:
    • Calculate the gradient (g).
    • Update the velocity: v = βv + g.
    • Update the parameters: θ = (1 – λ)θ – αv.

Example Implementation in Python

Code

import numpy as np

def sgd_momentum_decay(params, grads, velocity, lr, momentum, decay):
    """
    Performs one SGD update with momentum and weight decay.

    Args:
        params: Dictionary of parameter arrays.
        grads: Dictionary of gradient arrays, keyed like params.
        velocity: Dictionary of velocity arrays; pass the same dictionary
            back in on every call so momentum accumulates across iterations.
        lr: Learning rate (alpha).
        momentum: Momentum coefficient (beta).
        decay: Weight decay coefficient (lambda).

    Returns:
        The updated parameters and velocity.
    """
    for name in params:
        # Lazily initialize the velocity to zero on the first update.
        if name not in velocity:
            velocity[name] = np.zeros_like(params[name])
        # v = beta * v + g
        velocity[name] = momentum * velocity[name] + grads[name]
        # theta = (1 - lambda) * theta - alpha * v
        params[name] = (1 - decay) * params[name] - lr * velocity[name]
    return params, velocity

Usage

# Example usage:
params = {'w1': np.array([1.0, 2.0]), 'w2': np.array([3.0, 4.0])}
grads = {'w1': np.array([0.1, 0.2]), 'w2': np.array([0.3, 0.4])}
velocity = {}
lr = 0.01
momentum = 0.9
decay = 1e-4

params, velocity = sgd_momentum_decay(params, grads, velocity, lr, momentum, decay)
print(params)

Output

{'w1': array([0.9989, 1.9978]), 'w2': array([2.9967, 3.9956])}
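Because the velocity dictionary is returned and passed back in, the function can be called repeatedly inside a training loop and the momentum accumulates across iterations. A minimal sketch, reusing the function and import above with placeholder gradients (in practice the gradients would be recomputed from the data at every step):

params = {'w1': np.array([1.0, 2.0]), 'w2': np.array([3.0, 4.0])}
velocity = {}
for step in range(10):
    grads = {'w1': np.array([0.1, 0.2]), 'w2': np.array([0.3, 0.4])}  # placeholder gradients
    params, velocity = sgd_momentum_decay(params, grads, velocity,
                                          lr=0.01, momentum=0.9, decay=1e-4)
print(params)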

Conclusion

Implementing momentum and decay correctly in SGD can significantly improve the convergence speed and stability of the optimization process. These techniques are valuable additions to your machine learning toolkit for training neural networks and other models.
