Introduction
Stochastic Gradient Descent (SGD) is one of the most widely used optimization algorithms in machine learning. While simple and effective, plain SGD can oscillate across steep directions of the loss surface and make slow progress along shallow ones. Momentum and weight decay are two commonly used additions: momentum smooths and accelerates the updates, while weight decay regularizes the model, and together they noticeably improve SGD’s performance.
Understanding Momentum
What is Momentum?
Momentum introduces a “memory” to the gradient updates. Instead of using only the current gradient, the optimizer maintains a velocity: an exponentially decaying accumulation of past gradients that drives each parameter step. This lets the optimizer keep moving in directions where the gradients consistently agree, accelerating learning along those directions and smoothing out oscillations.
Implementing Momentum
Here’s how to implement momentum in SGD:
- Initialize the velocity (v) to 0.
- For each iteration:
  - Calculate the gradient (g).
  - Update the velocity: v = βv + g (where β is the momentum coefficient).
  - Update the parameters: θ = θ - αv (where α is the learning rate).
Typical values of the momentum coefficient (β) range from 0.5 to 0.99, with 0.9 being a common default. Higher values give the updates more “inertia”, which speeds up convergence along consistent directions but can overshoot minima.
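As a minimal sketch of the update rule above in NumPy (the parameter and gradient values here are made up purely for illustration):

```python
import numpy as np

theta = np.array([1.0, 2.0])   # parameters (illustrative values)
v = np.zeros_like(theta)       # velocity, initialized to 0
beta, alpha = 0.9, 0.01        # momentum coefficient and learning rate

# One iteration of the update:
grad = np.array([0.1, 0.2])    # gradient g for this iteration
v = beta * v + grad            # v = beta*v + g
theta = theta - alpha * v      # theta = theta - alpha*v
```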
Understanding Weight Decay
What is Weight Decay?
Weight decay penalizes large weights during training; for plain SGD it is equivalent to L2 regularization (up to a rescaling of the coefficient). It encourages the model to find solutions with smaller, more evenly distributed weights, which helps reduce overfitting.
Implementing Weight Decay
Incorporating weight decay into SGD is straightforward:
- At each parameter update, additionally scale the weights by a factor of (1 - λ), where λ is the weight decay coefficient, so every weight is pulled slightly toward zero.
The weight decay coefficient (λ) is a small value, typically around 1e-4 or 1e-5. Larger values penalize large weights more strongly, pushing the model toward smaller weights and simpler solutions; set too high, they can cause underfitting.
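A minimal sketch of that decay step in NumPy (the values are illustrative):

```python
import numpy as np

theta = np.array([1.0, 2.0])   # weights after the gradient step (illustrative)
lam = 1e-4                     # weight decay coefficient

theta = (1 - lam) * theta      # shrink each weight slightly toward zero
```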
Combining Momentum and Decay
Momentum and weight decay combine naturally into a single SGD update. The combined algorithm updates the parameters as follows:
- Initialize the velocity (v) to 0.
- For each iteration:
  - Calculate the gradient (g).
  - Update the velocity: v = βv + g.
  - Update the parameters: θ = (1 - λ)θ - αv.
Example Implementation in Python
Code
```python
import numpy as np

def sgd_momentum_decay(params, grads, lr, momentum, decay, velocity=None):
    """
    Performs one SGD update with momentum and weight decay.

    Args:
        params: A dictionary of parameter arrays.
        grads: A dictionary of gradient arrays (same keys as params).
        lr: Learning rate (alpha).
        momentum: Momentum coefficient (beta).
        decay: Weight decay coefficient (lambda).
        velocity: Dictionary of velocities from the previous call, or None
            on the first call. Pass it back in on later calls so momentum
            accumulates across iterations.

    Returns:
        The updated parameters and the updated velocity.
    """
    if velocity is None:
        velocity = {name: np.zeros_like(p) for name, p in params.items()}
    for name in params:
        # v = beta * v + g
        velocity[name] = momentum * velocity[name] + grads[name]
        # theta = (1 - lambda) * theta - alpha * v
        params[name] = (1 - decay) * params[name] - lr * velocity[name]
    return params, velocity
```
Usage
```python
# Example usage:
params = {'w1': np.array([1.0, 2.0]), 'w2': np.array([3.0, 4.0])}
grads = {'w1': np.array([0.1, 0.2]), 'w2': np.array([0.3, 0.4])}
lr = 0.01
momentum = 0.9
decay = 1e-4

updated_params, velocity = sgd_momentum_decay(params, grads, lr, momentum, decay)
print(updated_params)
```
Output
{'w1': array([0.9989, 1.9978]), 'w2': array([2.9967, 3.9956])}

(The exact printed digits may vary slightly with floating point rounding and NumPy’s print settings.)
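In a real training loop, the returned velocity should be passed back into the next call so that momentum can accumulate across iterations. A rough sketch continuing from the example above, with the per-step gradient computation left as a placeholder:

```python
# Continuing from the example above: reuse the returned velocity so that
# momentum accumulates across steps. In a real training loop the gradients
# would come from a backward pass on each mini-batch; here the same grads
# dictionary is reused purely as a stand-in.
num_steps = 100
velocity = None
for step in range(num_steps):
    params, velocity = sgd_momentum_decay(params, grads, lr, momentum, decay,
                                          velocity=velocity)
```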
Conclusion
Implementing momentum and weight decay correctly in SGD can significantly improve the convergence speed and stability of training. These techniques are valuable additions to your machine learning toolkit for training neural networks and other models.