Adaptive optimizers, which adjust the learning rate for individual parameters, have become the standard for training deep neural networks. One such optimizer is AdamW, a popular adaptive method that maintains two optimizer state values (momentum and variance) per parameter, so its optimizer state alone takes twice as much memory as the model parameters when stored at the same precision. Many proposed memory-efficient optimizers claim to match AdamW’s performance but lack its desirable qualities, such as robustness to learning rate changes. This robustness is especially valuable when pre-training LLMs, where searching over combinations of hyperparameters to find the ideal setting is infeasible due to time, cost, and compute constraints. We propose Eve, a Memory Efficient AdaptiVe Moment Estimation algorithm that saves memory by reducing the variance term while preserving AdamW’s desirable properties across different training settings. We finetune Llama 2 70B on 64 GPUs and show memory savings of 20% over AdamW. We also compare our method to Adam-mini, a recent, well-received memory-efficient optimizer, and demonstrate better training stability across various learning rates.
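To make the memory cost mentioned in the abstract concrete, here is a minimal sketch of a textbook AdamW update in plain Python. It illustrates only the baseline optimizer, not Eve, whose treatment of the variance term the abstract does not detail. The function name `adamw_step`, the toy values, and the default hyperparameters are assumptions chosen for illustration.

```python
import math

def adamw_step(params, grads, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One in-place AdamW update over flat lists of floats (illustrative only)."""
    for i, g in enumerate(grads):
        # Decoupled weight decay: shrink the parameter directly.
        params[i] -= lr * weight_decay * params[i]
        # First moment (momentum): running mean of gradients.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        # Second moment (variance): running mean of squared gradients.
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        # Per-parameter adaptive step.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy values, hypothetical and not taken from the talk.
params = [0.5, -1.0, 2.0]
grads = [0.1, -0.2, 0.3]
m = [0.0] * len(params)  # one momentum value per parameter
v = [0.0] * len(params)  # one variance value per parameter
adamw_step(params, grads, m, v, step=1)

# Two state values are stored for every parameter.
state_values_per_param = (len(m) + len(v)) / len(params)
```

Because both `m` and `v` have one entry per parameter, the optimizer state in this sketch is twice the size of the parameter list, which is the overhead a memory-efficient optimizer aims to cut.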
Slides will be available for download here after the presentation.
Aditya Tomar is a second-year undergraduate student in the Electrical Engineering & Computer Sciences Department at UC Berkeley. His research with the Parallel Software and Systems Group under Dr. Abhinav Bhatele at UMD includes efficient optimization methods for deep learning, algorithms for distributed training of large models across thousands of accelerators, and performance analysis tools for parallel programs.