Scheduling ML Training on Unreliable Spot Instances

Cloud providers rent out surplus computational resources as spot instances at a deep discount. However, these cheap spot instances are revocable. When demand surges for higher priced on-demand instances, cloud providers can interrupt these spot instances after a brief alert. Such unreliability makes it challenging to utilize spot instances for many long-running jobs. However, with checkpoints and restoration, machine-learning (ML) training jobs are a good candidate to overcome this difficulty. In this paper, we formalize the problem of scheduling ML-training jobs on transient spot instances, especially from an ML researcher's view, who may have some grant/credit for renting cloud computing services for several ML training tasks. Such a researcher would need to partition the computational resources wisely to maximize outcome (or total expected utility of all jobs) while maintaining some fairness between jobs. We investigate the trade-off between low-cost/interruptible and high-cost/uninterruptible computation, by proposing a linear-programming (LP) rounding based polynomial time algorithm. Based on the LP solution, we also give an LP-based heuristic that performs well in practice. We implement and evaluate these algorithms, and are able to achieve the same utility with 23% to 48% of the budget needed with on-demand instances. Moreover, the total utility we get is close to the theoretical upper bound under various settings, indicating close to optimal performance.

Learn More

Publications

Scheduling ML Training on Unreliable Spot Instances

Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing Companion (UCC)

Publication date: February 7, 2022

Sheng Yang, Samir Khuller, Sunav Choudhary, Subrata Mitra, Kanak Mahadik

Research Areas: AI & Machine Learning Systems & Languages