Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

Authors: 

Sebastien Levy, Randolph Yao, Youjiang Wu, and Yingnong Dang, Microsoft Azure; Peng Huang, Johns Hopkins University; Zheng Mu, Microsoft Azure; Pu Zhao, Microsoft Research; Tarun Ramani, Naga Govindaraju, and Xukun Li, Microsoft Azure; Qingwei Lin, Microsoft Research; Gil Lapid Shafriri and Murali Chintalapati, Microsoft Azure

Abstract: 

When a failure occurs in production systems, the highest priority is to quickly mitigate it. Despite its importance, failure mitigation is done in a reactive and ad-hoc way: taking some fixed actions only after a severe symptom is observed. For cloud systems, such a strategy is inadequate. In this paper, we propose a preventive and adaptive failure mitigation service, Narya, that is integrated in a production cloud, Microsoft Azure's compute platform. Narya predicts imminent host failures based on multi-layer system signals and then decides on smart mitigation actions. The goal is to avert VM failures. Narya's decision engine takes a novel online experimentation approach to continually explore the best mitigation action. Narya further enhances the adaptive decision capability through reinforcement learning. Narya has been running in production for 15 months. On average, it reduces VM interruptions by 26% compared to the previous static strategy.

BibTeX
@inproceedings {258943,
author = {Sebastien Levy and Randolph Yao and Youjiang Wu and Yingnong Dang and Peng Huang and Zheng Mu and Pu Zhao and Tarun Ramani and Naga Govindaraju and Xukun Li and Qingwei Lin and Gil Lapid Shafriri and Murali Chintalapati},
title = {Predictive and Adaptive Failure Mitigation to Avert Production Cloud {VM} Interruptions},
booktitle = {14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {1155--1170},
url = {https://www.usenix.org/conference/osdi20/presentation/levy},
publisher = {USENIX Association},
month = nov
}
