ViSRE: A Unified Visual Analysis Dashboard for Proactive Cloud Outage Management

IEEE Working Conference on Software Visualization (VISSOFT)

Publication date: October 2, 2022

Paula Kayongo, Jane Hoffswell, Shiv Saini, Shaddy Garg, Eunyee Koh, Haoliang Wang, Tom Jacobs

Efficient outage detection and remediation are crucial for effectively operating cloud computing systems. To remediate outages, system engineers must quickly identify the causal relationships between metrics and correlate events across multiple monitoring tools. In practice, this process remains largely reactive due to the complexity and general lack of interpretability of such monitoring environments. This work presents ViSRE: a visual analytics system that integrates causal and predictive models with interactive visualizations to aid in proactive cloud outage management. We develop enhanced node representations for our causal graph to support system engineers in performing root cause analysis and reasoning about causality chains in multi-dimensional temporal data. We report the results of a quantitative assessment of the proposed predictive models, which show good performance guarantees. To evaluate and refine our system, we conduct a study with six cloud system engineers, who verify that our proposed techniques can support proactive cloud maintenance by intuitively displaying temporal relationships between predicted and raw data. By correlating and presenting data from disparate sources, ViSRE also reduces context-switching costs and the time spent manually correlating events during remediation of time-critical outages.

Research Area: Human Computer Interaction