The increasing complexity and scale of modern software systems make it challenging to accurately detect system issues and outages, a problem commonly framed as anomaly detection. Such anomalous events are rare, and annotating them is time-consuming and impractical in large data streams. Even with automated anomaly detection, resolving issues promptly remains a challenge, because it requires specific context such as root causes and the targeted or affected services. To address these problems, we present Grid Transformer (GT), a framework designed to detect and explain log anomalies in an unsupervised setting. We first train an Auto-Encoder model to generate pseudo labels. We then train the proposed Grid Transformer, which not only predicts anomalies but also explains why a particular instance is an anomaly. Through extensive experiments, we demonstrate the effectiveness of our approach: it outperforms other log anomaly detection models by 20% while also generating time-wise and message-wise explanations of the anomalies.
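The pseudo-labeling step can be illustrated with a minimal sketch. The abstract does not say how the Auto-Encoder's output is turned into labels, so the code below assumes one common recipe: fit an auto-encoder on unlabeled data, score each instance by its reconstruction error, and mark the highest-error instances as anomalous. Everything in it is hypothetical rather than taken from the paper: a linear auto-encoder fitted in closed form (equivalent to PCA) stands in for the actual model, and the function names, the synthetic features, and the 0.95 quantile threshold are illustrative choices.

```python
import numpy as np


def fit_linear_autoencoder(X, k):
    """Fit a k-dimensional linear auto-encoder in closed form (assumed stand-in).

    The optimal linear auto-encoder projects onto the top-k principal
    directions, so the SVD of the centered data gives tied encoder/decoder weights.
    """
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def reconstruction_error(X, mean, W):
    """Squared reconstruction error of each row of X."""
    Z = (X - mean) @ W.T      # encode into the k-dimensional bottleneck
    X_hat = Z @ W + mean      # decode back to the input space
    return np.sum((X - X_hat) ** 2, axis=1)


def pseudo_labels(errors, quantile=0.95):
    """Label instances above the chosen error quantile as anomalous (1)."""
    threshold = np.quantile(errors, quantile)
    return (errors > threshold).astype(int)


# Synthetic stand-in for log features (an assumption, not real log data):
# 200 "normal" windows near a 1-D subspace of R^3, then 5 windows pushed off it.
rng = np.random.default_rng(0)
direction = np.array([[1.0, 2.0, -1.0]])
offset = np.array([[2.0, -1.0, 0.0]])   # orthogonal to `direction`
normal = rng.normal(size=(200, 1)) @ direction + 0.01 * rng.normal(size=(200, 3))
anomalous = rng.normal(size=(5, 1)) @ direction + offset
X = np.vstack([normal, anomalous])

mean, W = fit_linear_autoencoder(X, k=1)
errors = reconstruction_error(X, mean, W)
labels = pseudo_labels(errors)  # pseudo labels for a downstream supervised model
```

Under this assumption, `labels` would play the role of the pseudo labels used to train the second-stage model. Because a quantile threshold always flags a fixed fraction of the data, some normal instances are labeled anomalous as well, which is one reason such labels are treated as noisy supervision rather than ground truth.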