Fast Natural Language Based Data Exploration with Samples

International Conference on Management of Data (SIGMOD)

Publication date: June 18, 2023

Shubham Agarwal, Gromit Yeuk-Yin Chan, Shaddy Garg, Tong Yu, Subrata Mitra

Extracting insights from large amounts of data in a timely manner is a crucial problem. Analysts commonly use Exploratory Data Analysis (EDA) to uncover insights through a sequence of SQL commands and associated visualizations. In many cases, however, this process is carried out by non-programmers who must work within tight time constraints, such as a marketer who must quickly analyse large amounts of data during a campaign to reach a target revenue. This paper presents ApproxEDA, a system that combines a natural language processing (NLP) interface for insight discovery with an underlying sample-based EDA engine. The NLP interface converts high-level questions into contextual SQL queries over the dataset, while the backend EDA engine significantly speeds up insight discovery by selecting the most suitable sample from among many samples pre-created with various sampling strategies. We demonstrate that ApproxEDA addresses two key aspects: converting high-level natural language inputs to contextual SQL, and intelligently selecting samples using a reinforcement learning agent. This design protects users from diverging from their original intent of analysis, which can occur due to approximation errors in results and visualizations, while still providing optimal latency reduction through the use of samples.
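
The idea of choosing among pre-created samples can be illustrated with a small sketch. This is not ApproxEDA's implementation: the abstract does not describe the agent's design, so a simple epsilon-greedy bandit stands in for the reinforcement learning agent here. The sample names, the latency and error figures, and the reward weighting are all hypothetical values chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    name: str       # hypothetical label for a pre-created sample
    latency: float  # assumed relative query latency (1.0 = full data)
    error: float    # assumed relative approximation error in [0, 1]


def reward(sample: Sample, error_weight: float = 2.0) -> float:
    """Higher is better: penalize slow samples and inaccurate samples."""
    return -(sample.latency + error_weight * sample.error)


class EpsilonGreedySelector:
    """Stand-in for an RL agent: a bandit over the available samples."""

    def __init__(self, samples, epsilon: float = 0.1, seed: int = 0):
        self.samples = list(samples)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.samples)
        self.values = [0.0] * len(self.samples)  # running mean reward

    def choose(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.samples))
        return max(range(len(self.samples)), key=lambda i: self.values[i])

    def update(self, index: int, observed_reward: float) -> None:
        # Incremental update of the mean reward for the chosen sample.
        self.counts[index] += 1
        self.values[index] += (observed_reward - self.values[index]) / self.counts[index]

    def best(self) -> Sample:
        # Only consider samples that have been tried at least once.
        tried = [i for i, count in enumerate(self.counts) if count > 0]
        if not tried:
            raise ValueError("no sample has been evaluated yet")
        return self.samples[max(tried, key=lambda i: self.values[i])]


# Hypothetical catalogue of pre-created samples.
SAMPLES = [
    Sample("full_data", latency=1.00, error=0.00),
    Sample("uniform_1pct", latency=0.05, error=0.30),
    Sample("stratified_10pct", latency=0.15, error=0.05),
]


def pick_sample(rounds: int = 50) -> Sample:
    selector = EpsilonGreedySelector(SAMPLES)
    for _ in range(rounds):
        index = selector.choose()
        selector.update(index, reward(SAMPLES[index]))
    return selector.best()
```

In this toy setup the selector settles on the sample that best balances speed against accuracy under the assumed weighting. Raising `error_weight` shifts the choice toward more accurate but slower samples, which mirrors the trade-off between latency reduction and preserving the user's intent of analysis.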

Research Area: Data Intelligence