Visualization-Aware Sampling

A data reduction technique
for fast visualization of big data

Big Data, Poor Latency

You have billions of data points, but they are simply too big to visualize. This is a common frustration when working with large databases: the dataset is too big to explore interactively and to gain useful insights from.

Our proposed approach, Visualization-Aware Sampling (VAS), makes interactive visualization of big data possible by reducing the number of data points while minimizing the quality loss that stems from the data reduction process.
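
The intuition is to choose the sample so that the retained points cover the regions a scatter plot would actually show, rather than drawing them uniformly from the data. The sketch below is not the VAS algorithm from the paper; it is a simple greedy farthest-point (coverage-style) heuristic in Python, included only to make that intuition concrete.

```python
# A minimal sketch of a coverage-driven subsampling heuristic.
# NOTE: this is NOT the VAS algorithm from the ICDE 2016 paper; it is a
# plain greedy farthest-point selection used only for illustration.
import numpy as np

def greedy_coverage_sample(points: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Pick k rows of `points` (shape [n, d]) that spread out over the data space."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]                  # start from a random point
    # distance from every point to its nearest already-chosen point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(dist.argmax())                     # farthest from the current sample
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]
```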

A Quick Example

To illustrate, we used VAS to sample only 10,000 points out of more than 2 billion data points and visualized them in the scatter plot below. Even though it shows only a tiny fraction of the original dataset (less than 0.0005%!), it preserves many important structures of the original data: we can still see that the data points mainly reside on the continents of the world.

For comparison, we also visualize below the same number of data points (10K) obtained with uniform random sampling. The time it takes to render is almost identical, since both cases draw the same number of points; however, the two samples produce significantly different visualizations.
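
As a rough illustration of how the two strategies diverge, the snippet below runs the hypothetical greedy_coverage_sample sketch from above on synthetic 2D data and plots it next to a uniform random sample of the same size; it is a toy comparison, not a reproduction of the figures on this page.

```python
# Toy side-by-side comparison on synthetic 2D data (assumes the
# greedy_coverage_sample sketch above is defined in the same file).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
points = rng.normal(size=(100_000, 2))        # stand-in for a real dataset
k = 1_000                                     # small sample so the demo runs quickly

coverage = greedy_coverage_sample(points, k)  # hypothetical helper from the sketch above
uniform = points[rng.choice(len(points), size=k, replace=False)]

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].scatter(coverage[:, 0], coverage[:, 1], s=2)
axes[0].set_title("Coverage-driven sample")
axes[1].scatter(uniform[:, 0], uniform[:, 1], s=2)
axes[1].set_title("Uniform random sample")
plt.show()
```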

Publications

Yongjoo Park, Michael Cafarella, Barzan Mozafari. “Visualization-Aware Sampling for Very Large Databases.” ICDE 2016. (pdf)