Yongjoo Park    pyongjoo@umich.edu

I am a Ph.D. candidate in Computer Science and Engineering at the University of Michigan, Ann Arbor. My research interests are (i) building real-time big data analytics systems and (ii) developing statistical and machine learning algorithms for smarter computational engines in those systems. During my Ph.D. studies, I have developed algorithms for various data analytics applications, including approximate query processing, data visualization, and search in high-dimensional spaces.

My advisors are Michael Cafarella and Barzan Mozafari, and I am part of the Database Group at the University of Michigan. My curriculum vitae is available here. My master's and Ph.D. studies have additionally been supported by the Jeongsong Cultural Foundation and the Kwanjeong Educational Foundation, respectively. I am best reached by email: pyongjoo@umich.edu.


Data Analytics System that Becomes Smarter over Time (SIGMOD'17, CIDR'17)

Database learning is our vision for building an intelligent database system that uses its own knowledge of the underlying data to produce enhanced answers to new queries. We have applied this general vision to build an approximate query processing (AQP) system that becomes smarter as it processes more queries. Our AQP system combines the (relatively less accurate) approximate answer to a new query with its model of the underlying data to produce a more accurate approximate answer, where the model is built from the approximate answers to past queries. As a result, our system greatly reduces the expected errors (and the actual errors as well) of the approximate answers generated by popular sampling-based AQP engines. This error reduction also translates into query processing speedups when targeting a given level of AQP accuracy.
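To give a rough feel for why combining two answers can reduce error, here is a minimal sketch that merges a sampling-based estimate with a model-based estimate using inverse-variance weighting. This is only an illustration under simplifying assumptions (two independent, unbiased estimates with known variances); it is not the inference procedure used in our system, and the function and parameter names are hypothetical.

```python
def combine_estimates(sample_est, sample_var, model_est, model_var):
    """Inverse-variance weighted combination of two estimates.

    Assumption (for illustration only): both estimates are unbiased and
    independent, and their variances are known and positive.
    """
    w_sample = 1.0 / sample_var  # weight of the sampling-based answer
    w_model = 1.0 / model_var    # weight of the model-based answer
    combined_var = 1.0 / (w_sample + w_model)
    combined_est = combined_var * (w_sample * sample_est + w_model * model_est)
    return combined_est, combined_var


# Hypothetical numbers: a sampling-based answer and a model-based answer.
est, var = combine_estimates(100.0, 4.0, 104.0, 4.0)
print(est, var)
```

Under these assumptions, the combined variance is never larger than the smaller of the two input variances, which mirrors the intuition that knowledge from past queries can tighten the answer to a new one.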

Presentations:
North East Database Day 2016, Oral, MIT
Midwest Big Data Opportunities and Challenges Workshop 2016, Chicago
The biennial Conference on Innovative Data Systems Research (CIDR) 2017 Gongshow
ACM SIGMOD International Conference on Management of Data (SIGMOD) 2017

Approximate Searching in High-dimensional Space (VLDB'16)

Our hashing-based search algorithm, Neighbor-Sensitive Hashing, significantly improves the accuracy of approximate search and, as a result, the speed of the search process for the same target search accuracy, compared to a decade of research on hashing-based search algorithms that began with the well-known Locality-Sensitive Hashing. The performance improvements of our algorithm stem from the fact that our hash functions are designed to capture information on only a relatively small number of similar items, whereas existing work captures the similarity (or distance) information on all pairs of data items. As a result, our search algorithm excels when the number of similar items is small relative to the total number of items in a database, which is mostly the case for search applications.
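For readers unfamiliar with hashing-based search, the sketch below shows the general idea using random-hyperplane hashing, a classic Locality-Sensitive Hashing scheme, with candidates ranked by Hamming distance between binary codes. It is a baseline illustration only and does not implement Neighbor-Sensitive Hashing; all function names and parameters are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(vec, hyperplanes):
    """Binary code: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


def search(query, items, hyperplanes, k):
    """Return the indices of the k items whose codes are closest to the query's."""
    q_code = hash_vector(query, hyperplanes)
    codes = [hash_vector(item, hyperplanes) for item in items]
    ranked = sorted(range(len(items)), key=lambda i: hamming(q_code, codes[i]))
    return ranked[:k]


planes = make_hyperplanes(num_bits=16, dim=3)
items = [[-1.0, -2.0, -3.0], [1.0, 2.0, 3.0]]
print(search([1.0, 2.0, 3.0], items, planes, k=1))
```

In this baseline, every hash bit is spent on separating arbitrary pairs of items; the paragraph above describes how our hash functions instead focus on distinguishing the relatively few items that are close to a query.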

Presentations:
International Conference on Very Large Data Bases (VLDB) 2016, Delhi, India
3rd Workshop on Web-scale Vision and Social Media at ICCV 2015, Santiago, Chile

Sampling for Data Visualization Techniques (ICDE'16)

We investigate an optimal sampling (or equivalently, data reduction) method for visualizing large-scale databases using scatter plots (or similar plots). This project starts from the observation that traditional sampling methods, such as uniform random sampling and stratified sampling, have limited ability to generate scatter plots that look (visually) similar to the original plot (which can be obtained by visualizing the entire database after hours of processing). In contrast, our newly developed method is specialized for approximating the visual quality of scatter plots and provides much higher-quality visualizations.
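As a point of reference, the two traditional baselines mentioned above can be sketched as follows. This is only an illustration of uniform random sampling and a simple grid-based stratified sampling over 2D points; it is not our proposed method, and the function names, the grid cell size, and the per-cell quota are assumptions made for the example.

```python
import random
from collections import defaultdict


def uniform_sample(points, k, seed=0):
    """Pick k points uniformly at random (dense regions dominate the result)."""
    rng = random.Random(seed)
    return rng.sample(points, min(k, len(points)))


def stratified_sample(points, cell_size, per_cell, seed=0):
    """Bucket points into square grid cells and keep up to per_cell from each."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    sample = []
    for key in sorted(cells):  # fixed cell order keeps the output reproducible
        members = cells[key]
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample


# Hypothetical data: one dense cluster plus a single far-away point.
points = [(i / 1000.0, (i * 7 % 1000) / 1000.0) for i in range(1000)]
points.append((50.5, 50.5))
print(len(uniform_sample(points, 10)), len(stratified_sample(points, 1.0, 2)))
```

Neither baseline directly optimizes how the resulting scatter plot looks, which is the gap our method targets.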

Our demo with 2 billion GPS recordings from the OpenStreetMap project is available here.

Presentations: International Conference on Data Engineering (ICDE) 2016, Helsinki, Finland

Feature Engineering System for Machine Learning (CIDR'13)

This work identifies the need for a system dedicated to the feature engineering process, which is essential for building high-quality machine learning systems. This idea was further pursued by Mike Anderson, resulting in a concrete proposal for a feature engineering system.

Presentations: The biennial Conference on Innovative Data Systems Research (CIDR) 2013, Asilomar, California