HoloClean++: End-to-End Error Detection and Repair with Few-Shot Learning
Ihab Ilyas, University of Waterloo
Date: Tuesday, May 07, 2019
Time: 4:00 PM to 5:00 PM
Location: Star (32-D463)
Event Type: Seminar
Host: Tim Kraska, CSAIL & EECS
Contact: Sheila M. Marian, 617-253-1996, firstname.lastname@example.org
Speaker URL: None
Abstract: HoloClean++ is an end-to-end machine learning framework for data profiling and cleaning (error detection and repair). Data cleaning has been recognized as one of the main hurdles that hinder effective analytics and cripple the effort to build end-to-end analytics systems. Data scientists and practitioners spend most of their time spotting errors, manually repairing data, or writing complex code in an ad-hoc effort to automate the cleaning process. The framework has had multiple successful deployments cleaning census data, as well as pilots with commercial enterprises to boost the quality of source (training) data before it is fed to downstream analytics.
HoloClean++ builds two main probabilistic models: a data generation model (describing what the intended clean data looks like) and a pollution, or realization, model (describing how errors might be introduced into that intended clean data). The framework uses few-shot learning, data augmentation, and weak supervision to learn the parameters of these models, and uses them to predict both errors and their possible repairs.
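The interplay between the two models can be illustrated with a toy noisy-channel sketch (the function names, the corruption model, and the string-similarity weighting are illustrative assumptions, not HoloClean++'s actual implementation): an empirical distribution over attribute values plays the role of the generation model, a similarity-weighted corruption probability plays the role of the realization model, and a repair is chosen by maximizing P(clean) · P(observed | clean).

```python
from collections import Counter
from difflib import SequenceMatcher

def learn_generation_model(column):
    """Toy generation model: empirical distribution over observed values."""
    counts = Counter(column)
    total = len(column)
    return {value: count / total for value, count in counts.items()}

def realization_score(observed, clean, eps=0.3):
    """Toy realization (pollution) model: a cell keeps its intended value
    with probability 1 - eps; otherwise it is corrupted, with corruptions
    into similar strings more likely (unnormalized, which is fine for argmax)."""
    if observed == clean:
        return 1.0 - eps
    return eps * SequenceMatcher(None, observed, clean).ratio()

def best_repair(observed, gen_model, eps=0.3):
    """MAP repair: argmax over clean values of P(clean) * P(observed | clean)."""
    return max(gen_model,
               key=lambda clean: gen_model[clean] * realization_score(observed, clean, eps))

# A city column where "Chicagoo" is most plausibly a typo for "Chicago":
# the prior favors "Chicago" and the realization model makes the
# one-character corruption cheap, so the MAP repair flips the typo.
column = ["Chicago"] * 8 + ["Boston"] * 3 + ["Chicagoo"]
model = learn_generation_model(column)
print(best_repair("Chicagoo", model))  # -> Chicago
print(best_repair("Boston", model))    # -> Boston
```

In the real system, the generation model is far richer than per-column frequencies (it captures dependencies across attributes and tuples), and few-shot learning and weak supervision stand in for the hand-set `eps` used here.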
While the idea of using statistical inference to model the joint distribution of the underlying data is not new, the problem has always been: (1) how to scale a model with millions of data cells (corresponding to random variables); and (2) how to get enough training data to learn complex models capable of accurately predicting anomalies and repairs. HoloClean++ tackles exactly these two problems.
In this talk, I will highlight the theoretical background as well as the engineering effort in building the system.
Bio: Ihab Ilyas is a professor in the Cheriton School of Computer Science and the NSERC-Thomson Reuters Research Chair on data quality at the University of Waterloo. His main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, machine learning for data curation, and information extraction. Ihab is also a co-founder of Tamr, a startup focusing on large-scale data integration. He is a recipient of the Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award, and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, elected SIGMOD vice chair, and an associate editor of the ACM Transactions on Database Systems (TODS). He holds a PhD in computer science from Purdue University, West Lafayette.
Created by Sheila M. Marian on Monday, May 06, 2019 at 9:20 AM.