SampleClean: Fast and Accurate Query Processing on Dirty Data

Speaker: Jiannan Wang, UC Berkeley

Date: Thursday, October 09, 2014

Time: 3:00 PM to 4:00 PM (note: all times are in the Eastern Time Zone)

Refreshments: 2:45 PM

Public: Yes

Location: Hewlett Room (32-G882)

Host: Samuel Madden

Contact: Sheila M. Marian, 617-253-1996

The vision of the AMPLab is to integrate Algorithms (Machine Learning), Machines (Cloud Computing) and People (Crowdsourcing) to make sense of Big Data. Over the past several years, the lab has developed a variety of open-source software systems (e.g., Spark and MLBase) that integrate these three resources. For the People part, one of our main focuses is data cleaning. Real-world data is often “dirty”, and cleaning it is usually a tedious, time-consuming process that requires a lot of human work. In the AMPLab, we have used crowdsourcing to reduce this human cost. While crowdsourcing makes data cleaning more scalable, it is still highly inefficient for large datasets. To overcome this limitation, we started the SampleClean project last year. The project investigates how to obtain accurate query results from dirty data by cleaning only a small sample of the data. We achieved this goal by marrying data cleaning with sampling-based approximate query processing and by addressing many challenging statistical issues. We have built a new system that combines our work on crowdsourced data cleaning with SampleClean query processing. An initial version of the system shows that it can help users obtain very accurate query results on dirty data at significantly reduced cleaning cost.

Jiannan Wang is a postdoc in the AMPLab at UC Berkeley, where he works with Michael Franklin and leads the SampleClean project. His research focuses on developing algorithms and systems for extracting value from “dirty” data. He obtained a PhD from the Computer Science Department at Tsinghua University. During his PhD, he was a visiting scholar at the Chinese University of Hong Kong and UC Berkeley, and an intern at the Qatar Computing Research Institute. His PhD research was supported by a Google PhD Fellowship, a Boeing Scholarship, and the “New PhD Researcher Award” from the Chinese Ministry of Education. His PhD dissertation won the China Computer Federation (CCF) Distinguished Dissertation Award. His similarity-join algorithm won first place in the EDBT String Similarity Search/Join Competition.

This event is not part of a series.

Created by Sheila M. Marian on Wednesday, October 01, 2014 at 1:45 PM.