## Harry Lang: Coresets for k-Means-Type Problems on Streams

Speaker: Harry Lang , Johns Hopkins University

Date: Wednesday, July 05, 2017

Time: 4:00 PM to 5:00 PM Note: all times are in the Eastern Time Zone

Public: Yes

Location: 32-G575

Event Type:

Room Description:

Host: Pritish Kamath and Akshay Degwekar, CSAIL MIT

Relevant URL:

Speaker URL: None

Speaker Photo:
None

Reminders to: seminars@csail.mit.edu, theory-seminars@csail.mit.edu

Reminder Subject: TALK: RESCHEDULED: July 5, 2017: Harry Lang: Coresets for k-Means-Type Problems on Streams

Abstract: Let f be a non-negative symmetric dissimilarity measure. Given sets of points P (the input data) and C (a "query"), we define F(P,C) = \sum_{p \in P} \min_{c \in C} f(p,c). An example that fits into this framework is k-means clustering where f = distance^2 and we restrict |C| = k.

An $\epsilon$-coreset for P is a set Q such that F(Q,C) is within a factor of $(1 + \epsilon)$ of F(P,C) for any query C. For data reduction, the goal is to construct a coreset Q of minimal cardinality. We introduce improved coreset constructions for a wide class of functions that obey a weak form of the triangle inequality: this includes k-means, k-median, Tukey's bi-weight function, and the Huber loss function.

In the streaming model, the input P (a set of n points) arrives sequentially and algorithms attempt to maintain a coreset using a minimal amount of memory. We introduce a new technique for maintaining coresets in the streaming model with O(1) overhead. For example, the previous state-of-the-art for maintaining an $\epsilon$-coreset for metric k-means on a stream required the storage of $O(\epsilon^{-4} k \log k \log^6 n)$ points whereas our method requires $O(\epsilon^{-2} k \log k \log n)$ points.

Joint work with Vladimir Braverman and Dan Feldman.

Research Areas:

Impact Areas:

See other events that are part of the Algorithms and Complexity Seminar Series 2016/2017.

Created by Rebecca Yadegar at Monday, June 05, 2017 at 4:50 PM.