From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection

Speaker: Ellen Vitercik , Stanford University

Date: Tuesday, March 19, 2024

Time: 4:15 PM to 5:15 PM Note: all times are in the Eastern Time Zone

Public: Yes

Location: 32-G449

Event Type: Seminar

Room Description: Refreshments at 4:00pm

Host: Costis Daskalakis, CSAIL MIT

Contact: Joanne Talbot Hanley, 617-253-6054,

Relevant URL:

Speaker URL: None

Speaker Photo:

Reminders to:,

Reminder Subject: TALK: TOC Seminar: Ellen Vitercik "From Large to Small Datasets: Size Generalization for Clustering Algorithm Selection," Tuesday, March 19 at 4:15pm in 32-G449 (Refreshments at 4:00pm)

Abstract: In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We verify these findings both theoretically and empirically.
This is joint work with Vaggos Chatziafratis and Ishani Karmarkar.

Research Areas:

Impact Areas:

See other events that are part of the Theory of Computation (ToC) Seminar 2024.

Created by Joanne Talbot Hanley Email at Monday, March 11, 2024 at 9:27 AM.