Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering

Speaker: William Kuszmaul , MIT

Date: Wednesday, February 16, 2022

Time: 4:00 PM to 5:00 PM Note: all times are in the Eastern Time Zone

Public: Yes


Event Type: Seminar

Room Description:

Host: Julian Shun, MIT CSAIL

Contact: Linda Lynch,

Relevant URL:

Speaker URL:

Speaker Photo:

Reminders to:,,,,

Reminder Subject: TALK: Linear Probing Revisited: Tombstones Mark the Demise of Primary Clustering

******************IMPORTANT NOTE ABOUT REGISTRATION******************
- The registration link for Spring 2022 is the same as the link from Fall 2021.
- Please save the Zoom link that you receive after you register. This link will stay the same for all subsequent Fast Code seminars.
- Zoom does not recognize a second registration, and will not send out the link a second time. The organizers will not be notified of any second registration.
- If you have any problems with registration, please contact by 3:30pm on the day of the seminar, so that we can try to resolve it before the seminar begins.

Abstract: The linear-probing hash table is one of the oldest and most widely used data structures in computer science. However, linear probing also famously comes with a major drawback: as soon as the hash table reaches a high memory utilization, elements within the hash table begin to cluster together, causing insertions to become slow. This phenomenon, now known as "primary clustering", was first captured by Donald Knuth in 1963; at a load factor of $1 - 1/x$, the expected time per insertion becomes $\Theta(x^2)$, rather than the more desirable $\Theta(x)$.

We show that there is more to the story than the classic analysis would seem to suggest. It turns out that small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions. If these design decisions are made correctly, then even a hash table that is continuously at a load factor $1 - \Theta(1/x)$ can achieve average insertion time $\tilde{O}(x)$. A key insight is that the tombstones left behind by deletions cause a surprisingly strong "anti-clustering" effect, and that when insertions and deletions are one-for-one, the anti-clustering effects of deletions actually overpower the clustering effects of insertions.

We also present a new variant of linear probing, which we call "graveyard hashing", that completely eliminates primary clustering on any sequence of operations. If, when an operation is performed, the current load factor is $1 - 1/x$ for some $x$, then the expected cost of the operation is $O(x)$. One corollary is that, in the external-memory model with a data block size of $B$, graveyard hashing offers the following remarkable guarantee: at any load factor $1-1/x$ satisfying $x = o(B)$, graveyard hashing achieves $1 + o(1)$ expected block transfers per operation. Past external-memory hash tables have only been able to offer a $1+o(1)$ guarantee when the block size $B$ is at least $\Omega(x^2)$.

Based on joint work with Michael A. Bender and Bradley C. Kuszmaul.
Appeared in FOCS 2021

Bio: William Kuszmaul is a 4th-year PhD student in MIT's EECS department, where he is advised by Charles E. Leiserson. William's research focuses on how we can use modern probabilistic arguments to get improved bounds for classical problems in algorithms and data structures. He is a 2018 Hertz Fellow as well as a 2018 NSF GRFP fellow. Prior to attending MIT, William received his bachelor's degrees in mathematics and computer science at Stanford University.

Research Areas:
Algorithms & Theory, Systems & Networking

Impact Areas:

This event is not part of a series.

Created by Julian J. Shun Email at Wednesday, February 09, 2022 at 10:46 AM.