Thesis Defense: Efficient Physiological Time Series Retrieval and Prediction via Locality-Sensitive Hashing
Y. Bryce Kim
, Anyscale Learning for All Group, CSAIL
Date: Tuesday, May 09, 2017
Time: 3:00 PM to 4:00 PM
Location: 32-D463 (Star)
Host: Una-May O'Reilly, Anyscale Learning for All Group, CSAIL
Contact: Bryce Kim, firstname.lastname@example.org
Speaker URL: None
TALK: Thesis Defense: Efficient Physiological Time Series Retrieval and Prediction via Locality-Sensitive Hashing
Abstract: The amount of time series data collected in the medical community has recently been exploding due to widespread affordable sensors and storage devices. However, while it provides enormous opportunities for machine learning to make significant impacts, the massive repositories of such physiological time series data are largely untapped due to their granular detail and overwhelming size. Besides scale, fast yet accurate processing of physiological waveform data is desired in medical practice, especially in time-critical settings such as the intensive care unit (ICU). Efficiently leveraging these massive datasets is a key challenge that, when resolved, will support a new paradigm of scientific discovery and operational innovation in medicine.
In this thesis, we develop highly efficient similarity based methods that make it practical to search massive physiological time series repositories to rapidly identify waveforms similar to those from a given individual. We call this concept "patients with trajectories like mine." Our goal is to exploit rapid similar waveform retrieval to enable critical event prediction in the ICU setting. In order to achieve this goal, we propose to apply locality-sensitive hashing (LSH), which supports a very fast approximate nearest neighbor search in high dimensions. We empirically demonstrate that LSH based retrieval and prediction methods vastly speed up querying time while sacrificing only a trivial amount of accuracy as a cost.
Despite being fast and accurate, the generic LSH has two shortcomings. First, it is capable of utilizing only one similarity measure at a time. To overcome this limit, we introduce Stratified LSH (SLSH) which finds similarity among the data from a more integrated perspective by employing multiple distance metrics in one framework. SLSH is essentially a dual-level hierarchical LSH where each LSH layer is associated with a distinct distance metric capturing a unique facet of similarity. The second shortcoming and the main bottleneck of the generic LSH is that it involves exhaustive distance calculations as a subroutine when short-listing the candidate set to find the final nearest neighbors. To surmount this, we propose Collision Frequency LSH (CFLSH) which short-lists the candidate set by simply counting the frequency of collision based on the key idea that the more frequently an element and a query collide across multiple LSH hash tables, the more similar they are. We show that with SLSH and CFLSH, we improve the efficiency of LSH in terms of both prediction accuracy and querying speed.
We demonstrate our proposed methods on the mean arterial blood pressure dataset extracted from the MIMIC II database in the context of predicting acute hypotensive episodes in ICU. To examine the generality of our methods with respect to scaling, we validate our methods on datasets with various dimensions and item counts.
Created by Bryce Kim at Thursday, May 04, 2017 at 3:24 PM.