Scalable Data Management for High-Throughput Genomics

Speaker: Uwe Roehm, The University of Sydney

Date: Thursday, December 12, 2013

Time: 4:00 PM to 5:00 PM (all times are in the Eastern Time Zone)

Refreshments: 3:45 PM

Public: Yes

Location: 32-D463

Host: Samuel Madden

Contact: Sheila M. Marian, 617-253-1996

With today's DNA sequencing technology, one can sequence an individual genome within a few days for a fraction of the cost of the original Human Genome Project (an estimated $3 billion over 10 years). The ultimate goal is the personal genome within a few hours as a hospital lab test, which would revolutionize modern health care and research areas such as cancer and HIV research. This also means that genomics labs face several terabytes of data per week that must be processed efficiently.

This talk explores the potential and the current limitations of using database technology for high-throughput genomics. In particular, we are interested in supporting the initial stages of a typical high-throughput DNA sequencing pipeline. The talk gives an overview of the BioSeqDB project, in which we explored the applicability of extensible databases and SQL for declarative processing of bio-data. One specific result was Blue, a new efficient algorithm for error-correcting raw sequence data that combines statistical methods and scalable data processing algorithms based on k-mer consensus. Blue outperforms existing error-correction algorithms by up to two orders of magnitude in throughput while achieving higher accuracy on both Illumina and 454 data.
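
For readers unfamiliar with k-mer-based error correction, the general idea can be sketched as follows. This is not the Blue algorithm itself; it is a minimal illustration that assumes a k-mer seen fewer than min_count times across all reads is probably the result of a sequencing error, and that a single-base substitution may repair it. The function names (count_kmers, correct_read) and the min_count threshold are hypothetical and chosen only for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def correct_read(read, counts, k, min_count=2):
    """Return the read, or the single-base substitution of it that
    leaves the fewest 'weak' k-mers (k-mers seen < min_count times).

    Illustrative assumption: rare k-mers indicate sequencing errors.
    """
    def weak_kmers(seq):
        return sum(
            1
            for i in range(len(seq) - k + 1)
            if counts[seq[i:i + k]] < min_count
        )

    best, best_weak = read, weak_kmers(read)
    if best_weak == 0:
        return read  # every k-mer is well supported; nothing to fix

    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            weak = weak_kmers(candidate)
            if weak < best_weak:
                best, best_weak = candidate, weak
    return best


if __name__ == "__main__":
    # Three clean copies of a read plus one copy with a substitution error.
    reads = ["ACGTACGTAC"] * 3 + ["ACGTTCGTAC"]
    counts = count_kmers(reads, k=4)
    print(correct_read("ACGTTCGTAC", counts, k=4))
```

This brute-force search is quadratic in read length and handles only one substitution per read; production tools use far more efficient data structures and also deal with insertions and deletions, which matter for 454 data.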

About the Speaker:
Uwe Roehm is an associate professor of database systems at the University of Sydney. He is a computer science graduate of the University of Passau, Germany, and received his doctoral degree in 2002 from ETH Zurich, Switzerland. His research interests are cloud data management, databases on multicore servers, data replication, and data management for bioinformatics.

This event is not part of a series.

Created by Sheila M. Marian on Monday, December 09, 2013 at 2:02 PM.