Augmenting k-mer sketching for (meta)genomic sequence comparisons

Speaker: William Yu, Carnegie Mellon University (CMU)

Date: Wednesday, October 18, 2023

Time: 11:30 AM to 1:00 PM (Eastern Time)

Public: Yes

Event Type: Seminar

Host: Bonnie Berger, CSAIL MIT

Contact: Shuvom Sadhuka, ssadhuka@mit.edu

Relevant URL: https://mit.zoom.us/j/93513735220

Reminders to: seminars@csail.mit.edu, bioinfo-seminar@lists.csail.mit.edu

Reminder Subject: TALK: Augmenting k-mer sketching for (meta)genomic sequence comparisons

Over the last decade, k-mer sketching (e.g., minimizers or MinHash), which creates succinct summaries of long sequences, has proven effective at speeding up sequence comparisons. However, rigorously characterizing the accuracy of these techniques has been more difficult. In this talk, I'll touch on three results that showcase recent theoretical developments and practical applications of that theory to building faster sequence comparison tools for metagenomics.
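For readers unfamiliar with the sketching idea mentioned above, here is a minimal Python illustration of a bottom-s MinHash sketch and the Jaccard estimate derived from it. It is only a sketch under stated assumptions, not code from the talk or from any of the tools it discusses: the function names, the choice of BLAKE2b as the hash, and the omission of canonical (strand-independent) k-mers are all simplifications made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of distinct k-mers in seq (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a fixed, deterministic hash."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_s_sketch(seq, k, s):
    """Keep the s smallest hash values among the k-mers of seq."""
    return sorted({hash_kmer(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate Jaccard similarity from two bottom-s sketches.

    The s smallest hashes of the merged sketches form a random sample of
    the union; the fraction of that sample present in both sketches
    estimates |A & B| / |A | B|.
    """
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:s]
    if not merged:
        return 0.0
    return sum(1 for h in merged if h in a and h in b) / len(merged)
```

When s is at least the number of distinct k-mers in both sequences, the estimate coincides with the exact Jaccard similarity of the two k-mer sets; smaller values of s trade accuracy for a shorter summary.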

We begin by rigorously providing average-case guarantees for the popular seed-chain-extend heuristic for pairwise sequence alignment under a random substitution model, showing that it is accurate and runs in close to O(n log n) time for similar sequences. Then, we will turn our focus to metagenomics: our new tool skani computes average nucleotide identity (ANI) using sparse approximate alignments, and is both more accurate and over 20 times faster than the current state-of-the-art FastANI for comparing incomplete, fragmented MAGs (metagenome-assembled genomes). This was enabled by Belbasi et al.'s work showing that minimizers are biased Jaccard estimators, whereas other k-mer sketching methods do not have that drawback. Finally, we will introduce sylph (unpublished work), which enables fast and accurate database search to find nearest-neighbor genomes (in ANI space) of low-coverage sequenced samples by combining k-mer sketching with a zero-inflated Poisson correction (45x faster than MetaPhlAn for screening databases).
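To make the link between k-mer sharing and ANI concrete, the following Python sketch uses a standard simplification: if substitutions occur independently at each base, a k-mer survives unmutated with probability ANI^k, so the fraction of shared k-mers (the containment) can be inverted to an ANI estimate. This is an illustrative assumption only. It is not how skani computes ANI (the abstract describes sparse approximate alignments), and it omits the coverage correction described for sylph; all function names here are hypothetical.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query, reference, k):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    q = kmer_set(query, k)
    if not q:
        raise ValueError("query is shorter than k")
    return len(q & kmer_set(reference, k)) / len(q)


def ani_from_containment(c, k):
    """Invert c = ANI**k, which holds under an independent substitution model."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("containment must lie in [0, 1]")
    return c ** (1.0 / k)
```

For example, with k = 21 a genome pair at 95% identity is expected to share only about 0.95^21 of its k-mers, roughly a third, which is why the k-th root is needed to map containment back to the ANI scale.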

All of the work in this talk is joint with my brilliant PhD student Jim Shaw.

Shaw J, Yu YW. Proving sequence aligners can guarantee accuracy in almost O(m log n) time through an average-case analysis of the seed-chain-extend heuristic. Genome Research (2023) 33(7), 1175-1187.

Shaw J, Yu YW. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023).

Zoom link: https://mit.zoom.us/j/93513735220

Part of the Bioinformatics Seminar Series 2023.

Created by Jose Abola on Friday, September 22, 2023 at 10:24 AM.