Thesis Defense: Haoyang Zeng, "Machine Learning Models for Functional Genomics and Therapeutic Design"

Speaker: Haoyang Zeng, MIT CSAIL - Gifford Lab

Date: Thursday, April 25, 2019

Time: 10:00 AM to 12:00 PM

Public: Yes

Location: 32-G575

Event Type: Thesis Defense

Host: Professor David Gifford, MIT CSAIL

Contact: Linda Lynch, 617 715 2459

Reminder Subject: TALK: Haoyang Zeng, Thesis Defense: "Machine Learning Models for Functional Genomics and Therapeutic Design"


Because the training data available have been limited in size, machine learning models for biology have remained rudimentary and inaccurate despite significant advances in machine learning research. With the recent advent of high-throughput sequencing technology, an exponentially growing number of genomic and proteomic datasets have been generated. These large-scale datasets enable the training of high-capacity machine learning models that characterize sophisticated features and produce accurate predictions on unseen examples. In this thesis, we attempt to develop advanced machine learning models for functional genomics and therapeutic design, two areas with ample data deposited in public databases and tremendous clinical implications. The shared theme of these models is to learn how the composition of a biological sequence encodes a functional phenotype, and then to leverage that knowledge to provide insight for target discovery and therapeutic design.

First, we design three machine learning models that predict transcription factor binding and DNA methylation, two fundamental epigenetic phenotypes closely tied to gene regulation, from DNA sequence alone. We show that these epigenetic phenotypes can be well predicted from the sequence context. Moreover, the predicted change in phenotype between the reference and alternate alleles of a genetic variant accurately reflects its functional impact and improves the identification of regulatory variants causal for complex diseases.
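
As a rough illustration of the reference-versus-alternate comparison described above, the following is a minimal sketch and not the thesis implementation. The function names, the "GATA" motif, and the motif-counting scorer are all hypothetical stand-ins; a trained sequence model would take the place of `toy_binding_score`.

```python
from typing import Callable


def toy_binding_score(seq: str, motif: str = "GATA") -> float:
    """Hypothetical stand-in for a trained model: counts motif occurrences."""
    width = len(motif)
    return float(sum(seq[i:i + width] == motif for i in range(len(seq) - width + 1)))


def variant_effect(ref_seq: str, pos: int, alt: str,
                   model: Callable[[str], float]) -> float:
    """Predicted phenotype change: model(alternate) - model(reference)."""
    if not 0 <= pos < len(ref_seq):
        raise ValueError("pos is outside the reference sequence")
    # Substitute the alternate base at a 0-based position (single-base variant).
    alt_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return model(alt_seq) - model(ref_seq)


reference = "CCAGATAAGG"  # made-up sequence containing one GATA site
delta = variant_effect(reference, pos=4, alt="C", model=toy_binding_score)
print(delta)  # negative here: the variant disrupts the only motif match
```

A negative score suggests the alternate allele weakens the predicted phenotype, and a score near zero suggests little predicted effect.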

Second, we devise two machine learning models that improve the prediction of peptides displayed by the major histocompatibility complex (MHC) on the cell surface. Computational modeling of peptide display by MHC is central to the design of peptide-based therapeutics. Our first machine learning model introduces the capacity to quantify uncertainty in the computational prediction and proposes a new metric for peptide prioritization that reduces false positives in high-affinity peptide design. The second model improves the state-of-the-art performance in MHC-ligand prediction by employing a deep language model to learn the sequence determinants for auxiliary processes in MHC-ligand selection, such as proteasome cleavage, that are omitted by existing methods due to the lack of labeled data.

Third, we develop machine learning frameworks to model the enrichment of an antibody sequence in phage-panning experiments against a target antigen. We show that the number of antibodies with low specificity can be reduced by a computational procedure using machine learning models trained for multiple targets. Moreover, machine learning can help to design novel antibody sequences with improved affinity.

Research Areas:
AI & Machine Learning, Computational Biology

Impact Areas:
Education, Health Care

This event is not part of a series.

Created by Linda Lynch on Thursday, April 18, 2019 at 12:19 PM.