Multi-view Learning of Speech Features with Linear and Non-linear Canonical Correlation Analysis
Karen Livescu, Toyota Technological Institute at Chicago
Date: Monday, November 25, 2013
Time: 4:00 PM to 5:00 PM Note: all times are in the Eastern Time Zone
Refreshments: 3:45 PM
Location: 32-G882 (Stata Center - Hewlett Room)
Host: Jim Glass, MIT CSAIL
Contact: Marcia G. Davidson, 617-253-3049, firstname.lastname@example.org
Speaker URL: None
TALK: Multi-view Learning of Speech Features with Linear and Non-linear Canonical Correlation Analysis
This talk describes an approach to learning improved acoustic feature vectors for automatic speech recognition. Typically, speech recognizers use a parametrization of the acoustic signal based on mel-frequency cepstral coefficients (MFCCs), perceptual linear prediction coefficients (PLPs), or related representations. It is often possible to improve performance by forming large feature vectors consisting of multiple consecutive frames of such standard features, followed by dimensionality reduction using a learned transformation. The learned transformation may be unsupervised (e.g., principal components analysis) or supervised (e.g., linear discriminant analysis, neural network-based representations).
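As a rough illustration of the pipeline described above (not code from the talk), the sketch below splices consecutive frames of toy 13-dimensional "MFCC-like" features and then applies unsupervised dimensionality reduction with PCA. The `splice` helper, the context width, and the output dimensionality are all illustrative choices, not values from the speaker's work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a 13-dimensional MFCC sequence: 100 frames x 13 coefficients.
frames = rng.standard_normal((100, 13))

def splice(feats, context=3):
    """Stack each frame with its +/- `context` neighbors (edges padded by repetition)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

spliced = splice(frames, context=3)          # 100 x 91 (7 frames x 13 dims each)

# Unsupervised dimensionality reduction with PCA: project the spliced vectors
# onto the leading eigenvectors of their covariance matrix.
centered = spliced - spliced.mean(axis=0)
cov = centered.T @ centered / (len(centered) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
top = eigvecs[:, ::-1][:, :40]               # keep the 40 leading components
reduced = centered @ top                     # 100 x 40 learned features
```

A supervised alternative (LDA, or a neural network bottleneck) would replace only the last step, learning the projection from labels rather than from variance alone.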
This talk will describe a recent approach that is unsupervised but uses a second "view" of the data (in our case, articulatory measurements) as additional information for transformation learning. The approach we take, using canonical correlation analysis (CCA) and its nonlinear extensions, finds representations of the two views that are maximally correlated. This approach avoids some of the disadvantages of other unsupervised methods such as PCA, which are sensitive to noise and data scaling, and possibly those of supervised methods, which tend to be more task-specific.
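The linear CCA objective can be sketched concretely. The toy example below (my own minimal implementation, not the speaker's code) builds two synthetic views that share a latent signal, then recovers maximally correlated projections by whitening each view and taking the SVD of the whitened cross-covariance; the singular values are the canonical correlations. The variable names and the small regularizer `reg` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two toy "views" sharing a common latent signal, e.g. acoustic (X) and
# articulatory (Y) measurements for the same 500 frames.
z = rng.standard_normal((500, 2))                     # shared latent signal
X = z @ rng.standard_normal((2, 10)) + 0.1 * rng.standard_normal((500, 10))
Y = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((500, 6))

def linear_cca(X, Y, k, reg=1e-4):
    """Return k-dim projections of each view that are maximally correlated."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    n = len(Xc)
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])    # regularized covariances
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n

    def inv_sqrt(S):
        """Inverse matrix square root of a symmetric positive definite matrix."""
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # Whiten each view; the SVD of the cross-covariance gives canonical pairs.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    U, corrs, Vt = np.linalg.svd(T)
    A = inv_sqrt(Sxx) @ U[:, :k]                      # projection for view 1
    B = inv_sqrt(Syy) @ Vt.T[:, :k]                   # projection for view 2
    return Xc @ A, Yc @ B, corrs[:k]

Zx, Zy, corrs = linear_cca(X, Y, k=2)
```

Because the projections depend only on correlation, not on the scale of either view, the method is insensitive to per-dimension rescaling of the inputs; this is one of the advantages over PCA noted above.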
This talk will cover recent work in this setting using CCA, its nonlinear extension via kernel CCA, and a newly proposed parametric nonlinear extension using deep neural networks, dubbed deep CCA. Results to date show that the approach can be used to improve performance on tasks such as phonetic classification and recognition, and that the improvements generalize to new speakers for whom no data from the second view is available.
Time permitting, the talk will also include additional recent work using articulatory information for other tasks, including low-resource spoken term detection, lexical access, and sign language recognition.
Karen Livescu is an Assistant Professor at TTI-Chicago. She completed her PhD at MIT in the CSAIL Spoken Language Systems group, and was a post-doctoral lecturer in the MIT EECS department. Karen's main research interests are in speech and language processing, with a slant toward combining machine learning with knowledge about linguistics and speech science. Her recent work has included multi-view learning of speech representations, articulatory models of pronunciation variation, discriminative training with low resources for spoken term detection and pronunciation modeling, and automatic sign language recognition. She is a member of the IEEE Spoken Language Technical Committee, associate editor for IEEE Transactions on Audio, Speech, and Language Processing, and subject editor for Speech Communication. She is an organizer/co-organizer of a number of recent workshops, including the ISCA SIGML workshops on Machine Learning in Speech and Language Processing, the Midwest Speech and Language Days, and the Interspeech Workshop on Speech Production in Automatic Speech Recognition.
Created by Marcia G. Davidson on Thursday, October 24, 2013 at 4:53 PM.