Dr. Elena Parkhomenko

Research Assistant, University Health Network

Biostatistician, Hospital for Sick Children

Title: "Sparse Canonical Correlation Analysis"

Large-scale genomic studies of the association of gene expression with
multiple phenotypic or genotypic measures may require the identification
of complex multivariate relationships. In multivariate analysis a common
way to inspect the relationship between two sets of variables based on
their correlation is Canonical Correlation Analysis, which determines linear
combinations of all variables of each type with maximal correlation between
the two linear combinations. However, in high-dimensional data analysis,
when the number of variables under consideration exceeds tens of thousands,
linear combinations of the entire sets of features may lack biological
plausibility and interpretability. In addition, insufficient sample size
may lead to computational problems, inaccurate estimates of parameters and
non-generalizable results. These problems may be solved by selecting sparse
subsets of variables, i.e., by obtaining sparse loadings in the linear
combinations of variables of each type. However, available methods providing
sparse solutions, such as Sparse Principal Component Analysis, consider each
type of variables separately and focus on the correlation within each set of
measurements rather than between sets. We introduce a new methodology, Sparse
Canonical Correlation Analysis (SCCA), which examines the relationships of
many variables of different types simultaneously. It solves the problem of
biological interpretability by providing sparse linear combinations that
include only a small subset of variables. SCCA maximizes the correlation
between the subsets of variables of different types while performing
variable selection. In large-scale genomic studies, sparse solutions are also
consistent with the belief that only a small proportion of genes are expressed
under a certain set of conditions.
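
To make the idea of sparse loadings concrete, here is a minimal Python sketch
that finds one pair of sparse loading vectors by alternating soft-thresholded
power iterations on the sample cross-correlation matrix. It rests on
simplifying assumptions (each variable is standardized and the within-set
covariance matrices are treated as identity) and is not necessarily the
algorithm presented in this talk. The function names and the penalty
parameters lam_u and lam_v are hypothetical, chosen for illustration only.
Larger penalties set more loadings exactly to zero, which is how variable
selection happens in this sketch.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry toward zero by lam; entries with |a| <= lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def normalize(a):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Y, lam_u=0.1, lam_v=0.1, n_iter=100, tol=1e-6):
    """Illustrative sparse CCA for the first pair of loading vectors.

    X is n x p, Y is n x q, with rows as samples. Assumes no column is
    constant. Returns (u, v, corr), where corr is the sample correlation
    between X @ u and Y @ v (NaN if either vector is thresholded to zero).
    """
    # Standardize every variable to mean 0 and variance 1.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)

    # p x q sample cross-correlation matrix between the two variable sets.
    K = X.T @ Y / X.shape[0]

    # Start from the leading singular vectors of K (the dense solution).
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        # Update each side given the other, then sparsify and rescale.
        u_new = normalize(soft_threshold(K @ v, lam_u))
        v_new = normalize(soft_threshold(K.T @ u_new, lam_v))
        converged = (np.linalg.norm(u_new - u) < tol
                     and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break

    corr = np.corrcoef(X @ u, Y @ v)[0, 1]
    return u, v, corr
```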

In this work I present methodology for SCCA and evaluate its properties
using simulated data. I illustrate the practical use of SCCA by applying it
to the study of natural variation in human gene expression for which the
data have been provided as problem 1 for the fifteenth Genetic Analysis
Workshop (GAW15). I also present two extensions of SCCA: adaptive SCCA
and modified adaptive SCCA. Their performance is evaluated and compared
using simulated data, and adaptive SCCA is applied to the GAW15 data.