Dr. Elena Parkhomenko

Research Assistant, University Health Network

Biostatistician, Hospital for Sick Children

*Title:* "Sparse Canonical Correlation Analysis"

*Abstract:*

Large-scale genomic studies of the association of gene expression with

multiple phenotypic or genotypic measures may require the identification

of complex multivariate relationships. In multivariate analysis a common

way to inspect the relationship between two sets of variables based on

their correlation is Canonical Correlation Analysis, which determines linear

combinations of all variables of each type with maximal correlation between

the two linear combinations. However, in high-dimensional data analysis,

when the number of variables under consideration exceeds tens of thousands,

linear combinations of the entire sets of features may lack biological

plausibility and interpretability. In addition, insufficient sample size

may lead to computational problems, inaccurate estimates of parameters and

non-generalizable results. These problems may be solved by selecting sparse

subsets of variables, i.e. obtaining sparse loadings in the linear

combinations of variables of each type. However, available methods providing

sparse solutions, such as Sparse Principal Component Analysis, consider each

type of variables separately and focus on the correlation within each set of

measurements rather than between sets. We introduce a new methodology, Sparse

Canonical Correlation Analysis (SCCA), which examines the relationships of

many variables of different types simultaneously. It solves the problem of

biological interpretability by providing sparse linear combinations that

include only a small subset of variables. SCCA maximizes the correlation

between the subsets of variables of different types while performing

variable selection. In large-scale genomic studies, sparse solutions are also

consistent with the belief that only a small proportion of genes are expressed

under a certain set of conditions.
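
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how sparse loadings for two sets of variables might be obtained. It assumes an alternating soft-thresholding scheme applied to the sample cross-correlation matrix, with the within-set covariance matrices approximated by their diagonals. The function names, the penalty parameters lam_u and lam_v, and the simulated data are hypothetical and chosen for illustration; they are not taken from the talk.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry toward zero by lam; entries smaller than lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def unit(a):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca(X, Y, lam_u, lam_v, n_iter=200, tol=1e-8):
    """Illustrative sparse CCA for a single pair of loading vectors.

    Assumption: columns are standardized, so K below is the sample
    cross-correlation matrix between the two sets of variables.
    """
    n = X.shape[0]
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    K = Xs.T @ Ys / (n - 1)

    # Start from the leading singular vectors of K (the non-sparse solution).
    U, _, Vt = np.linalg.svd(K, full_matrices=False)
    u, v = U[:, 0], Vt[0]

    for _ in range(n_iter):
        # Update one loading vector with the other held fixed, then sparsify.
        u_new = unit(soft_threshold(unit(K @ v), lam_u))
        v_new = unit(soft_threshold(unit(K.T @ u_new), lam_v))
        change = np.linalg.norm(u_new - u) + np.linalg.norm(v_new - v)
        u, v = u_new, v_new
        if change < tol:
            break
    return u, v


# Simulated example: one latent factor drives 3 of 30 X-variables
# and 2 of 20 Y-variables; every other variable is pure noise.
rng = np.random.default_rng(0)
n, p, q = 200, 30, 20
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
X[:, :3] = z[:, None] + 0.5 * rng.normal(size=(n, 3))
Y[:, :2] = z[:, None] + 0.5 * rng.normal(size=(n, 2))

u, v = sparse_cca(X, Y, lam_u=0.2, lam_v=0.2)
print("selected X variables:", np.flatnonzero(u).tolist())
print("selected Y variables:", np.flatnonzero(v).tolist())
```

In this sketch the penalties lam_u and lam_v control how many loadings are set exactly to zero. In practice they would have to be tuned, for example by cross-validation.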

In this work I present the methodology for SCCA and evaluate its properties

using simulated data. I illustrate the practical use of SCCA by applying it

to the study of natural variation in human gene expression for which the

data have been provided as Problem 1 of the fifteenth Genetic Analysis

Workshop (GAW15). I also present two extensions of SCCA: adaptive SCCA

and modified adaptive SCCA. Their performance is evaluated and compared

using simulated data, and adaptive SCCA is applied to the GAW15 data.