International Society for Clinical Biostistics conference 2019

Probabilistic latent variable modelling and inference with high dimensional data and correlated outcomes

Said el Bouhaddani1, Hae-Won Uh1, Sander van der Laan2, Folkert Asselbergs2, Jeanine Houwing-Duistermaat1,3, René Eijkemans1

1Dept. of Biostatistics and Research Support, UMC Utrecht, div. Julius Centre, the Netherlands 2Dept. of Cardiology, Division Heart & Lungs, UMC Utrecht, the Netherlands 3Dept. of Statistics, University of Leeds, United Kingdom

Context: Atherosclerosis underlies many common cardiovascular diseases, and is an important factor in the global death rate. It is characterised by plaque formation in the arteries. This formation is studied by quantifying histological characteristics of the plaque. Since a substantial part of atherosclerosis is explained by genetic variation, several studies have focussed on the genetic contributions to plaque formation. Typically, univariate approaches are used to link one histology trait to one genetic variant. However, atherosclerosis is a complex polygenic disease and the histological markers are correlated. Traditional methods ignore correlations between the outcome variables and cannot handle many predictors.

Objective: Our aim is to develop a multivariate method to estimate genetic contributions to histological characteristics of atherosclerosis simultaneously across high dimensional predictors and correlated outcomes.

Method: Extending Probabilistic Partial Least Squares (PPLS)[1], we propose Probabilistic Orthogonal PLS (POPLS), a latent variable model for simultaneously associating a set of predictors with a set of outcomes. The model decomposes both sets in joint and residual parts. The joint parts comprise latent variables that capture the relation between predictors and outcomes. Statistical inference is then performed in the lower dimensional latent space. The mapping to this space provides information about significant genetic regions involved in the joint parts. The novelty in POPLS is firstly the model linking the latent variables that better represents predictors-outcomes associations, and secondly the inclusion of genetic-specific latent variables in the model of the predictors to correct for substantial genetic variation unrelated to the outcomes. Maximum likelihood estimators are calculated using an EM algorithm that can handle high dimensional data. From the associated information matrix, asymptotic standard errors are obtained, enabling statistical inference.

Results: Simulations are conducted to evaluate the POPLS performance in terms of bias, power and coverage probabilities. Secondly, we apply POPLS to 80 million genetic variants and 9 histology traits from 1440 patients. From the results, we identify which genetic regions are associated with histology and how strong this association is.

Conclusions: PO2PLS provides a statistical framework to simultaneously estimate and infer associations between high dimensional predictors and correlated outcomes.

[1] el Bouhaddani, S., Uh, H.-W., Hayward, C., Jongbloed, G., & Houwing-Duistermaat, J. (2018). Probabilistic partial least squares model: Identifiability, estimation and application. Journal of Multivariate Analysis, 167, 331–346. http://doi.org/10.1016/j.jmva.2018.05.009 https://arxiv.org/abs/1706.03597