Use of multivariate statistical methods for the analysis of metabolomic data
- Hervás Marín, David
- José Manuel Prats Montalbán Director
- Agustín Lahoz Rodríguez Director
Universidade de defensa: Universitat Politècnica de València
Fecha de defensa: 07 de outubro de 2019
- Sebastiá Balasch Parisi Presidente/a
- Antonio Pineda Lucena Secretario
- Riccardo Leardi Vogal
Tipo: Tese
Resumo
In the last decades, advances in technology have enabled the gathering of an increasingly amount of data in the field of biology and biomedicine. The so called "-omics" technologies such as genomics, epigenomics, transcriptomics or metabolomics, among others, produce hundreds, thousands or even millions of variables per data set. The analysis of 'omic' data presents different complexities that can be methodological and computational. This has driven a revolution in the development of new statistical methods specifically designed for dealing with these type of data. To this methodological complexities one must add the logistic and economic restrictions usually present in scientific research projects that lead to small sample sizes paired to these wide data sets. This makes the analyses even harder, since there is a problem in having many more variables than observations. Among the methods developed to deal with these type of data there are some based on the penalization of the coefficients, such as lasso or elastic net, others based on projection techniques, such as PCA or PLS, and others based in regression or classification trees and ensemble methods such as random forest. All these techniques work fine when dealing with different 'omic' data in matrix format (IxJ), but sometimes, these IxJ data sets can be expanded by taking, for example, repeated measurements at different time points for each individual, thus having IxJxK data sets that raise more methodological complications to the analyses. These data sets are called three-way data. In this cases, the majority of the cited techniques lose all or a good part of their applicability, leaving very few viable options for the analysis of this type of data structures. One useful tool for analyzing three-way data, when some Y data structure is to be predicted, is N-PLS. N-PLS reduces the inclusion of noise in the models and obtains more robust parameters when compared to PLS while, at the same time, producing easy-to-understand plots. Related to the problem of small sample sizes and exorbitant variable numbers, comes the issue of variable selection. Variable selection is essential for facilitating biological interpretation of the results when analyzing 'omic' data sets. Often, the aim of the study is not only predicting the outcome, but also understanding why it is happening and also what variables are involved. It is also of interest being able to perform new predictions without having to collect all the variables again. Because all of this, the main goal of this thesis is to improve the existing methods for 'omic' data analysis, specifically those for dealing with three-way data, incorporating the ability of variable selection, improving predictive capacity and interpretability of results. All this will be implemented in a fully documented R package, that will include all the necessary functions for performing complete analyses of three-way data. The work included in this thesis consists in a first theoretical-conceptual part where the idea and development of the algorithm takes place, as well as its tuning, validation and assessment of its performance. Then, a second empirical-practical part comes where the algorithm is compared to other variable selection methodologies. Finally, an additional programming and software development part is presented where all the R package development takes place, and its functionality and capabilities are exposed. The development and validation of the technique, as well as the publication of the R package, has opened many future research lines.