Visual research on the trustability of classical variable selection methods in Cox regression

Creative Commons License


HACETTEPE JOURNAL OF MATHEMATICS AND STATISTICS, vol.49, no.2, pp.869-886, 2020 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 49 Issue: 2
  • Publication Date: 2020
  • Doi Number: 10.15672/hujms.630402
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, zbMATH, TR DİZİN (ULAKBİM)
  • Page Numbers: pp.869-886
  • Hacettepe University Affiliated: Yes


Multivariate models such as the Cox regression model, if developed carefully, are powerful tools for prognostic prediction and are frequently used in studies of clinical outcomes. Many applications require a large number of variables to be modelled using a relatively small patient sample. Determining the important variables in a model is critical to understanding the behaviour of the phenomenon under study, because these are the independent variables that contribute the most to the outcome. From a practical perspective, a small subset of independent variables is usually selected from a large data set without any loss of predictive efficiency. Automatic variable selection algorithms are commonly used in scientific studies to obtain interpretable and practically applicable models. However, careless use of these methods may lead to statistical problems. The resulting models may perform poorly because of violated assumptions, omission of important variables, overfitting, multicollinearity, and outliers. To enhance the accuracy of a model, it is essential to explore the data and its main characteristics before making any statistical inference. This study suggests an approach for obtaining a trustworthy model selection procedure for survival data by performing classical variable selection methods accompanied by a graphical visualization method, namely the robust coplot. The robust coplot makes it possible to investigate, in a single graph, the discrimination of observations, clusters of variables, and clusters of observations that are highly characterized by a particular variable. We present an application of the combined method, as an integral part of statistical modelling, to survival data on multiple myeloma to show how coplot results can be used with automatic variable selection algorithms in Cox regression model building.