Optimal training and test sets design for machine learning

GENÇ B., Tunc H.

TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, vol.27, no.2, pp.1534-1545, 2019 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 27 Issue: 2
  • Publication Date: 2019
  • Doi Number: 10.3906/elk-1807-212
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, TR DİZİN (ULAKBİM)
  • Page Numbers: pp.1534-1545
  • Keywords: Distribution matching, instance selection, training set selection, optimization, PROTOTYPE SELECTION, MUTUAL INFORMATION, ALGORITHM
  • Hacettepe University Affiliated: Yes


In this paper, we describe histogram matching, a metric for measuring the distance between two datasets with exactly the same features, and embed it into a mixed-integer programming formulation to partition a dataset into fixed-size training and test subsets. The partition is done such that the pairwise distances between the dataset and the subsets are minimized with respect to histogram matching. We then conduct a numerical study using a well-known machine learning dataset. We demonstrate that the training set constructed with our approach provides feature distributions almost the same as those of the whole dataset, whereas training sets constructed via random sampling end up with significant differences. We also show that our method introduces neither positive nor negative bias in the prediction accuracy of a decision tree, used as a representative example of a machine learning method.
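
The idea in the abstract can be illustrated with a minimal sketch. This record does not give the paper's exact definition of histogram matching or its mixed-integer program, so everything below is an assumption made for illustration: equal-width bins per feature, an L1 distance between normalized histograms, and a simple best-of-many random splits search standing in for the optimization model. The function names (histogram, histogram_distance, best_of_random_splits) are hypothetical and do not come from the paper.

```python
import random


def histogram(values, lo, hi, bins):
    """Normalized equal-width histogram of values over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant feature: every value falls into bin 0
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def histogram_distance(full, subset, bins=5):
    """Assumed metric: sum over features of the L1 gap between histograms.

    Bin edges are taken from the full dataset so both histograms are comparable.
    """
    total = 0.0
    for j in range(len(full[0])):
        col_full = [row[j] for row in full]
        col_sub = [row[j] for row in subset]
        lo, hi = min(col_full), max(col_full)
        h_full = histogram(col_full, lo, hi, bins)
        h_sub = histogram(col_sub, lo, hi, bins)
        total += sum(abs(a - b) for a, b in zip(h_full, h_sub))
    return total


def best_of_random_splits(data, train_size, trials=200, seed=0):
    """Heuristic stand-in for the optimization model (not the paper's method).

    Draws random fixed-size partitions and keeps the one whose training and
    test subsets are jointly closest to the full dataset.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        train = [data[i] for i in idx[:train_size]]
        test = [data[i] for i in idx[train_size:]]
        score = histogram_distance(data, train) + histogram_distance(data, test)
        if best is None or score < best[0]:
            best = (score, train, test)
    return best
```

Calling best_of_random_splits(data, train_size) on a list of numeric feature rows returns a (score, train, test) tuple; a lower score means the subsets' per-feature histograms sit closer to those of the whole dataset. An exact solver would search the same objective over all fixed-size partitions instead of sampling them.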