Optimal training and test sets design for machine learning


GENÇ B., Tunc H.

TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, vol.27, no.2, pp.1534-1545, 2019 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 27 Issue: 2
  • Publication Date: 2019
  • DOI: 10.3906/elk-1807-212
  • Journal Name: TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, TR DİZİN (ULAKBİM)
  • Page Numbers: pp.1534-1545
  • Keywords: Distribution matching, instance selection, training set selection, optimization, PROTOTYPE SELECTION, MUTUAL INFORMATION, ALGORITHM
  • Affiliated with Hacettepe University: Yes

Abstract

In this paper, we describe histogram matching, a metric for measuring the distance between two datasets with exactly the same features, and embed it into a mixed integer programming formulation that partitions a dataset into fixed-size training and test subsets. The partition is chosen such that the pairwise distances between the dataset and the subsets are minimized with respect to histogram matching. We then conduct a numerical study using a well-known machine learning dataset. We demonstrate that the training set constructed with our approach provides feature distributions almost the same as those of the whole dataset, whereas training sets constructed via random sampling end up with significant differences. We also show that our method introduces neither positive nor negative bias in the prediction accuracy of a decision tree, used as a representative example of a machine learning method.
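
The abstract does not give the exact definition of the histogram matching metric or the mixed integer programming model, so the following is only a rough sketch of the general idea under stated assumptions. It assumes the distance is the sum, over features, of absolute differences between normalized histograms built on shared bin edges, and it replaces the paper's mixed integer program with a plain best-of-many-random-splits search. The names `histogram_distance` and `best_of_random_splits`, the bin count, and the synthetic data are all hypothetical.

```python
import numpy as np


def histogram_distance(full, subset, bins=10):
    """Assumed metric: per-feature L1 gap between normalized histograms, summed."""
    full = np.asarray(full, dtype=float)
    subset = np.asarray(subset, dtype=float)
    total = 0.0
    for j in range(full.shape[1]):
        # Bin edges come from the full dataset so both histograms are comparable.
        edges = np.histogram_bin_edges(full[:, j], bins=bins)
        h_full, _ = np.histogram(full[:, j], bins=edges)
        h_sub, _ = np.histogram(subset[:, j], bins=edges)
        total += np.abs(h_full / len(full) - h_sub / len(subset)).sum()
    return total


def best_of_random_splits(X, train_size, trials=200, seed=0):
    """Stand-in for the paper's MIP: keep the random split with the lowest score."""
    rng = np.random.default_rng(seed)
    best_split, best_score = None, np.inf
    for _ in range(trials):
        idx = rng.permutation(len(X))
        train_idx, test_idx = idx[:train_size], idx[train_size:]
        # Score a split by how far both subsets drift from the whole dataset.
        score = (histogram_distance(X, X[train_idx])
                 + histogram_distance(X, X[test_idx]))
        if score < best_score:
            best_split, best_score = (train_idx, test_idx), score
    return best_split, best_score


# Synthetic data, used only for illustration (not the dataset from the paper).
X = np.random.default_rng(42).normal(size=(200, 3))
(_, _), single_score = best_of_random_splits(X, train_size=150, trials=1)
(train_idx, test_idx), searched_score = best_of_random_splits(X, train_size=150, trials=50)
print(f"one random split: {single_score:.3f}, best of 50: {searched_score:.3f}")
```

A search over random splits only approximates what an exact optimization model would find, but it illustrates the objective: a lower score means the feature histograms of the training and test subsets sit closer to those of the full dataset.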