Synthetic Data Generation via Cluster-Specific Latent Space Generative Modeling: A Modular and Explainable Framework


Feyiz K., Ozkazanc Y.

IEEE Access, vol.14, pp.73401-73419, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Article
  • Volume: 14
  • Publication Date: 2026
  • Doi Number: 10.1109/access.2026.3692685
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp.73401-73419
  • Keywords: Clustering, explainability, latent representation learning, latent space modeling, modularity, multivariate Gaussian distributions, synthetic data generation
  • Hacettepe University Affiliated: Yes

Abstract

Deep learning increasingly depends on large, diverse, and representative datasets, yet in many practical domains data are scarce, imbalanced, or restricted by privacy and regulation. Synthetic data is therefore an important tool for data augmentation and learning under restricted-access conditions, but many deep generative models suffer from fragile training, sensitive hyper-parameters, and opaque sampling. We propose a modular and explainable latent space framework for synthetic data generation built from four decoupled components that operate sequentially: a Latent Space Representation Learner implemented with standard auto-encoders; a Latent Space Representation Clustering component that uses label-based partitioning for labeled datasets and a proposed unsupervised polarized hierarchical clustering algorithm for unlabeled datasets; a Cluster Distribution Modeler that fits a multivariate Gaussian distribution to each cluster; and a Synthetic Data Generator that operates entirely in the latent space. On top of these models, we propose several generation methods, including controlled random sampling, deterministic and random walks on iso-density ellipsoids, intra-cluster morphing via convex combinations, and mask-based crossover schemes, all of which can optionally be combined with Mahalanobis-distance and chi-square-based accept/reject and back-projection mechanisms. Experiments on three benchmark image datasets, each with labeled and unlabeled variants, show that the framework generates cluster-consistent, distributionally coherent, and visually diverse data while remaining computationally light. Latent space Fréchet and Bhattacharyya distances between the original and generated distributions validate the Gaussian random sampling, while the proposed Cluster Adherence Rate (CAR) provides a compact cluster-consistency metric for the Gaussian-mixture-based generation methods. Together, these metrics further support the suitability of the cluster-specific Gaussian modeling. Overall, the framework offers a practical and explainable foundation for synthetic data generation in data augmentation and restricted-access scenarios.
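
The cluster-wise modeling and sampling steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy two-dimensional latent vectors, the 95% acceptance level, and the covariance regularization term are all assumptions made for illustration, and the auto-encoder (encoder and decoder) is omitted, so the code works directly on latent vectors. The sketch fits one Gaussian per cluster, draws samples with a chi-square-based accept/reject test on the squared Mahalanobis distance, and morphs two accepted samples by a convex combination.

```python
import numpy as np


def fit_cluster_gaussians(latents, labels):
    """Fit one multivariate Gaussian (mean, covariance) per cluster label."""
    models = {}
    for k in np.unique(labels):
        z = latents[labels == k]
        mean = z.mean(axis=0)
        # Small diagonal term (an assumption) keeps the covariance invertible.
        cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(z.shape[1])
        models[int(k)] = (mean, cov)
    return models


def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of one vector or a batch of vectors."""
    diff = z - mean
    return np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)


def sample_with_reject(mean, cov, n, threshold, rng):
    """Draw n latent vectors, keeping only those inside the iso-density
    ellipsoid defined by squared Mahalanobis distance <= threshold."""
    accepted, count = [], 0
    while count < n:
        cand = rng.multivariate_normal(mean, cov, size=n)
        keep = cand[mahalanobis_sq(cand, mean, cov) <= threshold]
        accepted.append(keep)
        count += len(keep)
    return np.concatenate(accepted)[:n]


def morph(z_a, z_b, alpha):
    """Convex combination of two latent vectors, with alpha in [0, 1]."""
    return (1.0 - alpha) * z_a + alpha * z_b


# Toy latent vectors standing in for auto-encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
latents = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
    rng.normal([5.0, 5.0], 0.5, size=(200, 2)),
])
labels = np.repeat([0, 1], 200)

models = fit_cluster_gaussians(latents, labels)

# For 2 degrees of freedom the chi-square quantile has the closed form
# -2 * ln(1 - p); other dimensions need a general chi-square quantile,
# e.g. scipy.stats.chi2.ppf(p, df=d).
threshold = -2.0 * np.log(1.0 - 0.95)

mean, cov = models[1]
samples = sample_with_reject(mean, cov, 100, threshold, rng)
morphed = morph(samples[0], samples[1], 0.5)
```

Because the acceptance region is an ellipsoid, and therefore convex, any convex combination of two accepted samples also lies inside it. In a full pipeline, the accepted and morphed latent vectors would then be passed through the decoder to obtain synthetic data in the original space.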