Abstract
Methods: Three clustering algorithms are compared: k-prototype, simple k-medoids, and Clustering of Mixed Numerical and Categorical Data with Missing Values (k-CMM). For the first two algorithms, imputation is conducted separately as a pre-processing step, while k-CMM has an integrated imputation process. Both imputation approaches are tree-based. Cluster evaluation is based on internal and external criteria. Clusters produced by k-prototype and simple k-medoids are selected using internal validity indices and compared with those of k-CMM using external validity indices for several numbers of clusters (k = 3, 4, 5).
Result: Data exploration shows that the IPD dataset of Bogor Regency, West Java, Indonesia, contains approximately 5% outliers and six missing values among the selected variables. Tree-based imputation is applied separately for k-prototype and simple k-medoids, and jointly with clustering in k-CMM. The elbow and gap statistic methods indicate that the optimum number of clusters is k = 3. The internal validity indices computed for k-prototype and simple k-medoids likewise indicate that three clusters (k = 3) are optimal. Trials with several numbers of clusters (k = 3, 4, 5) for the three algorithms, evaluated with external validity indices, show that k-prototype with k = 3 performs best and is the most stable of the three algorithms on the IPD dataset, which contains many outliers.
Novelty: This research addresses issues commonly found in mixed datasets, including outliers and missing values, and how to treat these problems before and during cluster analysis. An improvement of the Gower distance is applied in the medoid-based clustering algorithm. The k-CMM algorithm is the first algorithm to integrate the imputation process with cluster analysis, which makes its performance in cluster analysis worth exploring.
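For readers unfamiliar with distances on mixed data, the following is a minimal sketch of the standard Gower distance between two records, assuming no missing values. It is not the improved variant used in this research, and the function name, argument names, and example records are hypothetical. Numerical variables contribute a range-normalized absolute difference, categorical variables contribute a simple mismatch (0 or 1), and the contributions are averaged.

```python
from typing import List, Sequence, Union

Value = Union[float, str]


def gower_distance(
    a: Sequence[Value],
    b: Sequence[Value],
    is_numeric: Sequence[bool],
    ranges: Sequence[float],
) -> float:
    """Standard Gower distance between two complete records (illustrative only).

    ranges[i] is the observed range (max - min) of variable i over the dataset;
    it is ignored for categorical variables.
    """
    total = 0.0
    for x, y, numeric, r in zip(a, b, is_numeric, ranges):
        if numeric:
            # Range-normalized absolute difference, in [0, 1];
            # a constant variable (range 0) contributes nothing.
            total += abs(x - y) / r if r > 0 else 0.0
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            total += 0.0 if x == y else 1.0
    return total / len(a)


# Hypothetical records: one numerical and one categorical variable.
village_a: List[Value] = [10.0, "paved"]
village_b: List[Value] = [30.0, "unpaved"]
d = gower_distance(village_a, village_b, [True, False], [40.0, 1.0])
print(d)  # (20/40 + 1) / 2 = 0.75
```

Because every variable contributes a value between 0 and 1, the resulting distance also lies between 0 and 1, which is what makes it usable as the dissimilarity in a medoid-based algorithm on mixed data.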