SciRepID - Benchmarking Machine Learning Models for Large-Scale Loan Default Prediction Using Real Data


Journal of Information Technology and Computer Science
International Forum of Researchers and Lecturers (IFREL)

📄 Abstract

This research benchmarks multiple machine learning (ML) algorithms for large-scale loan default prediction using a real-world dataset of 255,000 borrower records, in which default cases represent only ~9–12% of observations. The study addresses the persistent gap in comparative analyses of ML models that balance predictive accuracy, interpretability, and computational efficiency for credit risk assessment. Six algorithmic families (Logistic Regression, Random Forest, XGBoost, LightGBM, CatBoost, and Artificial Neural Networks (ANN)) and a Stacked Ensemble were evaluated using standardized preprocessing, hybrid imbalance handling (SMOTE, class weighting, under-sampling), and comprehensive evaluation metrics (AUC, F1, Recall, Precision, PR-AUC, and Brier Score). Empirical results show that Logistic Regression achieved the highest AUC (0.732), outperforming the nonlinear models under the baseline configuration, while LightGBM attained perfect recall (1.0) but low precision (0.116), indicating over-prediction of defaults. Gradient boosting models demonstrated robust calibration (Brier ≈ 0.114–0.116) and the best computational efficiency, with LightGBM showing the fastest training and lowest memory use. CatBoost exhibited strong recall but the slowest computation, and the ANN underperformed on tabular data (AUC ≈ 0.56). The Stacked Ensemble delivered balanced results (AUC = 0.664) and improved overall stability. These findings indicate that boosting-based models, particularly LightGBM and CatBoost, offer superior scalability and calibration, whereas Logistic Regression remains a valuable interpretable baseline. The study concludes that effective default prediction requires integrating rebalancing, calibration, and threshold optimization to improve recall and the reliability of operational deployment in large-scale credit ecosystems.
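The following minimal sketch illustrates, on synthetic data only, why perfect recall paired with low precision signals over-prediction of defaults, and how a simple threshold sweep can be used for the threshold optimization the abstract mentions. It is not the authors' pipeline: the 11.6% base rate, the score distributions, and all function names are assumptions chosen for illustration. It uses only the Python standard library.

```python
import random

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def best_f1_threshold(y_true, y_prob, thresholds):
    """Return (threshold, f1) maximizing F1 over the candidate thresholds."""
    best_th, best_f1 = None, -1.0
    for th in thresholds:
        y_pred = [1 if p >= th else 0 for p in y_prob]
        _, _, f1 = precision_recall_f1(y_true, y_pred)
        if f1 > best_f1:
            best_th, best_f1 = th, f1
    return best_th, best_f1

# Synthetic, hypothetical data: roughly 11.6% defaulters (assumed base rate).
rng = random.Random(0)
n = 10_000
y_true = [1 if rng.random() < 0.116 else 0 for _ in range(n)]

# Hypothetical scores: defaulters tend to score higher; clipped to [0, 1].
y_prob = [min(1.0, max(0.0, rng.gauss(0.25 if t else 0.10, 0.08)))
          for t in y_true]

# Flagging every borrower as a defaulter: recall is 1.0 by construction,
# and precision collapses to the observed default rate.
observed_rate = sum(y_true) / n
p_all, r_all, f1_all = precision_recall_f1(y_true, [1] * n)

# Reference point: a constant forecast equal to the observed rate has a
# Brier score of rate * (1 - rate).
brier_constant = brier_score(y_true, [observed_rate] * n)

# Sweep thresholds 0.00, 0.01, ..., 1.00 and keep the one with the best F1.
thresholds = [i / 100 for i in range(101)]
best_th, best_f1 = best_f1_threshold(y_true, y_prob, thresholds)

print(f"all-default: precision={p_all:.3f} recall={r_all:.3f} f1={f1_all:.3f}")
print(f"constant-forecast Brier={brier_constant:.4f}")
print(f"best threshold={best_th:.2f} f1={best_f1:.3f}")
```

In this sketch a classifier that flags everyone reaches recall 1.0 while its precision equals the default rate, which is the pattern reported for LightGBM in the abstract (precision 0.116 against a default share of ~9–12%). F1 is only one possible objective for the sweep; a lender would more plausibly weight missed defaults and false alarms by their respective costs.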

🔖 Keywords

loan default prediction; machine learning; LightGBM; benchmarking; credit risk analytics

ℹ️ Publication Information

Publication Date
08 March 2026
Volume / Number / Year
Volume 2, Number 1, 2026

📝 HOW TO CITE

Devianto, Yudo; Saragih, Rusmin; Cahyana, Yana, "Benchmarking Machine Learning Models for Large-Scale Loan Default Prediction Using Real Data," Journal of Information Technology and Computer Science, vol. 2, no. 1, Mar. 2026.

