SciRepID - Scientific Publication Search

UNMASKING FRAUDSTERS: Ensemble Features Selection to Enhance Random Forest Fraud Detection

Akazue, Maureen Ifeanyi; Debekeme, Irene Alamarefa; Edje, Abel Efe; Asuai, Clive; Osame, Ufuoma John

Journal of Computing Theories and Applications • 2023 • Universitas Dian Nuswantoro

Fraud detection is used in various industries, including banking, finance, insurance, and government agencies. The recent increase in fraud attempts makes fraud detection crucial for safeguarding confidential or personal financial information. Many types of fraud exist, including card-not-present fraud, fake merchants, counterfeit checks, and stolen credit cards. An ensemble feature selection technique based on Recursive Feature Elimination (RFE), Information Gain (IG), and Chi-Squared (χ²), used together with the Random Forest algorithm, was proposed for fraud detection and prevention. The objective was to choose the essential features for training the model. The Receiver Operating Characteristic (ROC) Score, Accuracy, F1 Score, and Precision are used to evaluate the model's performance. The findings show that the model can differentiate between fraudulent and legitimate transactions, with an ROC Score of 95.83% and an Accuracy of 99.6%. The F1 Score of 99.6% and Precision of 100% further support the model's ability to detect fraudulent transactions correctly with minimal false positives. The ensemble feature selection technique reduced training time without compromising the model's performance, making it a valuable tool for businesses in preventing fraudulent transactions.
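The abstract does not say how the three selectors are combined. As a minimal sketch, assuming a simple majority vote (an assumption, not the authors' stated method), the combination step could look like the following. The feature names and the `majority_vote_features` helper are hypothetical and serve only as illustration.

```python
from collections import Counter


def majority_vote_features(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` of the selectors.

    `selections` maps a selector name to the list of features it picked.
    Majority voting is an assumed combination rule, not taken from the paper.
    """
    votes = Counter(f for chosen in selections.values() for f in set(chosen))
    # Sort so the result is deterministic regardless of dict ordering.
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical outputs of the three selectors (feature names are invented).
selections = {
    "rfe": ["amount", "hour", "merchant_id"],
    "information_gain": ["amount", "merchant_id", "country"],
    "chi_squared": ["amount", "hour", "device"],
}

selected = majority_vote_features(selections)
print(selected)
```

Raising `min_votes` to 3 would keep only the features on which all three selectors agree, trading coverage for a smaller training set.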

https://doi.org/10.33633/jcta.v1i2.9462


Comparative Analysis of the XGBoost Algorithm and the Random Forest Ensemble Learning Algorithm for Credit Decision Classification

Jan Melvin Ayu Soraya Dachi; Pardomuan Sitompul

Jurnal Riset Rumpun Matematika dan Ilmu Pengetahuan Alam • 2023 • Pusat riset dan Inovasi Nasional

Extending credit always carries risks such as non-performing loans, so creditors (banks) are expected to be more objective and accurate in evaluating each credit application. This study was conducted to determine which algorithm is most accurate in making a credit decision by comparing the XGBoost and Random Forest algorithms. Both algorithms were applied to datasets of 10,000 and 100,000 records with 19 variables relevant to credit card decisions. The research process involved data pre-processing, data splitting, training, parameter tuning with Random Search, testing, and model evaluation with a confusion matrix. The experimental results show that both algorithms produced fairly competitive model performance: XGBoost reached 1.0 on all evaluation metrics for both the 10,000-record and the 100,000-record data. Random Forest achieved an accuracy of 0.998 on the 10,000-record data and 0.999 on the 100,000-record data. However, Random Forest reached an F1-score of only 0.700 on the 10,000-record data. Based on these results, it can be concluded that both algorithms perform very well and accurately in classifying decisions on credit card data. However, Random Forest is less accurate when used on small, imbalanced data.
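A high accuracy alongside a much lower F1-score is typical of imbalanced data, because true negatives dominate accuracy but do not enter F1 at all. The sketch below illustrates this with hypothetical confusion-matrix counts chosen only to land near the reported figures; they are not taken from the paper.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; true negatives are ignored."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical counts for a 10,000-record test set with few positives.
tp, fp, fn = 23, 8, 12
tn = 10_000 - tp - fp - fn

acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```

With only 20 errors in 10,000 records the accuracy stays at 0.998, yet those same 20 errors are large relative to the 23 correctly identified positives, which pulls F1 down to roughly 0.70.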

https://doi.org/10.55606/jurrimipa.v2i2.1470


Dataset and Feature Analysis for Diabetes Mellitus Classification using Random Forest

32 Citations

Mustofa, Fachrul; Safriandono, Achmad Nuruddin; Muslikh, Ahmad Rofiqul; Setiadi, De Rosal Ignatius Moses

Journal of Computing Theories and Applications • 2023 • Universitas Dian Nuswantoro

Diabetes Mellitus is a hazardous disease, and according to the World Health Organization (WHO), diabetes will be one of the main causes of death by 2030. One of the most popular diabetes datasets is PIMA Indians, which has been widely tested with various machine learning (ML) methods and even deep learning (DL). On average, however, ML methods do not achieve good accuracy on it. The quality of the dataset and its features is the most influential factor in this case, so deeper investigation is needed to examine this dataset. This research analyzes and compares the PIMA Indians and Abelvikas datasets using the Random Forest (RF) method. Both datasets are imbalanced; the Abelvikas dataset is more imbalanced and has a larger number of classes, so it is more complex. RF was chosen because it is one of the ML methods with the best results on various diabetes datasets. The tests produced very contrasting results on the two datasets. Abelvikas reached 100% for accuracy, precision, and recall, while PIMA Indians achieved at best only 75% accuracy, 87% precision, and 80% recall. Testing was done with 3, 5, 7, 10, and 15 trees, and k-fold validation was also used to obtain valid results. This indicates that the features in the Abelvikas dataset are much better because they are supported by more complete glucose features.
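The abstract mentions k-fold validation without giving the value of k or the splitting details. As a minimal sketch, assuming contiguous, unshuffled folds (an assumption, not the authors' setup), the index split could be written as follows; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Fold sizes differ by at most one when n_samples is not divisible by k.
    No shuffling or stratification is applied in this sketch.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Illustrative sizes only: 10 samples split into 5 folds.
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("test:", test, "train:", train)
```

Each tree count (3, 5, 7, 10, 15) would then be trained and scored once per fold, and the per-fold scores averaged. For imbalanced data such as these two datasets, a stratified split that preserves class proportions in every fold is usually the safer choice.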

https://doi.org/10.33633/jcta.v1i1.9190
