Comparative Study of Logistic Regression, Random Forest, and XGBoost for Bank Loan Approval Classification


Bibliographic Details
Published in: Journal of Applied Informatics and Computing, Vol. 9, no. 5, pp. 2822-2835
Main Authors: Putra, Hamdika, Rumini, Rumini
Format: Journal Article
Language: English, Indonesian
Published: Politeknik Negeri Batam 19.10.2025
ISSN: 2548-6861
Description
Summary: Bank loan approval plays a vital role in ensuring that financial institutions can minimize credit risk while supporting economic growth, and default prediction is a crucial aspect of banking credit risk management. This study compares three machine learning algorithms, Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost), for classifying bank loan approvals using a combination of application, previous application, and bureau datasets. The workflow includes data merging, cleaning, missing value imputation, handling of unknown values, feature engineering (such as converting day-based variables into years and calculating the total number of submitted documents, the income-to-annuity ratio, and the employment-to-income ratio), encoding (label and one-hot), scaling (min-max normalization), feature selection based on correlation analysis, handling of class imbalance with SMOTE, and modeling and evaluation using Accuracy, Precision, Recall, F1-score, and AUC. The results show that Logistic Regression yields the highest AUC of 0.741498, outperforming Random Forest (0.713758) and XGBoost (0.715944). From a business perspective, implementing the best model reduced the Loss Given Default (LGD) by 39.77%, from $1,705,098,055.50 to $1,026,944,185.50. This finding confirms that simpler models remain competitive on imbalanced datasets when supported by appropriate preprocessing and balancing strategies.
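The headline figures in the summary can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the article: the variable names and the `relative_reduction` helper are hypothetical, and only the numbers are copied from the summary. It selects the model with the highest reported AUC and recomputes the percentage drop in loss.

```python
# Figures copied from the summary; all names here are illustrative assumptions.
reported_auc = {
    "Logistic Regression": 0.741498,
    "Random Forest": 0.713758,
    "XGBoost": 0.715944,
}


def relative_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Model with the highest reported AUC.
best_model = max(reported_auc, key=reported_auc.get)

# Loss figures before and after applying the best model, as reported.
loss_before = 1_705_098_055.50
loss_after = 1_026_944_185.50
reduction_pct = relative_reduction(loss_before, loss_after)

print(best_model)               # Logistic Regression
print(f"{reduction_pct:.2f}%")  # 39.77%
```

The recomputed reduction agrees with the 39.77% stated in the summary, and the AUC gap between Logistic Regression and the two tree-based models is roughly 0.026 to 0.028.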
DOI: 10.30871/jaic.v9i5.10862