Comparative Study of Logistic Regression, Random Forest, and XGBoost for Bank Loan Approval Classification
Bank loan approval plays a vital role in ensuring financial institutions can minimize credit risk while supporting economic growth. Default prediction is a crucial aspect of banking credit risk management. This study compares three machine learning algorithms Logistic Regression, Random Forest, and...
Saved in:
| Published in: | Journal of Applied Informatics and Computing Vol. 9; no. 5; pp. 2822 - 2835 |
|---|---|
| Main Authors: | , |
| Format: | Journal Article |
| Language: | English Indonesian |
| Published: |
Politeknik Negeri Batam
19.10.2025
|
| Subjects: | |
| ISSN: | 2548-6861, 2548-6861 |
| Online Access: | Get full text |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Summary: | Bank loan approval plays a vital role in ensuring financial institutions can minimize credit risk while supporting economic growth. Default prediction is a crucial aspect of banking credit risk management. This study compares three machine learning algorithms Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost) to classify bank loan approvals using a combination of application, previous application, and bureau datasets. The workflow includes data merging, cleaning, missing value imputation, handling unknown values, feature engineering (such as converting day-based variables into years, calculating total submitted documents, income-to-annuity ratio, and employment-to-income ratio), encoding (label and one-hot), scaling (min-max normalization), feature selection based on correlation analysis, handling class imbalance with SMOTE, as well as modeling and evaluation using Accuracy, Precision, Recall, F1-score, and AUC. The results show that Logistic Regression yields the highest AUC of 0.741498, outperforming Random Forest (0.713758) and XGBoost (0.715944). From a business perspective, implementing the best model reduced the Loss Given Default (LGD) by 39.77 %, from $1,705,098,055.50 to $1,026,944,185.50. This finding confirms that simpler models remain competitive on imbalanced datasets when supported by appropriate preprocessing and balancing strategies.
Bank loan approval plays a vital role in ensuring financial institutions can minimize credit risk while supporting economic growth. Default prediction is a crucial aspect of banking credit risk management. This study compares three machine learning algorithms Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost) to classify bank loan approvals using a combination of application, previous application, and bureau datasets. The workflow includes data merging, cleaning, missing value imputation, handling unknown values, feature engineering (such as converting day-based variables into years, calculating total submitted documents, income-to-annuity ratio, and employment-to-income ratio), encoding (label and one-hot), scaling (min-max normalization), feature selection based on correlation analysis, handling class imbalance with SMOTE, as well as modeling and evaluation using Accuracy, Precision, Recall, F1-score, and AUC. The results show that Logistic Regression yields the highest AUC of 0.741498, outperforming Random Forest (0.713758) and XGBoost (0.715944). From a business perspective, implementing the best model reduced the Loss Given Default (LGD) by 39.77 %, from $1,705,098,055.50 to $1,026,944,185.50. This finding confirms that simpler models remain competitive on imbalanced datasets when supported by appropriate preprocessing and balancing strategies. |
|---|---|
| ISSN: | 2548-6861 2548-6861 |
| DOI: | 10.30871/jaic.v9i5.10862 |