Accurately predicting whether a loan will be repaid (credit scoring) is an important task for any bank. Consistent accuracy benefits both banks and loan applicants.
Using a data mining technique such as logistic regression, we can analyze the data further, in particular the role each variable plays in determining whether an individual has good credit or bad credit.
After fitting a logistic regression model, we refer to the confusion matrix report to determine the cost/gain matrix for the validation data.
This information allows the net profit to be calculated, which in turn supports a thorough analysis of the data. The results show how performance can be improved and which decisions to make when scoring future applicants for credit.
Purpose
The German Credit data set contains 1,000 cases and 30 variables. The outcome variable records whether each customer's credit is deemed good or bad.
Each applicant was rated as "good credit" (700 cases) or "bad credit" (300 cases). The purpose of this report is to present the results of analyzing the German Credit case using GermanCredit.xls, the data set for this case.
Analysis
We ran a logistic regression model to determine the odds associated with each predictor, in order to identify the predictor variables that had the most impact on the output variable Y.
We first examined the p-values of the predictors. We then compared the odds of the different predictors, so that we could see immediately which predictors have the most impact (given that the other predictors are accounted for) and which have the least.
The five predictors with the highest significance were:
• FOREIGN WORKER (0 = No, 1 = Yes)
• GUARANTOR (0 = No, 1 = Yes)
• USED_CAR = Purpose of credit (used car: 0 = No, 1 = Yes)
• CHK_ACCT = Checking account status, categorical (1: 0 ≤ ... < 200 DM, 2: ≥ 200 DM, 3: No Checking Account) - DM = Deutsche Mark
• MALE_SINGLE (0 = No, 1 = Yes)
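The comparison of odds described above can be sketched as follows. This is a minimal illustration, not the report's actual output: the predictor names and coefficient values are hypothetical placeholders, and ranking by the absolute coefficient is only a reasonable shortcut when the predictors are on comparable scales, such as 0/1 dummy variables.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT the
# fitted values from the German Credit model in this report.
coefficients = {"PREDICTOR_A": 1.2, "PREDICTOR_B": -0.4, "PREDICTOR_C": 0.1}


def odds_ratios(coefs):
    """Convert logistic regression coefficients to odds ratios: exp(coefficient)."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


def rank_by_impact(coefs):
    """Order predictors by |coefficient|, i.e. by how far the odds ratio is from 1."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)


ratios = odds_ratios(coefficients)
ranking = rank_by_impact(coefficients)

for name in ranking:
    print(f"{name}: odds ratio = {ratios[name]:.3f}")
```

An odds ratio above 1 raises the odds of the success class when the predictor increases by one unit, holding the other predictors fixed; a ratio below 1 lowers them.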
Next, we will discuss the confusion matrix. In the training set, we see 387 records belonging to the Success Class that were correctly assigned to that class, while 42 records belonging to the Success Class were incorrectly assigned to the Failure Class.
A total of 90 records belonging to the Failure Class were correctly assigned to that class, while 81 records belonging to the Failure Class were incorrectly assigned to the Success Class. The total number of misclassified records was 123 (42+81) out of 600, an error rate of 20.5%.
In the validation set, 250 records were correctly classified as belonging to the Success Class, while 21 records were incorrectly assigned to the Failure Class. A total of 55 records were correctly classified as belonging to the Failure Class, and 74 records were incorrectly classified as belonging to the Success Class when they were members of the Failure Class. This resulted in a total classification error of 23.75% (95 of 400 records).
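The error rates quoted above follow directly from the four confusion matrix counts. A short sketch of the arithmetic, using the counts reported in the text (the function and argument names are our own):

```python
def error_rate(tp, fn, tn, fp):
    """Share of misclassified records: (false negatives + false positives) / total."""
    total = tp + fn + tn + fp
    return (fn + fp) / total


# Counts taken from the confusion matrices described in the text.
# "Positive" here means the Success Class (good credit).
training = error_rate(tp=387, fn=42, tn=90, fp=81)
validation = error_rate(tp=250, fn=21, tn=55, fp=74)

print(f"Training error:   {training:.2%}")    # 20.50%
print(f"Validation error: {validation:.2%}")  # 23.75%
```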
Next, we calculated the net profit of extending credit.
The consequences of misclassification have been assessed as follows. The cost of a false positive (incorrectly saying that an applicant is a good credit risk) outweighs the benefit of a true positive (correctly saying that an applicant is a good credit risk) by a factor of 5. We calculated the net profit of true positives and the losses from false positives, and from these the cumulative net profit, using the following steps.
We sorted the validation data on "predicted probability of success." For each case, we calculated the net profit of extending credit, and then added another column for cumulative net profit. We calculated the net profit with the formula =IF(OR(AND(B15=1,B15=C15),(AND(B15=0,B15=C15))),100,-500), where column B is the predicted class and column C is the actual class.
As written, this formula returns 100 when the predicted class matches the actual class and -500 when it does not. The intent is to assign 100 to a true positive and -500 to a false positive (incorrectly saying that an applicant is a good credit risk). We found the maximum cumulative net profit to be $8,700, reached at the 33.75th percentile; that is, we have to go 33.75% of the way through the sorted data before the maximum profit is reached. Lastly, we calculated the cumulative net profit to be -$457,400. Companies need to be mindful of misclassification errors in order to avoid high losses.
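The spreadsheet steps above (sort, per-case net profit, cumulative column, peak) can be sketched in a few lines. This is an illustration under stated assumptions: the records are hypothetical, the sort is assumed to be descending by predicted probability, and the per-case rule mirrors the spreadsheet formula as written (100 when predicted and actual classes agree, -500 otherwise) rather than any other cost scheme.

```python
from itertools import accumulate

GAIN_IF_MATCH = 100      # value the spreadsheet formula returns when predicted == actual
LOSS_IF_MISMATCH = -500  # value it returns otherwise


def cumulative_net_profit(records):
    """records: (predicted probability of success, predicted class, actual class).

    Sorts by probability, highest first (an assumption), and returns the
    running total of per-case net profit.
    """
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    profits = [
        GAIN_IF_MATCH if predicted == actual else LOSS_IF_MISMATCH
        for _, predicted, actual in ordered
    ]
    return list(accumulate(profits))


def peak(cumulative):
    """Return the maximum cumulative profit and the share of records needed to reach it."""
    best = max(cumulative)
    position = cumulative.index(best) + 1  # 1-based count of records processed
    return best, position / len(cumulative)


# Hypothetical validation records, for illustration only.
records = [
    (0.95, 1, 1), (0.90, 1, 1), (0.80, 1, 1), (0.70, 1, 0),
    (0.60, 1, 1), (0.40, 0, 0), (0.30, 0, 1), (0.10, 0, 0),
]

cumulative = cumulative_net_profit(records)
best, share = peak(cumulative)
print(cumulative)   # [100, 200, 300, -200, -100, 0, -500, -400]
print(best, share)  # 300 0.375
```

With the report's data, the same calculation would locate the $8,700 maximum at 33.75% of the sorted validation records.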
Conclusions
If this logistic regression model is used to score future applicants, the probability-of-success cutoff for extending credit should be increased from 0.5 to 0.75. Using this model can help businesses predict the creditworthiness of consumers, reducing classification errors and, in turn, costs. Lastly, by identifying the patterns behind low credit scores, it can help businesses work with consumers to improve their scores.
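The effect of raising the cutoff can be illustrated with a small sketch. The probabilities below are hypothetical, and treating a probability exactly equal to the cutoff as a success is an assumption.

```python
def classify(probabilities, cutoff=0.5):
    """Assign class 1 (extend credit) when the predicted probability meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


# Hypothetical predicted probabilities of success for five applicants.
probabilities = [0.92, 0.78, 0.74, 0.61, 0.49]

at_050 = classify(probabilities, cutoff=0.5)
at_075 = classify(probabilities, cutoff=0.75)

print(at_050)  # [1, 1, 1, 1, 0]
print(at_075)  # [1, 1, 0, 0, 0]
```

A higher cutoff extends credit to fewer applicants, trading some true positives for fewer of the more expensive false positives.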