FNCE30012 Foundations of FinTech Assignment - The University of Melbourne, Australia
Important: Do not change the type (markdown vs. code) of any cell, and do not copy, paste, or duplicate any cell. If the cell type is markdown, you are supposed to write text, not code, and vice versa. Provide your answer to each question in the allocated cell. Do not create additional cells. Answers provided in any other cell will not be marked. Do not rename the assignment files. All files in the assignment directory should be left as they are.
Setting - Equifax Australia has provided us with synthetic loan application data from Australian proprietary companies. This data was generated to match the characteristics of actual lending proposals approved between February 2017 and March 2018. The Equifax data consists of two parts, which, to make it easier for you, we have merged together into one data set:
1. Company Business Trading History Data: This first part of the data set contains historical business trading data from 25,000 Australian proprietary companies that were granted a loan between February 2017 and March 2018.
2. Director Data: This second part of the data set contains information on up to four directors of each company. If a company has more than one director, the corresponding data has been averaged across directors at the company level.
Since this is proprietary data that belongs to Equifax, we are not allowed to give you direct access to it. However, thanks to Jupyter Hub, you are able to access it remotely. In particular, using your knowledge from Tutorial 9, you are able to analyse it at an aggregate level and to use it for the estimation of credit scoring models.
The file called Equifax_Data_Dictionary.xlsx provides you with the dictionary for both company and director level data.
Question 1 - Write Python code that creates two bar plots of average default rates (Commercial_GBF_12m) depending on (i) whether a company was under external administration (External_Admin) or (ii) had filed petitions (Petitions). Make sure your plots' axes are appropriately labelled.
Question 2 - Write Python code that creates two plots of average default rates (Commercial_GBF_12m) as a function of (i) the number of months since a director's last commercial default (ny7589_df_time_1) and (ii) the frequency of adverse commercial events over four years 48 months prior to application (ny7601_adv_48_84m). Make sure your plots' axes are appropriately labelled.
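The plots in Questions 1 and 2 both rest on the same aggregation: the mean of the 0/1 default flag within each value of a grouping variable. The following is a minimal sketch of that step only, assuming the data were available as a plain list of records. The helper name `average_default_rate` and the toy records are hypothetical and invented for illustration; the actual assignment data is only reachable through the remote Jupyter Hub functions, whose interface is not shown here.

```python
from collections import defaultdict


def average_default_rate(records, group_key, target_key="Commercial_GBF_12m"):
    """Return {group value: mean of the 0/1 default flag} for the given records."""
    totals = defaultdict(lambda: [0, 0])  # group value -> [number of defaults, number of rows]
    for row in records:
        bucket = totals[row[group_key]]
        bucket[0] += row[target_key]
        bucket[1] += 1
    return {group: defaults / count
            for group, (defaults, count) in sorted(totals.items())}


# Toy records, invented purely for illustration (NOT Equifax data).
toy_records = [
    {"External_Admin": 0, "Commercial_GBF_12m": 0},
    {"External_Admin": 0, "Commercial_GBF_12m": 0},
    {"External_Admin": 0, "Commercial_GBF_12m": 1},
    {"External_Admin": 0, "Commercial_GBF_12m": 0},
    {"External_Admin": 1, "Commercial_GBF_12m": 1},
    {"External_Admin": 1, "Commercial_GBF_12m": 0},
]

rates = average_default_rate(toy_records, "External_Admin")
print(rates)  # {0: 0.25, 1: 0.5}
```

The resulting dictionary maps each group to its average default rate, so its keys and values can be passed to a bar-plot or line-plot routine, with the grouping variable on the horizontal axis and the average default rate on the vertical axis.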
Question 3 - How do you interpret the above plots from Questions 1 and 2? What is your conclusion?
Question 4 - Run a full-fledged logistic regression model without any ex-ante feature selection. Based on the estimation output, select and report all features that are significant at the 5%-level (or below).
Note: To increase the stability of the estimation, Python will automatically omit certain variables.
Question 5 - Run a logit model using the function send_logit_request() and applying the following specifications:
1. Relative size of test data: 20%
2. Only use the features from Question 4 with a significance level below 5%
3. Scaling: "True"
Evaluate the testing performance of your logit model.
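One possible reading of "testing performance" is a set of standard classification metrics computed on the held-out 20% of the data. The sketch below assumes that true labels and predicted labels are available as two lists of 0/1 values; the output format of send_logit_request() is not described in this brief, so the function names and toy labels here are hypothetical and serve only to illustrate the metrics.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as 'default'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary 0/1 labels."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy test-set labels, invented for illustration (NOT model output).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because defaults are typically rare, accuracy alone can look good for a model that never predicts default, which is why precision and recall (or the ROC curve used in Question 6) are usually reported alongside it.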
Question 6 - Write Python code that estimates a series of full-fledged neural networks with the following specifications:
1. Number of layers: 1
2. Number of units: 2, 4, 16, 64, 256
3. Relative size of test data: 20%
4. Scaling: "True"
Generate one plot that shows each model's ROC ("roc"), both for testing and training. What is your conclusion?
Question 7 - Based on the testing performance of the above five neural network models, which one would you pick and why? Rerun the estimation of your chosen model.
Question 8 - Conduct an in-depth comparison between the "simple" logit model (Question 5) and your preferred neural network (Question 7). What are their respective potential advantages and disadvantages? If you were to run a credit scoring agency, which type of model do you think your clients would prefer?
Question 9 - The average loan amount across the Equifax sample is $75,000. Furthermore, let us make the following simplifying assumptions:
1. The interest rate charged for each loan under the simple logit model (Question 5) is 5% p.a.
2. Each loan has a duration of one year
3. If a loan defaults, the total amount is lost (zero recovery) and no interest payments occur
4. A loan application is granted only if the respective model predicts no default
5. Each granted loan generates administrative costs of 1% p.a.
You are running a business that makes loans of $75,000 to small companies. Based on the above testing data, what is the interest rate implied by your chosen neural network (Question 7), such that you will generate the same net income as under the simple logit model?
Note: For your calculations, you can neglect any time value of money effects.
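The break-even calculation in Question 9 can be sketched as follows, under one reading of the assumptions: granted loans are those the model predicts as non-defaults; granted loans that repay earn interest, granted loans that default lose the full principal, and every granted loan incurs the 1% administrative cost. The function names and the confusion-matrix counts below are hypothetical, chosen only to illustrate the arithmetic; the real counts come from the test-set results of Questions 5 and 7.

```python
import math

LOAN_AMOUNT = 75_000  # average loan amount from the brief
ADMIN_COST = 0.01     # administrative cost per granted loan, p.a.
LOGIT_RATE = 0.05     # interest rate under the simple logit model, p.a.


def net_income(repaid, defaulted, rate, loan=LOAN_AMOUNT, admin=ADMIN_COST):
    """Net income from granted loans: interest minus lost principal minus admin costs.

    repaid    -- granted loans that did not default (true negatives)
    defaulted -- granted loans that defaulted (false negatives)
    """
    granted = repaid + defaulted
    interest = repaid * loan * rate   # only repaid loans pay interest
    losses = defaulted * loan         # zero recovery on defaulted loans
    costs = granted * loan * admin    # every granted loan incurs admin costs
    return interest - losses - costs


def break_even_rate(target_income, repaid, defaulted, loan=LOAN_AMOUNT, admin=ADMIN_COST):
    """Interest rate at which net_income(repaid, defaulted, rate) equals target_income."""
    granted = repaid + defaulted
    return (target_income + defaulted * loan + granted * loan * admin) / (repaid * loan)


# Hypothetical test-set counts, invented for illustration only.
logit_repaid, logit_defaulted = 980, 20
nn_repaid, nn_defaulted = 985, 15

logit_income = net_income(logit_repaid, logit_defaulted, LOGIT_RATE)
nn_rate = break_even_rate(logit_income, nn_repaid, nn_defaulted)
print(f"Logit net income: {logit_income:,.0f}")
print(f"Break-even neural network rate: {nn_rate:.4%}")
```

With these invented counts the neural network grants fewer loans that later default, so it can charge a lower rate than 5% and still match the logit model's net income. Time value of money is ignored, as the note above permits.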
Question 10 - When you go through the dictionary provided in the file Equifax_Data_Dictionary.xlsx, you will notice that Equifax uses primarily legal data rather than accounting data to predict defaults. Why do you think that is?
Question 11 - Discuss the pros and cons of using deep learning, i.e., hierarchical machine learning applied to big data, in the context of credit scoring.