Make a model that predicts the average bank

Reference no: EM131971457

Bank balances

Ideally, we would choose the best model by enumerating all possible regression models. However, this is computationally heavy when there are many candidate independent variables and many observations. For example, with 10 candidate independent variables, a total of 1,023 (= 2^10 - 1) regression models must be investigated. With 100 variables, 2^100 - 1 (about 1.27 x 10^30) models are possible. Furthermore, if we include interaction terms (e.g. X1*X2) or nonlinear terms (e.g. X1^2, sqrt(X1)), the number of models to enumerate grows even faster. In short, enumeration is ideal but impossible or too expensive in practice.
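The counting argument above can be checked with a short Python sketch. The function names are illustrative only, and the variable list is taken from question 1 below; each non-empty subset of the candidate variables corresponds to one possible regression model.

```python
from itertools import combinations

def count_models(k: int) -> int:
    """Number of non-empty subsets of k candidate variables: 2^k - 1."""
    return 2 ** k - 1

def enumerate_models(variables):
    """Yield every non-empty subset of the candidate variables."""
    for size in range(1, len(variables) + 1):
        yield from combinations(variables, size)

# The five candidate variables in the bank balance case.
candidates = ["Age", "Education", "Income", "Home value", "Wealth"]

print(count_models(len(candidates)))             # 31
print(len(list(enumerate_models(candidates))))   # 31, same count by enumeration
print(count_models(10))                          # 1023
print(count_models(100))                         # 1267650600228229401496703205375
```

With only five variables, enumeration is cheap (31 models); the difficulty appears as the number of candidates grows.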

Consider this bank balance case. The data on the second worksheet, "Banking Data", were acquired from banking and census records for different zip codes in the bank's current market. Such information is useful when advertising for new customers or choosing locations for new branch offices. The data show the median age of the population, years of education, income, home value, and household wealth, together with the average bank balance. Our goal is to build a model that predicts the bank balance with the proper variables.

1) Make a model that predicts the average bank balance with all available independent variables (Age, Education, Income, Home value, and Wealth). Is the model appropriate? Answer with reasons and regression model results.

2) In question 1, you should find that the model is not appropriate because two variables are not significant at α = 5%. How do we make a better, significant model? Do we simply drop all the insignificant variables and make a new model? The answer is "no", simply because we do not know what will happen when one or more variables are dropped or added until we actually try those models. One simple idea is to drop the most insignificant variable and make a new model, then keep repeating this until a significant model is found. This is called backward stepwise regression.

Choose the variable that has the highest t-test p-value in the question 1 model, and make a new model without it. The Analysis ToolPak accepts the X data only as a single range, so you have to copy the data to a new worksheet and remove the chosen variable by deleting its column. Then you can select all X variables as one range in the Analysis ToolPak.

Report whether the model is significant. If it is not, drop the variable that has the highest t-test p-value and make a new model. Repeat this until you find a significant model.
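The repeat-until-significant procedure can be sketched in Python as follows. This is a minimal illustration, not the Analysis ToolPak workflow: `fit_pvalues` is a hypothetical caller-supplied function that refits the regression on the given variables and returns their t-test p-values, and `stub_fit` returns made-up placeholder p-values that are NOT results from the Banking Data.

```python
def backward_eliminate(variables, fit_pvalues, alpha=0.05):
    """Drop the variable with the highest p-value until all are below alpha.

    fit_pvalues is a hypothetical callback: given a list of variable names,
    it refits the regression and returns a dict {name: p_value}.
    """
    current = list(variables)
    while current:
        pvalues = fit_pvalues(current)
        worst = max(current, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:
            break  # every remaining variable is significant
        current.remove(worst)  # drop only ONE variable, then refit
    return current

def stub_fit(names):
    """Placeholder p-values for illustration only (not real regression output)."""
    pvalues = {"X1": 0.01, "X2": 0.40, "X3": 0.20}
    if "X2" not in names:
        pvalues["X3"] = 0.03  # p-values can change once a variable is dropped
    return {name: pvalues[name] for name in names}

print(backward_eliminate(["X1", "X2", "X3"], stub_fit))  # ['X1', 'X3']
```

In the stub, X3 looks insignificant in the first fit but becomes significant after X2 is removed, which is why only one variable is dropped per round.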

3) In multiple linear regression modeling, one issue that has to be carefully considered is multicollinearity. We did not consider this issue in the class.

A multiple linear regression model includes multiple independent variables. Two or more of these variables are sometimes correlated with each other. For example, the diameter and height of a tree may be related. If such correlated variables are used as X variables in a regression model, they contribute redundant information about the dependent variable. This can produce unstable coefficients or an unexpected sign for a coefficient bj. For example, if the diameter and height of trees are included in a regression model to predict the age of trees, that is,

Tree Age = b0 + b1(Diameter) + b2(Height),

we would expect both b1 and b2 to be positive because trees grow as they age. However, when diameter and height are strongly related, b1 or b2 may turn out negative, or the coefficients may fluctuate drastically when new observations or new independent variables are added. Thus, we have to drop one or more of the correlated variables from the model. Note that this is a separate issue from the independence-of-errors assumption of regression modeling: multicollinearity concerns correlation among the X variables, not among the errors.

Then, how do we detect which variables are correlated with each other? There are many techniques to reveal multicollinearity, but the easiest way is to check the correlation. The correlation measures the linear relationship between two variables. Its value is between -1 and 1. The closer the value is to 1, the stronger the positive linear relationship between the two variables. Conversely, the closer the value is to -1, the stronger the negative linear relationship. A value close to zero indicates a weak linear relationship, which is what we want between the independent variables in a regression model.
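As a rough illustration of how the correlation is computed, here is a small Python sketch of the Pearson correlation coefficient. The function name and the toy data are assumptions for demonstration; in the assignment itself the values come from the Correlation tool.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Toy data (not from the Banking Data worksheet).
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfect positive relationship, r = 1
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfect negative relationship, r = -1
```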

Consider this correlation table of the variables. You can easily get it using Correlation in the Analysis ToolPak. Try it if you want.


              Age    Education  Income  Home Value  Wealth  Bank Balance
Age           1
Education     0.17   1
Income        0.48   0.58       1
Home Value    0.39   0.75       0.80    1
Wealth        0.47   0.47       0.95    0.70        1
Bank Balance  0.57   0.55       0.95    0.77        0.95    1

Bank Balance (the dependent variable Y) has a strong linear relationship with most variables. In particular, Income and Wealth have the highest correlations with Bank Balance (0.95 each). This indicates that Income and Wealth would each help explain Bank Balance well.

To avoid problems from multicollinearity in linear regression modeling, the independent variables should have no or only weak correlations (close to zero) with each other. But look at the correlation between Income and Wealth. Those two variables have a very high correlation of 0.95. If both were included in a multiple linear regression model, the model would suffer from the problems discussed above with the tree example.
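The check described here can be sketched in Python using the correlations among the independent variables copied from the table in this question. The 0.9 cutoff and the function name are assumptions chosen for illustration; the assignment does not prescribe a threshold.

```python
# Correlations among the independent variables, copied from the correlation table.
predictor_corr = {
    ("Age", "Education"): 0.17,
    ("Age", "Income"): 0.48,
    ("Education", "Income"): 0.58,
    ("Age", "Home Value"): 0.39,
    ("Education", "Home Value"): 0.75,
    ("Income", "Home Value"): 0.80,
    ("Age", "Wealth"): 0.47,
    ("Education", "Wealth"): 0.47,
    ("Income", "Wealth"): 0.95,
    ("Home Value", "Wealth"): 0.70,
}

def flag_pairs(corr, threshold=0.9):
    """Return predictor pairs whose |correlation| meets the (assumed) threshold,
    strongest first."""
    flagged = [pair for pair, r in corr.items() if abs(r) >= threshold]
    return sorted(flagged, key=lambda pair: -abs(corr[pair]))

print(flag_pairs(predictor_corr))  # [('Income', 'Wealth')]
```

Lowering the threshold flags more pairs, so the cutoff is a judgment call rather than a fixed rule.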

For this reason, we have to find models that exclude Wealth, Income, or both. Try to make linear regression models that do not include Income and Wealth together. Some models are suggested in the next question. Report all your trials and determine whether the models are significant.
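To see how many candidate models remain under this restriction, the following Python sketch lists every non-empty subset of the five X variables that does not contain Income and Wealth together. The function name is illustrative; the sketch only generates variable lists and does not fit any regression.

```python
from itertools import combinations

VARIABLES = ["Age", "Education", "Income", "Home value", "Wealth"]

def candidate_models(variables, conflicting=("Income", "Wealth")):
    """All non-empty variable subsets that do not contain every conflicting variable."""
    models = []
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            # Skip subsets that include Income AND Wealth at the same time.
            if not set(conflicting) <= set(subset):
                models.append(subset)
    return models

models = candidate_models(VARIABLES)
print(len(models))  # 23 (31 subsets in total, minus 8 containing both variables)
for subset in models[:5]:
    print(subset)
```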

4) Summarize all the models you made for Bank Balance by filling in the table below. You do not have to make all the suggested models, and you may try other models. Add more rows if you need them.


Model    Adj. r^2  X variables                          Problems in the model
Model 1  0.944     All variables                        Education and Home value are not significant
Model 2  0.944     Without Home value                   Multicollinearity between Income and Wealth
Model 3            Without Home value, Wealth
Model 4            Without Home value, Income
Model 5            Without Wealth
Model 6            Without Income
Model 7            Without Income and Wealth
Model 8            Without Income, Wealth, Home value
Model 9




5) You have probably made a couple of significant models that do not have multicollinearity. Which model should we choose? As we discussed in class, adding any variable to a model, good or bad, either increases r^2 or leaves it unchanged. A good model has a high r^2, but a high r^2 does not guarantee a good model. As a remedy, we use the adjusted r^2 to compare the quality of linear regression models. Based on the adjusted r^2, choose your best significant model and describe it.
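For reference, the standard adjusted r^2 formula is 1 - (1 - r^2)(n - 1)/(n - k - 1), where n is the number of observations and k the number of X variables. The Python sketch below applies it to made-up numbers; the values of n, k, and r^2 are hypothetical and are not taken from the Banking Data.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted r^2 for a model with k X variables fitted on n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: adding a fourth variable raises r^2 only slightly.
n = 50
smaller = adjusted_r2(0.900, n, k=3)
larger = adjusted_r2(0.901, n, k=4)

print(round(smaller, 4), round(larger, 4))
print(larger < smaller)  # True: the extra variable does not pay for itself
```

Because the penalty grows with k, a variable that barely improves r^2 can lower the adjusted r^2.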

6) In regression modeling, there is a principle known as parsimony: the simpler the model, the better. If you found multiple models that explain the dependent variable at a comparable quality, it is better to choose the simpler model (the one with fewer X variables). Taking the parsimony principle into account, choose your best model, make a detailed report, check the assumptions, and interpret the coefficients. Finally, describe your findings and analysis about bank balances. How can the bank use the model and data?
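One possible way to apply the parsimony principle is sketched below in Python: among models whose adjusted r^2 is within a tolerance of the best one, pick the model with the fewest X variables. The tolerance of 0.01, the function name, and the three candidate models are all hypothetical placeholders, not results from this assignment.

```python
def most_parsimonious(models, tolerance=0.01):
    """Pick the simplest model whose adjusted r^2 is close to the best.

    models: list of (name, adjusted_r2, number_of_x_variables) tuples.
    tolerance: assumed acceptable loss in adjusted r^2 versus the best model.
    """
    best = max(adj for _, adj, _ in models)
    close_enough = [m for m in models if best - m[1] <= tolerance]
    # Fewest variables first; ties broken by the higher adjusted r^2.
    return min(close_enough, key=lambda m: (m[2], -m[1]))

# Hypothetical candidates for illustration only.
candidates = [
    ("Model A", 0.930, 4),
    ("Model B", 0.925, 2),
    ("Model C", 0.880, 1),
]

print(most_parsimonious(candidates))  # ('Model B', 0.925, 2)
```

The rule is only a screening aid: significance of the model and its coefficients, and the absence of multicollinearity, still have to be checked for the chosen model.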





All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd