Introductory Econometrics
Data Analysis Project
Introduction
One of the hypotheses widely discussed in the environmental economics literature is the Environmental Kuznets Curve (EKC). It states that the relationship between a country's national income and the extent of environmental degradation follows an inverted U-shape. That is, environmental degradation increases with national income at a diminishing rate and starts decreasing once national income rises beyond a certain level. In this project, we will test the EKC hypothesis empirically using data from the World Bank.
Data collection
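To do this project, you need to download the following data from the World Bank's (WB's) World Development Indicators (WDI) website:
CO2 emissions (metric tons per capita)
GDP per capita (constant 2015 US$)
Population density (people per sq. km of land area)
Urban population (% of total population)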
Please follow the steps below to download these data from the WB's website:
Expand the "Country" tab on the left-hand side of the website and choose all countries. To do this, select "Countries" out of the three options, then select all countries by ticking the box on the next line. You should see that you have selected 217 countries. (see Image 1 at the end of this document)
Expand the "Series" tab and search for the required data series by the WB indicator name or data code listed above. Go through the search results and tick the box next to each intended variable (pay attention to the unit of measurement as well). (see Image 2)
Move to the "Time" tab and select "2020" by ticking the box next to it. (see Image 3)
Click "Apply Changes" on the right-hand side of the website. (see Image 3)
Under "Download options," choose "Advanced Options". (see Image 3)
In the popup window, select "Names only" under the "Variable format:" option. (see Image 4)
Click "Download" and save the file to your local drive.
Data Cleaning/Formatting
Before analyzing the data in R/RStudio, you need to follow several steps to clean and rearrange them.
First, open the data downloaded from WDI in Excel. You will notice that the data are arranged as follows.
Field/variable names appear on row 1 of the worksheet "Data" and the actual data are stored on rows 2-869. Scrolling down the sheet, you will find the following text in rows 873 and 874:
Data from database: World Development
Please delete these two lines and save the Excel file under the same name. (see Image 5)
Next, we need to convert the data from a long form (data on 4 variables from 217 countries stacked vertically in one column) into a wide form (a table in which the first column stores the country name and each subsequent column stores the data on one variable). There are many ways to perform this transformation, but one possible (and probably the easiest) way is to execute the following in R/RStudio:
dat = readxl::read_excel("[path]/Data_Extract_From_World_Development_Indicators.xlsx", sheet = "Data")
datw = tidyr::spread(dat, "Series Name", "2020")
We are familiar with the first line, which reads the specified Excel spreadsheet into the workspace (you might need to adjust the file path). The second command converts the data from a long form into a wide form and saves the new data as "datw." The new data matrix "datw" should contain the country name in the first column and the data on the four variables in columns 2-5. (Please check this in R/RStudio.)
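To see what the long-to-wide conversion does, it may help to try it on a small made-up table first. The sketch below assumes the tidyr package is installed; the data frame "toy_long" and its column names (country, series, value) are hypothetical and are not part of the WDI file. It also shows pivot_wider(), the newer tidyr function that supersedes spread(), as a possible alternative.

```r
library(tidyr)

# Toy long-format data; all names and values are made up for illustration.
toy_long = data.frame(
  country = rep(c("A", "B"), each = 2),
  series  = rep(c("CO2", "GDPpc"), times = 2),
  value   = c("1.5", "2000", "..", "5000"),
  stringsAsFactors = FALSE
)

# spread(): creates one column for each distinct entry of `series`
toy_wide = spread(toy_long, key = series, value = value)

# pivot_wider(): the newer tidyr interface for the same reshaping
toy_wide2 = pivot_wider(toy_long, names_from = series, values_from = value)

print(toy_wide)
```

Each country now occupies a single row, with one column per series. The values are still stored as text at this stage, which is why the type conversion described later is needed.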
We also want to shorten the variable names so that they are easier to handle. We can try:
datw = dplyr::rename(datw,
  CO2 = "CO2 emissions (metric tons per capita)",
  GDPpc = "GDP per capita (constant 2015 US$)",
  PopDen = "Population density (people per sq. km of land area)",
  UrbPop = "Urban population (% of total population)")
Now, a new data matrix "datw" contains the country name in the first column and the data on four variables (CO2, GDPpc, PopDen, and UrbPop) in columns 2-5.
Next, we want to convert missing values from ".." into "NA" and eliminate them from the dataset. This can be done by:
datw[datw==".."] = NA
datw = na.omit(datw)
The first line changes ".." into "NA" (the default value for missing observations in R). The second line eliminates these missing observations from "datw."
Finally, we need to change the data type from character to numerical for the four variables. This can be done by:
class(datw$CO2) = "double"
class(datw$GDPpc) = "double"
class(datw$PopDen) = "double"
class(datw$UrbPop) = "double"
Now, we are ready to analyze the data.
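The cleaning steps above can be checked on a toy data frame before running them on the real data. This is only a sketch: the data frame "toy" and its values are made up, and as.numeric() is used here as a common alternative to setting class(), which is an assumption of this example rather than the method given above.

```r
# Toy wide-format data; all values are made up for illustration.
toy = data.frame(
  Country = c("A", "B", "C"),
  CO2     = c("1.5", "..", "4.2"),
  GDPpc   = c("2000", "5000", "30000"),
  stringsAsFactors = FALSE
)

toy[toy == ".."] = NA   # mark the WDI missing-value code as NA
toy = na.omit(toy)      # drop every row that contains an NA

# Convert the text columns to numbers (alternative to class(x) = "double")
toy$CO2   = as.numeric(toy$CO2)
toy$GDPpc = as.numeric(toy$GDPpc)

str(toy)                # inspect the resulting column types
```

Replacing ".." before converting matters: converting a column that still contains ".." would turn those entries into NA with a coercion warning instead.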
Data Analysis
Analyze the WDI data using R/RStudio and answer the following 10 questions.
Create a new variable "CO2k" by converting the data on CO2 emissions from metric tons per capita into kilograms (kg) per capita (by multiplying the original data "CO2" by 1,000). Then, create a scatter plot of CO2 emissions per capita in kg (vertical axis) against per capita GDP (horizontal axis). Please label each axis clearly.
Under the assumption that CO2k (CO2 emissions per capita in kg) is distributed independently and identically in the population, construct a 90% confidence interval of the population mean of CO2 emissions per capita (in kg) manually (that is, using the sample mean, sample variance, and the appropriate critical values obtained from either R or statistical tables). Interpret the calculated confidence interval.
Estimate a multiple regression model with CO2 emissions per capita (in kg) as the dependent variable, and GDP per capita, GDP per capita squared, population density, and the share of population living in urban areas as explanatory variables. Write down the estimated sample regression equation.
For the regression model estimated in Question 3, interpret the reported R-squared as well as the standard error of the regression. Briefly comment on the model's goodness of fit to the observed data.
For the regression model estimated in Question 3, provide interpretations of the estimated coefficients for PopDen and UrbPop.
For the regression model estimated in Question 3, test at the 10% significance level whether the true population coefficient for PopDen is negative, using a critical value approach. State clearly the null and alternative hypotheses.
For the regression model estimated in Question 3, construct a 99% confidence interval of the true population coefficient for UrbPop manually (that is, using the estimated coefficient, standard error and the appropriate critical values obtained from either R or statistical tables). Interpret the obtained confidence interval.
Using the regression model estimated in Question 3, calculate the predicted values of CO2k over the range of GDP per capita observed in the sample (in increments of 1,000) whilst keeping PopDen and UrbPop at their respective sample means. Create a two-dimensional diagram in which the predicted values of CO2k (vertical axis) are plotted against GDP per capita (horizontal axis). Briefly describe the relationship between CO2 emissions per capita and GDP per capita as implied by the estimated regression model. Does this have the shape you expected? Explain why or why not.
Based on the model estimated in Question 3, find the level of GDP per capita where the effect of GDP per capita on CO2 emissions changes its sign. Briefly comment on how this relates to your answer to Question 8 above.
Following the prompts provided below, test a joint hypothesis that the true population coefficients for PopDen and UrbPop are both equal to zero at a 5% significance level.
(i) Formulate the null and alternative hypotheses.
(ii) Write down the regression model(s) that need to be estimated to test the hypotheses formulated in (i).
(iii) Estimate the required regression model(s) and calculate the necessary test statistics.
(iv) Obtain the relevant critical value(s) and determine whether or not to reject the null hypothesis.
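The manual calculations asked for above (a confidence interval built from the sample mean and variance, the turning point of a quadratic relationship, and an F-test comparing restricted and unrestricted models) can be rehearsed on simulated data before applying them to the WDI data. The following is a minimal sketch under stated assumptions: the variables y, x, z1, and z2 are simulated and hypothetical, and the formulas are the standard textbook ones noted in the comments, not output from the project data.

```r
# Simulated data only; y, x, z1 and z2 are hypothetical variables.
set.seed(1)
n  = 200
x  = runif(n, 0, 10)
z1 = rnorm(n)
z2 = rnorm(n)
y  = 1 + 4 * x - 0.25 * x^2 + 0.5 * z1 + rnorm(n)

# (a) Manual 90% confidence interval for the population mean of y:
#     ybar +/- t_crit * sqrt(s^2 / n)
ybar  = mean(y)
se    = sqrt(var(y) / n)
tcrit = qt(0.95, df = n - 1)   # 5% in each tail for a two-sided 90% interval
ci    = c(ybar - tcrit * se, ybar + tcrit * se)

# (b) Quadratic regression; the marginal effect b1 + 2 * b2 * x
#     changes sign at x* = -b1 / (2 * b2)
fit = lm(y ~ x + I(x^2) + z1 + z2)
b   = coef(fit)
turning_point = unname(-b["x"] / (2 * b["I(x^2)"]))

# (c) Joint F-test that the coefficients on z1 and z2 are both zero:
#     F = ((SSR_r - SSR_u) / q) / (SSR_u / df_u)
fit_r  = lm(y ~ x + I(x^2))    # restricted model without z1 and z2
ssr_r  = sum(resid(fit_r)^2)
ssr_u  = sum(resid(fit)^2)
q      = 2                     # number of restrictions
df_u   = df.residual(fit)
F_stat = ((ssr_r - ssr_u) / q) / (ssr_u / df_u)
F_crit = qf(0.95, df1 = q, df2 = df_u)  # 5% critical value

print(ci)
print(turning_point)
print(c(F_stat = F_stat, F_crit = F_crit))
```

As a cross-check on the manual steps, t.test(y, conf.level = 0.90) and anova(fit_r, fit) should reproduce the interval in (a) and the F statistic in (c).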
Further Instructions
This is an individual project, not a group project. You are required to work and compose your report individually.
You need to conduct all data analysis and compose your report using RMarkdown. A starter RMarkdown file "ECOM5000_pj_2024_Starter.rmd" will help you with the initial setup and with loading the data. You need to complete the rest of the code to perform the necessary data analysis.
You need to submit two files through Blackboard:
(i) A complete RMarkdown file (".rmd" file), and
(ii) A report (in pdf format) created by knitting your markdown file (i). It should contain your R code for data analysis, the output from the data analysis, and your text-based answers.