Build a linear regression model to predict life

Assignment Help Other Subject
Reference no: EM132951072

Big Data Analytics Assessment

Overview

Write Python code (especially PySpark where possible) to implement the tasks below (Please explain your code in detail in your report). Where appropriate, the output of code execution should be presented that relates to your answer to the tasks into your report, as the evidence of working program (e.g., screenshots). Output included without supporting explanation or interpretation will not receive credit.

Where the tasks involve big data analytics with machine learning, you may need to formulate technical solution step by step, e.g., choose appropriate machine learning models / algorithms, justifying appropriateness of models/algorithms/techniques used, applying them in the context of the given task, and practising data visualisation techniques where appropriate. Test your solution and conduct experiments where appropriate. Evaluate the performance of implemented solution and analyse results where appropriate. Delve into deep technical explanations of the results and suggest possible improvement, etc. Make conclusions where appropriate.

You need to present your solution to the tasks into a technical report. The report should be submitted onto Blackboard.

Assignment tasks

Part (I)

You will need to download the data file "Medical_info.csv" from the Blackboard. The dataset is about a study of factors predicting serum testosterone levels in a screened population in some country. The key to the variables in the dataset is explained in the table below. Please note there are null or missing values in some columns of the dataset.

id

Identification Number

age

Age (years)

BMI

Body Mass Index (kg/m2)

PSA

Prostate Specific Antigen (ng/mL)

TG

Serum Triglyceride (mg/dL)

Cholesterol

Serum Cholesterol (mg/dL)

LDLChole

Serum Low-Density Lipoprotein (mg/dL)

HDLChole

Serum High-Density Lipoprotein (mg/dL)

Glucose

Fasting Plasma Glucose (mg/dL)

Testosterone

Serum Testosterone (ng/mL)

BP_1

Blood pressure category, coded as 1=Low, 2=High

1. Load the data file into a Spark DataFrame (1st DataFrame). Describe the structure of the DataFrame.

2. Create a new DataFrame (2nd DataFrame) by removing all the rows with null/missing values in the 1st DataFrame and calculate the number of rows removed.

3. Calculate summary statistics of the ‘age' feature in the 2nd DataFrame, including its min value, max value, mean value, median value, variance and standard deviation. Generate a histogram for the ‘age' feature and describe the distribution of the feature.

4. Display the quartile info of the ‘BMI' feature in the 2nd DataFrame. Generate a boxplot for the ‘BMI' feature and discuss the distribution of the feature based on the boxplot.

5. Use Spark DataFrame API (i.e., expression methods) to count the number of rows where ‘age' is greater than 50 and ‘BP_1' equals 1.

6. Use the ‘BP_1' feature in the 2nd DataFrame as the target label, to build two classification models based on all other columns as predictors. Conduct performance evaluation for the two models and make conclusions.

Part (II)

You will need to download the data file "Region_info.csv" from the Blackboard. The dataset provides information such as population size, average life expectancy, GDP per capita, and so on for some regions. The key to the variables in the dataset is explained in the table below.

Population

Identification Number

fertility

Average number of children a woman in a given country gives birth to

HIV

HIV infection rate

CO2

CO2 emission (tonnes per person)

BMI_male

Body Mass Index (male)

GDP

Gross Domestic Product

BMI_female

Body Mass Index (female)

life

Average life expectancy

child_mortality

Death of children under 5 years of age per 1000 live births

region

Region

1. Load the data file into a Spark DataFrame (1st DataFrame). Describe the structure of the created data frame.

2. Create a new DataFrame (2nd DataFrame) by removing the ‘region' column.

3. Use a graph, explore and describe the relationship between ‘fertility' feature and ‘life' feature in the 2nd DataFrame.

4. Use Spark SQL query to display the ‘fertility' and ‘life' columns in the 2nd DataFrame where ‘fertility' is great than 1.0 and ‘life' is greater than 70.

5. Build a linear regression model to predict life expectancy (the ‘life' column) in the 2nd DataFrame using the ‘fertility' column as the predictor. Conduct performance evaluation for the model and make conclusions.

6. Build a Lasso regression model to predict life expectancy (the ‘life column) in the 2nd DataFrame using all other columns as the predictor. Conduct performance evaluation for the model and make conclusions.

Attachment:- Big Data Analytics Assessment.rar

Reference no: EM132951072

Questions Cloud

What is the residual income for the division : The Commercial Division of Galactica Company has operating income of $217,000 and assets of $804,000. What is the residual income for the division
Why different organisations will have different models : Explain why different organisations will have different models and what some of those models might be based on. Identify at least eight different service models
What is the cost of common equity : The stock is expected to pay a dividend of $2.25 a share at the end of the year (D1 = $2.25), What is the cost of common equity
What should richman co report as unrealized gain : Richman Co. purchased P300,000 of 8%, 5-year bonds from Carlin. What should Richman Co. report as unrealized gain to other comprehensive income for year 2022?
Build a linear regression model to predict life : Build a linear regression model to predict life expectancy (the ‘life' column) in the 2nd DataFrame using the ‘fertility' column as the predictor
Prepare a statement of change in equity for the month : Prepare a statement of change in equity for the month ended in February 28,2017. Assume a loss of $7000 was realized in January.
What will the bond trade for today : The bond will be redeemed for $1,000 at that time. If investors are looking for a 6.00% annual return to hold the bond, what will the bond trade for today
How much will each partner receive for mr x : Mr X., Mr.Y. and Mr. Z, Mr. X contributed $60000, Mr. Y contributed $50000 and Mr. Z contributed $30000. How much will each partner receive?
What is depreciation expense in the first full year : One piece of equipment was purchased for $400,000, with an estimated useful life of 8 years and residual value of $40,000. What is depreciation expense

Reviews

Write a Review

Other Subject Questions & Answers

  Cross-cultural opportunities and conflicts in canada

Short Paper on Cross-cultural Opportunities and Conflicts in Canada.

  Sociology theory questions

Sociology are very fundamental in nature. Role strain and role constraint speak about the duties and responsibilities of the roles of people in society or in a group. A short theory about Darwin and Moths is also answered.

  A book review on unfaithful angels

This review will help the reader understand the social work profession through different concepts giving the glimpse of why the social work profession might have drifted away from its original purpose of serving the poor.

  Disorder paper: schizophrenia

Schizophrenia does not really have just one single cause. It is a possibility that this disorder could be inherited but not all doctors are sure.

  Individual assignment: two models handout and rubric

Individual Assignment : Two Models Handout and Rubric,    This paper will allow you to understand and evaluate two vastly different organizational models and to effectively communicate their differences.

  Developing strategic intent for toyota

The following report includes the description about the organization, its strategies, industry analysis in which it operates and its position in the industry.

  Gasoline powered passenger vehicles

In this study, we examine how gasoline price volatility and income of the consumers impacts consumer's demand for gasoline.

  An aspect of poverty in canada

Economics thesis undergrad 4th year paper to write. it should be about 22 pages in length, literature review, economic analysis and then data or cost benefit analysis.

  Ngn customer satisfaction qos indicator for 3g services

The paper aims to highlight the global trends in countries and regions where 3G has already been introduced and propose an implementation plan to the telecom operators of developing countries.

  Prepare a power point presentation

Prepare the power point presentation for the case: Santa Fe Independent School District

  Information literacy is important in this environment

Information literacy is critically important in this contemporary environment

  Associative property of multiplication

Write a definition for associative property of multiplication.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd