Building a regression model with Spark

Reference no: EM132119741

Big Data Assignment -

Regression Models - Regression models are concerned with target variables that can take any real value. The underlying principle is to find a model that maps input features to predicted target variables. Regression is also a form of supervised learning.

Regression models can be used to predict just about any variable of interest. A few examples include the following:

  • Predicting stock returns and other economic variables
  • Predicting loss amounts for loan defaults (this can be combined with a classification model that predicts the probability of default, while the regression model predicts the amount in the case of a default)
  • Recommendations (the Alternating Least Squares factorization model from Chapter 5, Building a Recommendation Engine with Spark, uses linear regression in each iteration)
  • Predicting customer lifetime value (CLTV) in a retail, mobile, or other business, based on user behavior and spending patterns

In the different sections of this chapter, we will do the following:

  • Introduce the various types of regression models available in ML
  • Explore feature extraction and target variable transformation for regression models
  • Train a number of regression models using ML
  • See how to make predictions using the trained model
  • Investigate the impact on performance of various parameter settings for regression using cross-validation

Types of regression models - The core idea of linear models (or generalized linear models) is that we model the predicted outcome of interest (often called the target or dependent variable) as a function of a simple linear predictor applied to the input variables (also referred to as features or independent variables).

y = f(w^T x)

Here, y is the target variable, w is the vector of parameters (known as the weight vector), and x is the vector of input features. w^T x is the linear predictor (the dot product of the weight vector w and the feature vector x). To this linear predictor, we apply a function f (called the link function). Linear models can, in fact, be used for both classification and regression simply by changing the link function. Standard linear regression uses an identity link (that is, y = w^T x directly), while binary classification uses alternative link functions, such as the logistic function.
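As a minimal sketch of the formula above, the following plain-Python snippet (not Spark code) computes the linear predictor and then applies two choices of f: the identity for regression and the logistic function for binary classification. The function names and the sample weights and features are illustrative assumptions.

```python
import math


def linear_predictor(w, x):
    """Dot product of the weight vector w and the feature vector x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x))


def identity(z):
    """f for standard linear regression: the prediction is the linear predictor itself."""
    return z


def logistic(z):
    """f commonly used for binary classification: maps z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, x, f=identity):
    """Apply f to the linear predictor of w and x."""
    return f(linear_predictor(w, x))


# Hypothetical weights and features, for illustration only.
w = [0.5, -1.0, 2.0]
x = [2.0, 1.0, 0.5]

regression_output = predict(w, x)                # 0.5*2 - 1*1 + 2*0.5 = 1.0
classification_output = predict(w, x, logistic)  # a value strictly between 0 and 1
```

Only the function f changes between the two calls; the weights and the linear predictor are shared.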

Spark's ML library offers different regression models, which are as follows:

  • Linear regression
  • Generalized linear regression
  • Logistic regression
  • Decision trees
  • Random forest regression
  • Gradient boosted trees
  • Survival regression
  • Isotonic regression
  • Ridge regression

Regression models define the relationship between a dependent variable and one or more independent variables. They build the model that best fits the values of the independent variables, or features.

Linear regression, unlike classification models such as support vector machines and logistic regression, is used to predict a continuous value of the dependent variable rather than an exact class label.

Linear regression models are essentially the same as their classification counterparts; the only difference is that linear regression models use a different loss function, related link function, and decision function. Spark ML provides a standard least squares regression model (although other types of generalized linear models for regression are planned).

Assignment -

1. Using Python 3, build the following regression models:

  • Decision Tree
  • Gradient Boosted Tree
  • Linear regression
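One possible starting point for step 1 is sketched below with the DataFrame-based `pyspark.ml` API. It assumes the input DataFrames already have a vector column named `features` (for example, produced by `VectorAssembler`) and a numeric column named `label`. Those column names, the helper names, and the parameter values are illustrative assumptions, not part of the assignment. PySpark is imported inside the functions so that the module loads even where PySpark is not installed.

```python
# Illustrative settings; the values are assumptions, not tuned recommendations.
REGRESSOR_PARAMS = {
    "linear_regression": {"maxIter": 10, "regParam": 0.0, "elasticNetParam": 0.0},
    "decision_tree": {"maxDepth": 5, "maxBins": 32},
    "gradient_boosted_tree": {"maxIter": 20, "maxDepth": 5, "maxBins": 32},
}


def build_regressors(features_col="features", label_col="label"):
    """Return untrained Spark ML estimators keyed by model name."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.ml.regression import (
        DecisionTreeRegressor,
        GBTRegressor,
        LinearRegression,
    )

    classes = {
        "linear_regression": LinearRegression,
        "decision_tree": DecisionTreeRegressor,
        "gradient_boosted_tree": GBTRegressor,
    }
    return {
        name: cls(featuresCol=features_col, labelCol=label_col, **REGRESSOR_PARAMS[name])
        for name, cls in classes.items()
    }


def train_and_score(train_df, test_df, label_col="label"):
    """Fit each regressor on train_df and return its RMSE on test_df."""
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(
        labelCol=label_col, predictionCol="prediction", metricName="rmse"
    )
    scores = {}
    for name, estimator in build_regressors(label_col=label_col).items():
        model = estimator.fit(train_df)
        scores[name] = evaluator.evaluate(model.transform(test_df))
    return scores
```

Calling `train_and_score(train_df, test_df)` on a train/test split (for instance from `DataFrame.randomSplit`) would return a dictionary mapping each model name to its test RMSE.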

2. Select a dataset (other than the example dataset given in section 3) and apply the Decision Tree and Linear regression models created above. Choose a dataset from Kaggle.

3. Build the following in relation to the gradient boosted tree and the dataset chosen in step 2

  • Gradient boosted tree iterations
  • Gradient boosted tree Max Bins
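The two items in step 3 can be read as parameter sweeps: train the gradient boosted tree once per candidate value and compare an error metric. The sketch below separates a generic sweep helper (plain Python) from a hypothetical `gbt_rmse` callback that uses `pyspark.ml`'s `GBTRegressor`, where `maxIter` is the number of boosting iterations and `maxBins` is the number of bins used to discretize continuous features. The column names and candidate values are assumptions.

```python
def sweep(values, evaluate):
    """Evaluate a model for each candidate value; return [(value, metric), ...]."""
    return [(value, evaluate(value)) for value in values]


def best_setting(results):
    """Pick the (value, metric) pair with the lowest error metric."""
    return min(results, key=lambda pair: pair[1])


def gbt_rmse(train_df, test_df, **params):
    """Train a Spark ML GBTRegressor with the given params and return test RMSE."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import GBTRegressor

    model = GBTRegressor(featuresCol="features", labelCol="label", **params).fit(train_df)
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse"
    )
    return evaluator.evaluate(model.transform(test_df))


# Candidate values are assumptions chosen for illustration.
ITERATION_CANDIDATES = [1, 5, 10, 20]
MAX_BINS_CANDIDATES = [8, 16, 32, 64]
```

With Spark DataFrames at hand, `sweep(ITERATION_CANDIDATES, lambda n: gbt_rmse(train_df, test_df, maxIter=n))` would produce one RMSE per iteration count, and the same pattern with `maxBins=n` covers the second item. Note that Spark requires `maxBins` to be at least the number of categories of any categorical feature.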

4. Build the following in relation to the decision tree and the dataset chosen in step 2

  • Decision Tree Categorical features
  • Decision Tree Log
  • Decision Tree Max Bins
  • Decision Tree Max Depth
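Assuming "Decision Tree Log" refers to training on a log-transformed target (the target variable transformation mentioned earlier), the plain-Python sketch below shows the transformation, its inverse for mapping predictions back to the original scale, and the root mean squared log error (RMSLE) often reported alongside it. Using log(1 + y) rather than log(y) is a choice made here so that zero-valued targets are handled; the function names and sample values are illustrative.

```python
import math


def log_transform(targets):
    """Map each target y to log(1 + y); requires y > -1."""
    return [math.log1p(y) for y in targets]


def inverse_log_transform(values):
    """Undo log_transform: map each value v back to exp(v) - 1."""
    return [math.expm1(v) for v in values]


def rmsle(actual, predicted):
    """Root mean squared log error between two equal-length, non-empty sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical target values (for example, hourly rental counts).
counts = [0.0, 9.0, 99.0]
log_counts = log_transform(counts)            # train the model on these values
restored = inverse_log_transform(log_counts)  # map predictions back to counts
```

The model is trained on the transformed targets, and its predictions are passed through `inverse_log_transform` before computing metrics on the original scale.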

5. Build the following in relation to the linear regression and the dataset chosen in step 2

a) Linear regression Cross Validation

  • Intercept
  • Iterations
  • Step size
  • L1 Regularization
  • L2 Regularization

b) Linear regression Log (see section 5.4)
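The cross-validation items in step 5a could be approached as sketched below with `pyspark.ml.tuning`. The mapping from assignment items to `LinearRegression` parameters is an assumption: `fitIntercept` for the intercept, `maxIter` for iterations, and `regParam` combined with `elasticNetParam` for L1 and L2 regularization. To my knowledge the DataFrame-based `LinearRegression` does not expose a step size (the older RDD-based `LinearRegressionWithSGD` took a `step` argument), so step size is left out of this grid. Column names and candidate values are illustrative.

```python
# Assignment item -> Spark ML LinearRegression parameter and candidate values.
# Candidate values are illustrative assumptions.
LINEAR_GRID = {
    "fitIntercept": [True, False],      # Intercept
    "maxIter": [10, 50, 100],           # Iterations
    "regParam": [0.0, 0.01, 0.1, 1.0],  # Regularization strength
    "elasticNetParam": [0.0, 1.0],      # 0.0 = L2 penalty, 1.0 = L1 penalty
}


def grid_size(grid):
    """Number of parameter combinations the grid will produce."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def cross_validate_linear(train_df, num_folds=3, grid=LINEAR_GRID):
    """Run k-fold cross-validation over the grid; return the fitted CrossValidatorModel."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LinearRegression(featuresCol="features", labelCol="label")
    builder = ParamGridBuilder()
    for name, values in grid.items():
        builder = builder.addGrid(lr.getParam(name), values)
    evaluator = RegressionEvaluator(
        labelCol="label", predictionCol="prediction", metricName="rmse"
    )
    cv = CrossValidator(
        estimator=lr,
        estimatorParamMaps=builder.build(),
        evaluator=evaluator,
        numFolds=num_folds,
    )
    return cv.fit(train_df)
```

The full grid trains one model per combination per fold, so it may be cheaper to sweep one parameter at a time by passing a smaller `grid`. The returned model's `avgMetrics` holds the mean RMSE per combination and `bestModel` the refitted best estimator.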

6. Follow the provided example of the bike sharing dataset and the guidelines in the sections that follow this section to develop the requirements given in steps 1, 3, 4 and 5.

Attachment:- Assignment Files.rar

Verified Expert

The regression line is constructed by optimizing the parameters of the straight line function such that the line best fits a sample of (x, y) observations where y is a variable dependent on the value of x. Regression analysis is used extensively in economics, risk management, and trading. One cool application of regression analysis is in calibrating certain optimistics results


Building a regression model with spark : ICT707 Big Data Assignment - Explore feature extraction and target variable transformation for regression models - Building a Regression Model with Spark

Reviews

urv2119741

11/3/2018 3:27:59 AM

Please add the following points to the file: 1. Introduction 2. Objectives 3. Data source 4. Structure of the database 5. Explanation of machine learning 6. Data preparation 7. Testing 8. Result and recommendation 9. Conclusion 10. References

The assignment was done according to the required instructions, and in fact it was done methodically and I am very satisfied with it. It was ready before the deadline. I would recommend this site to everyone else.

urv2119741

11/3/2018 3:24:31 AM

Big Data Assignment Marking Criteria

The Big Data Assignment comprises two parts:

  • The first part is to create the algorithms in the tasks, namely Decision Tree, Gradient Boosted Tree and Linear regression, and then to apply them to the bike sharing dataset provided. Try to produce the output given in the task sections (also given in the Big-Data Assignment.docx provided on Blackboard).
  • The second part is to use the algorithms created in the first part and apply them to another dataset chosen from Kaggle (other than the bike sharing dataset provided).

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd