Conduct an exploratory data analysis

Reference no: EM131514317

The key frameworks and concepts covered in modules 1-10 are particularly relevant for this assignment. Assignment 3 relates to the specific course learning objectives 1, 2, 3 and 4:

1. apply knowledge of people, markets, finances, technology and management in a global context of business intelligence practice (data warehousing and big data architecture, data mining process, data visualisation and performance management) and resulting organisational change and understand how these apply to the implementation of business intelligence in organisation systems and business processes

2. identify and solve complex organisational problems creatively and practically through the use of business intelligence and critically reflect on how evidence based decision making and sustainable business performance management can effectively address real-world problems

3. comprehend and address complex ethical dilemmas that arise from evidence based decision making and business performance management

4. communicate effectively in a clear and concise manner in written report style for senior management with the correct and appropriate acknowledgment of the main ideas presented and discussed.

Task 1

The goal of Task 1 is to predict the likelihood of rainfall for tomorrow (next day) based on today's weather conditions. In Task 1 of Assignment 3 you are required to use the data mining tool RapidMiner to analyse and report on the weatherAUS.csv data set provided for Assignment 3. You should review the data dictionary for weatherAUS.csv data set (see Table 1 below).

Table 1 Data dictionary for Australian Weather Data set variables

Description
Date of weather observation.
Common name of the location of the weather station.
Minimum temperature in degrees Celsius.
Maximum temperature in degrees Celsius.
Amount of rainfall recorded for the day in mm.
So-called Class A pan evaporation (mm) in the 24 hours to 9am.
Number of hours of bright sunshine in the day.
Direction of the strongest wind gust in the 24 hours to midnight.
Speed (km/h) of the strongest wind gust in the 24 hours to midnight.
Direction of wind at 9am.
Direction of wind at 3pm.
Wind speed (km/h) averaged over 10 minutes prior to 9am.
Wind speed (km/h) averaged over 10 minutes prior to 3pm.
Relative humidity (percent) at 9am.
Relative humidity (percent) at 3pm.
Atmospheric pressure (hPa) reduced to mean sea level at 9am.
Atmospheric pressure (hPa) reduced to mean sea level at 3pm.
Fraction of sky obscured by cloud at 9am, measured in "oktas" (eighths): the number of eighths of the sky obscured by cloud. A value of 0 indicates a completely clear sky and 8 indicates a completely overcast sky.
Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values.
Temperature (degrees C) at 9am.
Temperature (degrees C) at 3pm.
Integer: Yes if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise No.
Amount of rain. A kind of measure of the "risk".
Target variable. Did it rain tomorrow? Yes or No.

The Australian Weather data set contains over 138,000 daily observations, from January 2008 through to January 2017, drawn from 49 Australian weather stations.

In completing Task 1 of Assignment 3 you will need to apply the business understanding, data understanding, data preparation, modelling and evaluation phases of the CRISP-DM data mining process.

Task 1.1 Conduct an exploratory data analysis of the weatherAUS.csv data set using RapidMiner to understand the characteristics of each variable and the relationship of each variable to the other variables in the data set. Summarise your findings in a table named Task 1.1 Results of Exploratory Data Analysis for weatherAUS Data Set. For each variable, the table should describe key characteristics such as maximum and minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values, and relationships with other variables if relevant.

Hint: The Statistics Tab and Chart Tab in RapidMiner provide a lot of descriptive statistical information and useful charts such as bar charts and scatterplots. You might also like to run some correlations and chi-square tests. Indicate in the Task 1.1 table which variables you consider to be the key variables that contribute most to determining whether it is likely to rain tomorrow.

Briefly discuss the key results of your exploratory data analysis and the justification for selecting your five top variables for predicting whether it is likely to rain tomorrow based on today's weather conditions. (About 250 words)
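The descriptive statistics requested for the Task 1.1 table can be illustrated outside RapidMiner with a minimal Python sketch. It is not part of the required RapidMiner workflow. The column names `max_temp` and `rain_today` and the sample values are placeholders invented for illustration, not values from weatherAUS.csv.

```python
import statistics
from collections import Counter


def summarise(values):
    """Return descriptive statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {"count": len(present), "missing": len(values) - len(present)}
    # Numeric summaries only make sense when every present value is a number.
    if present and all(isinstance(v, (int, float)) for v in present):
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.mean(present)
        # The sample standard deviation needs at least two observations.
        summary["stdev"] = statistics.stdev(present) if len(present) > 1 else 0.0
    if present:
        # Most frequent value (mode); ties resolve to the first value seen.
        summary["mode"] = Counter(present).most_common(1)[0][0]
    return summary


# Hypothetical sample columns, invented for illustration only.
max_temp = [22.9, 25.1, None, 28.0, 25.1]
rain_today = ["No", "No", "Yes", None, "No"]

temp_summary = summarise(max_temp)
rain_summary = summarise(rain_today)
print(temp_summary)
print(rain_summary)
```

Numeric columns receive minimum, maximum, mean and standard deviation, while nominal columns receive only counts and the mode, which mirrors the split between numeric and nominal attributes in the results table.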

Task 1.2 Build a Decision Tree model for predicting whether it is likely to rain tomorrow based on today's weather conditions, using RapidMiner, an appropriate set of data mining operators and a reduced weatherAUS.csv data set determined by your exploratory data analysis in Task 1.1. Provide these outputs from RapidMiner: (1) Final Decision Tree Model process, (2) Final Decision Tree diagram, and (3) associated decision tree rules.

Briefly explain your final Decision Tree Model Process, and discuss the results of the Final Decision Tree Model drawing on the key outputs (Decision Tree Diagram, Decision Tree Rules) for predicting whether it is likely to rain tomorrow based on today's weather conditions and relevant supporting literature on the interpretation of decision trees (About 250 words).
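When interpreting a decision tree, it can help to see how a split criterion ranks attributes. The sketch below computes information gain, one common criterion, for a single nominal attribute. The attribute name `humidity_band`, the target name `rain_tomorrow` and the eight rows are hypothetical, and the criterion used in your own RapidMiner process may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Hypothetical rows: 4 "high" days (3 followed by rain), 4 "low" days (1 followed by rain).
rows = (
    [{"humidity_band": "high", "rain_tomorrow": "Yes"}] * 3
    + [{"humidity_band": "high", "rain_tomorrow": "No"}]
    + [{"humidity_band": "low", "rain_tomorrow": "Yes"}]
    + [{"humidity_band": "low", "rain_tomorrow": "No"}] * 3
)

gain = information_gain(rows, "humidity_band", "rain_tomorrow")
print(round(gain, 4))
```

An attribute with higher gain separates the Yes and No classes more cleanly, which is why such attributes tend to appear near the root of the tree.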

Task 1.3 Build a Logistic Regression model for predicting whether it is likely to rain tomorrow based on today's weather conditions, using RapidMiner, an appropriate set of data mining operators and a reduced weatherAUS.csv data set determined by your exploratory data analysis in Task 1.1. Provide these outputs from RapidMiner: (1) Final Logistic Regression Model process, (2) Coefficients, and (3) Odds Ratios. Hint: you will need to install the Weka Extension in RapidMiner and use the W-Logistic Regression Operator for Task 1.3, and you may need to change the data types of some variables.

Briefly explain your final Logistic Regression Model Process, and discuss the results of the Final Logistic Regression Model drawing on the key outputs (Coefficients, Odds Ratios) for predicting whether it is likely to rain tomorrow based on today's weather conditions and relevant supporting literature on the interpretation of logistic regression models (About 250 words).
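The relationship between coefficients and odds ratios can be shown with a short worked sketch: each odds ratio is the exponential of its coefficient, and the predicted probability is the logistic function of the linear predictor. The coefficient values, intercept and feature names below are hypothetical and are not results from the weatherAUS data.

```python
import math

# Hypothetical fitted values (log-odds per unit change); invented for illustration.
intercept = -1.0
coefficients = {"humidity_3pm": 0.05, "pressure_3pm": -0.10}

# Odds ratio = exp(coefficient): the multiplicative change in the odds of rain
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}


def predict_probability(features):
    """Logistic function applied to intercept + sum(coefficient * value)."""
    z = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical inputs, treated here as centred values rather than raw readings.
p = predict_probability({"humidity_3pm": 40.0, "pressure_3pm": 10.0})
print(odds_ratios, p)
```

An odds ratio above 1 means the odds of rain tomorrow rise as the variable increases, and an odds ratio below 1 means they fall.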

Task 1.4 You will need to validate your Final Decision Tree Model and Final Logistic Regression Model. Note you will need to use the X-Validation Operator, Apply Model Operator and Performance Operator in your data mining process models here.

Discuss and compare the accuracy of your Final Decision Tree Model with the Final Logistic Regression Model for predicting whether it is likely to rain tomorrow based on today's weather conditions, based on the results of the confusion matrix and ROC chart for each final model. You should use a table here to compare the key results of the confusion matrix for the Final Decision Tree Model and Final Logistic Regression Model (About 250 words).
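The key figures in such a comparison table can be derived from the four confusion matrix counts. The sketch below shows the arithmetic, treating "Yes" (rain tomorrow) as the positive class. The counts are hypothetical and do not come from any model run.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall for the positive class from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a model never predicts the positive class.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for two models evaluated on the same 500 examples.
tree = confusion_metrics(tp=50, fp=25, fn=50, tn=375)
logit = confusion_metrics(tp=60, fp=20, fn=40, tn=380)

for name, metrics in (("Decision Tree", tree), ("Logistic Regression", logit)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because rainy days are usually the minority class, accuracy alone can look high even when recall for "Yes" is poor, so it is worth reporting all three figures side by side.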

Note the important outputs from your data mining analyses conducted in RapidMiner for Task 1 should be included in your Assignment 3 report to provide support for your conclusions reached regarding each analysis conducted for Task 1.1, Task 1.2, Task 1.3 and Task 1.4. You can export the important outputs from RapidMiner as jpg image files and include these screenshots in the relevant Task 1 parts of your Assignment 3 Report.

Note you will find the North textbook a useful reference for the data mining process activities conducted in Task 1 in relation to the exploratory data analysis, decision tree analysis, logistic regression analysis and evaluation of the accuracy of the Final Decision Tree model and the Final Logistic Regression model.

Task 2

Research the relevant literature on how big data analytics capability can be incorporated into a data warehouse architecture. Note Chapter 2 Data Warehousing and Chapter 6 Big Data and Analytics of Sharda et al. 2014 Textbook will be particularly useful for answering some aspects of Task 2.

Task 2.1 Provide a high level data warehouse architecture design for a large state-owned water utility that incorporates big data capture, processing, storage and presentation in a diagram called Figure 1.1 Big Data Analytics and Data Warehouse Combined.

Task 2.2 Describe and justify the main components of your proposed high level data warehouse architecture design with big data capability incorporated presented in Figure 1.1 with appropriate in-text referencing support (about 750 words).

Task 2.3 Identify and discuss the key security, privacy and ethical concerns for organisations within a specific industry that are already using a big data analytics and algorithmic approach to decision making, with appropriate in-text referencing support (about 750 words).

Task 3

Scenario Dashboard

The Los Angeles Police Department (LAPD) is responsible for enforcing law and order in the City of Los Angeles, which is the cultural, financial, and commercial centre of Southern California. With a census-estimated 2015 population of 3,971,883, it is the second-most populous city in the United States (after New York City) and the most populous city in California. Located in a large coastal basin surrounded on three sides by mountains reaching up to and over 10,000 feet (3,000 m), Los Angeles covers an area of about 469 square miles (1,210 km²).

The LAPD Crime Analytics Unit would like to have a Crime Events dashboard built to provide a better understanding of the patterns occurring in different crimes across the 21 Police Department areas over time in the City of Los Angeles. In particular, they would like to see whether there are any distinct patterns in relation to (1) types of crimes and (2) frequency of each type of crime across each of the 21 Police Department areas for the years 2012 through to the first quarter of 2016, based on the LACrimes2012-2016.csv data set. Note this is a large data set containing over 1 million records. This Crime Events dashboard will assist the LAPD to better manage and coordinate their efforts in catching the perpetrators of these crimes and to be more proactive in preventing these crimes from occurring in the first place.

The LAPD Crime Analytics Unit wants the flexibility to visualize the frequency that each type of crime is occurring over time across each of the 21 Police Department areas/districts in the City of Los Angeles. They want to be able to get a quick overview of the crime data in relation to category of crimes, location, date of occurrence and frequency that each crime is occurring over time and then be able to zoom in and filter on particular aspects and then get further details as required.

LA Crimes Data Set Data Dictionary

No.  Data type  Description
1.   character  Original dataset id
2.   date       Date crime was reported
3.   character  Count of Date Reported
4.   date       Date crime occurred
5.   date       Time crime occurred on a day
6.   character  Area Code
7.   character  Area geographical location
8.   character  Nearby road identifier
9.   character  Crime type code
10.  character  Crime type description
11.  character  Status code
12.  character  Status outcome of crime
13.  character  Nearby address location
14.  character  Nearby cross street
15.  numeric    Latitude of crime event
16.  numeric    Longitude of crime event
17.  numeric    Year of crime occurred
18.  numeric    Month of crime occurred
19.  numeric    Day of month crime occurred
20.  numeric    Hour of day crime occurred
21.             Month and year when crime occurred
22.  character  Day of week crime occurred
23.  character  Weekday/weekend classification for crime event
24.  character  Occurred at an intersection
25.  character  Subjective binning of crimes

Task 3 requires a Tableau dashboard consisting of four crime event views of the LA Crimes 2012-2016 data set.

Task 3.1 Specific Crimes within each Crime Category for a specific Police Department Area and specific year

Task 3.2 Frequency of Occurrence for a selected crime over 24 hours for a specific Police Department Area

Task 3.3 Frequency of Crimes within each Crime Classification by Police Department Area and by Time

Task 3.4 Geographical (location) presentation of each Police Department Area for given crime(s) and year. Note for this task you will need to make use of the geo-mapping capability of Tableau Desktop.

You should briefly discuss the key findings for each of these four views in your Crimes Event Dashboard (about 60 words each and 250 words in total)
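The aggregation behind a view such as Task 3.2 (frequency of a selected crime over 24 hours for one area) can be sketched in a few lines. Tableau performs this grouping for you, so the sketch only illustrates what the view computes. The field names `area`, `crime` and `hour` and the five records are hypothetical, not taken from LACrimes2012-2016.csv.

```python
from collections import Counter


def hourly_counts(records, area, crime):
    """Count events per hour of day (0-23) for one area and one crime type."""
    counts = Counter(
        r["hour"] for r in records if r["area"] == area and r["crime"] == crime
    )
    # Return a full 24-slot profile so hours with no events show as zero.
    return [counts.get(hour, 0) for hour in range(24)]


# Hypothetical records, invented for illustration only.
records = [
    {"area": "Area A", "crime": "BURGLARY", "hour": 2},
    {"area": "Area A", "crime": "BURGLARY", "hour": 2},
    {"area": "Area A", "crime": "BURGLARY", "hour": 14},
    {"area": "Area A", "crime": "ROBBERY", "hour": 2},
    {"area": "Area B", "crime": "BURGLARY", "hour": 2},
]

profile = hourly_counts(records, "Area A", "BURGLARY")
print(profile)
```

Filling empty hours with zero keeps the time axis continuous, which matters when the profile is drawn as a line or bar chart.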

Task 3.5 Provide a rationale (drawing on relevant literature for good dashboard design) for the graphic design and functionality that is provided in your LAPD Crimes Event dashboard for the four specified crime event views required for Tasks 3.1, 3.2, 3.3 and 3.4 (About 750 words). Note Stephen Few is considered to be the guru of good dashboard design and has written a number of books on this topic. It is worth having a look at his website, in particular his examples of poorly designed dashboard views and his suggestions for better dashboard views.

For your Assignment 3 submission, you will need to submit your Task 3 Tableau workbook in .twbx format, which will contain your dashboard, four views and the associated data set, as a separate document together with your Assignment 3 Main Report in Word docx format.

Report presentation writing style and referencing (worth 10 marks)

Presentation: use of formatting, spacing, paragraphs, tables and diagrams, introduction, conclusion, table of contents

Writing style: Use of English (correct use of language and grammar; evidence of spell-checking and proofreading)

Referencing: Appropriate level of referencing in text where required, reference list provided, used Harvard Referencing Style correctly


Assignment 3 is a written practical report that combines theory work with technical work such as Tableau. It is worth 100 marks (40% weighting) and has a word count of 4000.

The Assignment 3 Report should be structured as follows:

Assignment 3 Cover page
Table of Contents
Task 1 Main Heading
Task 1 Sub Tasks: sub headings for Tasks 1.1, 1.2 and 1.3
Task 2
Task 2 Sub Tasks: sub headings for Task 2.1, 2.2, 2.3 and 2.4
Task 3
Task 3 Sub Tasks: sub headings for Task 3.1, 3.2, 3.3, 3.4 and 3.5
List of References
List of Appendices
