How the crisp-dm methodology

Reference no: EM132979021

Question: You are required to access a large data set and apply the CRISP-DM methodology to meaningfully clean, transform, analyse and evaluate it. As part of this process, you are required to subsequently apply one or more machine learning technique(s) of your choice to perform classification, association, numerical prediction and/or clustering tasks (or combinations thereof).

You will present the outcome of the above tasks in the form of a technical report containing the five sections listed in Table 1.

As shown in Table 1, a page limit of 10 pages is recommended. The report in total, however, must not exceed 13 pages (excluding title page, contents page, references, bibliography and appendices), with a minimum font size of 10 point. A penalty of a single grade will be incurred if you exceed the 13-page limit. Further information (supporting experimental results) can be added as appendices.

You are free to select the style of the report (i.e., section headings and format, etc.) although it must obviously address the content listed in Table 1.

Section	Weighting	Recommended Pages per Section
1. Introduction and Business Context	0.1	1
2. Data Selection and Pre-Processing	0.3	3
3. Machine Learning Method(s) and their Implementation	0.3	3
4. Evaluation of Results	0.2	2
5. Discussion	0.1	1

You are expected to submit the following electronic files to the designated repository by the submission deadline:
• Training, validation and test sets (before and after pre-processing). Note that if cross-validation is used, only the training and test sets are required;
The remainder of this section provides you with detailed requirements for each area of content.

PROJECT TITLE
Table of Contents
1. Introduction
2. Data Selection and Pre-processing
3. Machine Learning Method(s) and their Implementation
4. Evaluation of Results
5. Discussion
References
Bibliography
A. Appendix

1. Introduction
You should also look at how to use automatic referencing and citing (Harvard referencing style). You may want to create ‘sections’ (e.g. Insert -> Break -> Section break / continuous or next page) so that you can apply different formatting to different sections of the document. You should make use of sub-section headings (via relevant Styles) where appropriate.
Section 1 includes:
• A brief narrative as to how your organisation or company currently performs data analytics and how the CRISP-DM methodology may help your organisation to better meet its strategic priorities with respect to data analytics and business intelligence.
• A brief overview of the data analytic task you are going to perform.
• A clear justification as to why the task you are attempting is of value to your business or, more broadly, to industry, government, university research and/or the community. You should support your justification with references to the appropriate industry or academic literature.
• A statement of the insight you intend to gain.

2. Data Selection and Pre-processing
• Select a data set consisting of at least 2,000 observations/records and preferably above 10,000. You are strongly encouraged to identify an anonymised data set that is relevant to the strategic objectives of the business. However, if this is not possible then you are advised to select a data set from one of the following sources:
• Briefly describe your data set and reference its origin.
• If you have 15 or fewer attributes, tabulate them with attribute name, description and data type, and then show the minimum, average, maximum and standard deviation values for the training set and test set. For nominal variables, show the most and least frequently occurring nominal value(s). If you have more than 15 attributes, group the attributes into themes (e.g. customer, orders, employees) and describe the type of information and data types in each theme, including the number of each variable type (e.g. nominal, interval, ratio). You may want to highlight significant variables identified by an attribute selection algorithm.
• Briefly tabulate the following characteristics of the entire data set: number of instances, patterns per target class (if classification), and limitations such as possible conflicting patterns, missing values, and outliers/erroneous values.
• Explain how you have sampled your data to create the ‘in sample’ and ‘out of sample’ data sets. If you have used instance weightings to balance your data set(s), explain how the weightings were determined.
• Provide a statistical summary in tabular form for the resulting ‘in sample’ (training/validation) set and ‘out of sample’ (test) set. Also state whether or not there was any overlap in training and test set instances and, if so, justify why your test set is not compromised.
• What pre-processing and transformation was performed on the variables and why? (e.g. standardising numerical variables and/or using scaling; taking logs to reduce skewness, or log differences to reduce non-stationarity; converting numerical variables to discrete ones; converting numerical or symbolic patterns into bit patterns; removing patterns with missing or outlier values; adding noise or jitter to patterns to expand the data set; adding instance weightings or replicating certain pattern classes to improve class distributions; transforming time-series data into static training/test patterns).
• How did you ensure that your pre-processing did not compromise your test set (e.g. use of standardisation)?
• Consideration will also be given to the ‘curse of dimensionality’, its issues and how its impact can be reduced.
o If you reduced the number of dimensions (e.g. from 30 attributes to 10 attributes), how did you do this? Autoencoder? PCA? Filter using InfoGain measurement? A clusterer? How do these methods work and what are their advantages/disadvantages?
• If you increased the number of training instances, how did you do this?
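
As an illustration of the sampling and test-set points above, the following is a minimal sketch (not a prescribed method) of one way to create non-overlapping ‘in sample’ and ‘out of sample’ sets and to standardise numeric attributes using statistics computed from the training set only. The function names, the 80/20 split, the seed and the toy data are all assumptions made for the example; it uses only the Python standard library.

```python
import random
import statistics


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices once and split them so no row lands in both sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def fit_standardiser(train_rows):
    """Compute per-column mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has zero deviation; use 1.0 to avoid dividing by zero.
    stdevs = [statistics.stdev(col) or 1.0 for col in columns]
    return means, stdevs


def apply_standardiser(rows, means, stdevs):
    """Rescale rows with previously fitted means and standard deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, stdevs)] for row in rows]


# Hypothetical toy data: 10 records with two numeric attributes.
data = [[float(i), float(i * i)] for i in range(10)]
train, test = train_test_split(data)

# The scaling parameters come from the training set and are reused on the test set,
# so no information from the test set influences the pre-processing.
means, stdevs = fit_standardiser(train)
train_std = apply_standardiser(train, means, stdevs)
test_std = apply_standardiser(test, means, stdevs)
```

Fitting the scaler on the full data set before splitting would let test-set values influence the means and standard deviations, which is one common way a test set becomes compromised.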

3. Machine Learning Method(s) and their Implementation
• Clearly state the machine learning methods you will be using and the function(s) you will be expecting them to perform (e.g. classification, association, regression, clustering or combinations thereof for self-supervised learning). You must describe the expected ‘input to’ and ‘output from’ each model.
• Explain and justify the machine learning method(s) chosen for the task. You must also use a simple benchmark model with which to compare your chosen machine learning model(s) (e.g. benchmark a neural network trained with back-propagation against a simple OneR or Naive Bayes approach).
• Briefly highlight the strengths and weaknesses of the chosen learning method(s).
• Describe your ‘model fitting’ and ‘model selection’ process (e.g. leave-one-out validation, cross-validation, bagging and boosting). You must state and justify the hyper-parameters used for model fitting and how ‘over-training’ will be minimised.
• Describe what tool will be used to implement the machine learning method(s) (e.g. Weka/Java).
• You must either:
a) use advanced features of the chosen analytics tool including (though not limited to) clear evidence of meaningful programming/scripting activity to use machine learning and/or pre-processing tools in a bespoke way (e.g. install and use advanced Weka packages via Package Manager; examples might be: simple recurrent network, convolutional neural network, self-organizing maps, time series processing with ARMA models);
OR
b) provide an in-depth mathematical treatment of the chosen machine learning method(s) with a clear explanation as to how you will optimise them using the built-in features of the data analytics tool.
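
To make the idea of a simple benchmark model concrete, here is a hedged sketch of a OneR-style classifier written from scratch in Python. It is not the Weka implementation: it assumes nominal attributes only (numeric attributes would need discretising first), and the function names and the small weather-style data set are hypothetical. For each attribute it builds a rule mapping every attribute value to its most frequent class, then keeps the attribute whose rule makes the fewest training errors.

```python
from collections import Counter, defaultdict


def train_one_r(rows, labels):
    """Return (attribute_index, rule, default_label) with the fewest training errors."""
    default_label = Counter(labels).most_common(1)[0][0]
    best = None
    for attr in range(len(rows[0])):
        # Count the class labels observed for each value of this attribute.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[attr]][label] += 1
        # The rule predicts the majority class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    return attr, rule, default_label


def predict_one_r(model, row):
    """Predict a class label for one row using the single-attribute rule."""
    attr, rule, default_label = model
    # Fall back to the overall majority class for values unseen in training.
    return rule.get(row[attr], default_label)


# Hypothetical toy training set: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"),
        ("rainy", "mild"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]

model = train_one_r(rows, labels)
prediction = predict_one_r(model, ("sunny", "mild"))
```

A more complex model is only worth its extra cost if it beats a benchmark of this kind on the test set, which is why the comparison is required.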

4. Evaluation of Results
• Tabulate the resulting ‘in sample’ (training) and ‘out of sample’ (test) performance of your model for the different model configurations and trial runs (e.g. a neural net with different numbers of hidden nodes, different random starting weights and/or different learning rates). You should use at least one of the following performance metrics (as appropriate):
o Percent correct/incorrect
o Confusion matrix
o Recall and precision
o Evaluating numeric prediction (e.g. mean squared error (MSE), root mean squared error (RMSE), correlation coefficients)
o ROC curve
• Critically review the performance of the different models. Which type of pre-processing appeared to be most advantageous and why? For each model, which hyper-parameter settings (e.g. learning rate, tree pruning, momentum term) were most effective?
• Critically compare models: was there a model or model class whose performance on the test set was statistically significantly better than the other models/model classes (with a p-value < 0.05)? You may, for example, use the Experimenter in Weka.
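
Several of the metrics listed above can be computed directly from the actual and predicted values. The sketch below shows one possible standard-library implementation of percent correct, a confusion matrix, precision and recall for a chosen positive class, and RMSE. The function names and the toy predictions are assumptions made for illustration; an analytics tool such as Weka reports the same quantities.

```python
import math
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of instances classified correctly (percent correct / 100)."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_matrix(actual, predicted):
    """Count each (actual label, predicted label) pair."""
    return Counter(zip(actual, predicted))


def precision_recall(actual, predicted, positive):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN) for the positive class."""
    cm = confusion_matrix(actual, predicted)
    tp = cm[(positive, positive)]
    fp = sum(n for (a, p), n in cm.items() if p == positive and a != positive)
    fn = sum(n for (a, p), n in cm.items() if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rmse(actual, predicted):
    """Root mean squared error for numeric prediction."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical test-set labels and model outputs.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]

acc = accuracy(actual, predicted)
cm = confusion_matrix(actual, predicted)
precision, recall = precision_recall(actual, predicted, positive="yes")
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Reporting these for both the training and test sets, per model configuration, makes over-training visible as a gap between the two.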

5. Discussion

• Briefly summarise your task and your findings (i.e. whether the model learnt the problem).
• How do your findings relate to similar tasks found in the relevant industry or academic literature?
• Did you gain the insight you intended to? If not, what else could you do to enhance the usefulness of your analytics?
• How did you decide on the most appropriate machine learning method and what do you understand about appropriateness?
• Finally, briefly state how you are going to use the knowledge and skills you have developed in the module to further your professional ambitions and/or the strategic objectives of your organisation.

References
Harvard referencing style.
