What are the exploration tools available for interval variables

Reference no: EM131926479

Data Mining Project Assignment

Choice One: Data Analysis

In this project, you are asked to identify a dataset suitable for data mining, perform data mining tasks on it (such as classification, association analysis, or clustering), and report your results and observations.

The following is the step-by-step suggestion to finish the project and the report.

Step 1. Identify suitable datasets and application

To identify a suitable dataset, start with an application domain that interests you. Many datasets are publicly available, such as NBA performance data, climate data, intrusion detection benchmark data, manufacturing data, and public wish list data. The class notes contain a collection of websites (in the first self-learning slides), but you are encouraged to use a search engine to find your own datasets. Be prepared to spend substantial time preprocessing and exploring the datasets that you choose.

The data set needs to have at least 20 variables and at least 3,000 observations; OR at least 10 variables and at least 30,000 observations. You need to have both categorical and quantitative interval variables.
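The size rule above can be checked mechanically before committing to a dataset. The following is a minimal sketch, not part of the assignment: the function names are hypothetical, and treating text columns as categorical and numeric columns as interval is a simplifying assumption that you should verify against your own data dictionary.

```python
from numbers import Real


def meets_size_requirement(n_variables, n_observations):
    """Apply the rule: (>=20 vars and >=3,000 obs) OR (>=10 vars and >=30,000 obs)."""
    wide_enough = n_variables >= 20 and n_observations >= 3_000
    long_enough = n_variables >= 10 and n_observations >= 30_000
    return wide_enough or long_enough


def has_both_variable_types(columns):
    """columns maps a column name to its list of values.

    Assumption: an all-string column is categorical and an all-numeric
    column is an interval variable. Real data may need manual review.
    """
    def is_categorical(values):
        return bool(values) and all(isinstance(v, str) for v in values)

    def is_interval(values):
        return bool(values) and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        )

    has_categorical = any(is_categorical(v) for v in columns.values())
    has_interval = any(is_interval(v) for v in columns.values())
    return has_categorical and has_interval


# Made-up illustration data, not a real dataset.
example_columns = {"team": ["A", "B", "A"], "points": [101.0, 98.5, 110.0]}
size_ok = meets_size_requirement(n_variables=12, n_observations=45_000)
types_ok = has_both_variable_types(example_columns)
```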

What to turn in? The final report for this section should contain the following components:

a. An introduction of the application domain that you are interested in

b. A description of the dataset that you selected. It should include details such as the origin and size of the dataset, how the data is represented (e.g. graph, records, attributes), and statistics of the attribute values.

c. A description of any exploratory analysis that you performed on your dataset, and its results.

d. The raw and processed datasets (a brief summary and comparison).

e. A formal problem definition with what is given, what is the goal, and what are the constraints.

f. A plan for your data mining task.

Step 2. Perform data mining tasks on your dataset

In this phase, you will try the intended data mining tasks on your dataset. You can use SAS Enterprise Miner, write your own scripts or code, or combine the two to mine the data. Select one or more alternative methods to compare your method against. The alternative method(s) could be tree classification or logistic regression, trees with different maximum numbers of branches, clustering with different distance choices, etc. Compare the results of the different methods.
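One way to set up such a comparison is to hold the data and the evaluation fixed and vary only the method. The sketch below is an illustration under stated assumptions, not a prescribed approach: it swaps in a simple 1-nearest-neighbor classifier (chosen because it needs no external library) and compares two distance choices on made-up points. All names and data are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_1nn(train, point, distance):
    """Return the label of the training example closest to `point`."""
    nearest_features, nearest_label = min(
        train, key=lambda example: distance(example[0], point)
    )
    return nearest_label


def accuracy(train, test, distance):
    """Fraction of held-out examples whose label is predicted correctly."""
    hits = sum(predict_1nn(train, x, distance) == y for x, y in test)
    return hits / len(test)


# Made-up (features, label) pairs for illustration only.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
test = [((0.9, 1.1), "a"), ((5.2, 5.1), "b"), ((1.1, 0.9), "a")]

# Same data, same metric, different method setting.
results = {
    name: accuracy(train, test, fn)
    for name, fn in [("euclidean", euclidean), ("manhattan", manhattan)]
}
```

Keeping the train/test split identical across methods is the design point here: any difference in `results` is then attributable to the method rather than to the data.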

What to turn in? A report describing

a. Your method, e.g. the algorithm, the workflow and any other tasks that you performed.
b. Your experimental results.
c. Comparison of the experiment results of your method and the alternative methods.
d. Possible explanations for the experiment results.

Step 3. Make the conclusion

Summarize what you did and what you learned from these data mining tasks. Describe any future work you think is worthwhile.

The final report should be no longer than 25 pages (single-sided, single-spaced, 12-point font), so pick the most important information to include in the report. Points will be deducted if the report is too long.

Choice Two: Free Data Mining Software Evaluation

In this project, you are asked to choose a free data mining software package and write a report that evaluates it or explains how to use it. Choices of free data mining software and where to download them can be found in the Topic 1 folder in D2L.

The following is the step-by-step suggestion to finish the project and the report.

Step 1. Download the software

Include (but do not limit yourself to) the following in your report:

- Where to download the software?

- Are there any operating system requirements? Can it run on both Mac and Windows machines, etc.?

- How large is the package?

- In general, are the download and installation straightforward? Does anything need attention during download and installation? If yes, provide step-by-step guidelines.

- The interface (the look) of the software after installation, and a general description of each component.

- Anything else you think shows the characteristics of the software.

Step 2. Choose a data set and import it into the software

You can use any data set we used in this class for your project, including the IRIS data. Note that the IRIS data is so popular that it may already exist as one of the built-in data sets in the software.

Include (but do not limit yourself to) the following in your report:

- The requirement for the data format, or structure.

- Does the software support mining very large databases? What is the maximum amount of data the software can handle (maximum sample size, maximum number of observations, maximum number of variables, maximum number of levels/categories for a class variable, etc.)?

- Does it support multiple formats? Or is it easy to transform other formats into the formats the software requires?

- Briefly introduce how to import data into the software. If any format transformation is needed, please also explain how to do it.

- How to set up the modeling rules and measurement levels for variables?

Step 3. Data Exploration

Include (but do not limit yourself to) the following in your report:

- Does the software support graphs, maps, tables, rotation, etc.?
- What are the exploration tools available for interval variables?
- What are the exploration tools available for class variables?
- Illustrate some explorations based on your data (refer to homework 2, problem 2).
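As a point of reference for what exploration tools typically report, the sketch below computes summary statistics for an interval variable and a frequency table for a class variable using only the Python standard library. It is an illustration, not the output of any particular package: the function names are hypothetical and the values are made up, loosely in the style of the IRIS data.

```python
import statistics
from collections import Counter


def describe_interval(values):
    """Summary statistics commonly shown for an interval variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def describe_class(values):
    """Frequency table for a class variable: level -> (count, proportion)."""
    counts = Counter(values)
    total = len(values)
    return {level: (count, count / total) for level, count in counts.most_common()}


# Made-up values for illustration only.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1]
species = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 2

interval_summary = describe_interval(petal_length)
class_summary = describe_class(species)
```

When evaluating a package, it can be useful to note which of these quantities it reports by default and which require extra configuration.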

Step 4. Data Preparation

Include (but do not limit yourself to) the following in your report:

- How does the software identify missing, inconsistent, or incorrect data?
- How does the software fix the above problems?
- How does the software perform data conversion and transformations?
- How does the software assist with the sampling process?
- How does the software assist with selection of independent variables (before any modeling)?
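To make the preparation questions concrete, here is a minimal sketch of three of the operations a package might automate: counting missing values, filling them with the mean, and drawing a simple random sample. The missing-value markers, the choice of mean imputation, and all function names are assumptions made for illustration; real software may use different conventions and offer other strategies.

```python
import random
import statistics

# Assumption: these markers denote a missing value in the raw data.
MISSING = (None, "", "NA")


def count_missing(rows, field):
    """Number of rows whose `field` is absent or holds a missing marker."""
    return sum(1 for row in rows if row.get(field) in MISSING)


def impute_mean(rows, field):
    """Return new rows with missing `field` values replaced by the observed mean."""
    observed = [row[field] for row in rows if row.get(field) not in MISSING]
    fill = statistics.mean(observed)
    return [
        {**row, field: fill if row.get(field) in MISSING else row[field]}
        for row in rows
    ]


def simple_random_sample(rows, fraction, seed=0):
    """Draw a reproducible simple random sample without replacement."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)


# Made-up records for illustration only.
raw = [{"age": 30}, {"age": None}, {"age": 50}, {"age": "NA"}]
n_missing = count_missing(raw, "age")
cleaned = impute_mean(raw, "age")
sample = simple_random_sample(cleaned, fraction=0.5)
```

Fixing the seed keeps the sample reproducible, which makes it easier to compare the raw and processed datasets in a report.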

Step 5. Modeling

Run at least ONE model for the data, and include the following in your report if it is appropriate for the model you illustrate:

- Does the software support the major prediction (tree, regression, neural network, nearest neighbor, etc.) and description (clustering, principal components, etc.) approaches?

- How many data mining techniques are supported?

- A detailed illustration of how to run the model you choose: how to set up the parameters, how to run the model, how to get the results, and how the results and running time compare to Enterprise Miner.

- How is the scoring process (scoring means applying the model on a new data set) performed in the software?

If you built more than one model, also consider the following in your report:

- Does the software support model comparison? If yes, how?
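Scoring and model comparison can be illustrated without any particular package. In the sketch below, a "model" is simply a function from a record to a predicted label; scoring applies it to new records, and comparison tabulates accuracy and confusion counts on records whose true labels are known. The two rule-based models, the field names, and the data are all hypothetical stand-ins for whatever models your software fits.

```python
from collections import Counter


def score(model, new_rows):
    """Scoring: apply a fitted model to records it has not seen."""
    return [model(row) for row in new_rows]


def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the true label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def confusion_counts(actual, predicted):
    """Counts keyed by (actual label, predicted label)."""
    return Counter(zip(actual, predicted))


# Two hypothetical rule-based models standing in for fitted models.
def model_a(row):
    return "high" if row["income"] > 50 else "low"


def model_b(row):
    return "high" if row["age"] > 40 else "low"


# Made-up new records with known labels, for illustration only.
new_rows = [
    {"income": 60, "age": 35},
    {"income": 40, "age": 45},
    {"income": 80, "age": 50},
    {"income": 30, "age": 25},
]
actual = ["high", "low", "high", "low"]

comparison = {}
for name, model in [("model_a", model_a), ("model_b", model_b)]:
    predicted = score(model, new_rows)
    comparison[name] = {
        "accuracy": accuracy(actual, predicted),
        "confusion": confusion_counts(actual, predicted),
    }
```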

Step 6. Conclusion and some additional evaluation

Overall, is the software easy to learn? Is it easy to use? How does it compare to Enterprise Miner in general? Would you recommend it to new data miners? ....

The final report should be no longer than 25 pages (single-sided, single-spaced, 12-point font).

About the Presentation

If you decide to present on SAS day, it will be a poster presentation. Please contact Dr. Priestley as soon as possible to reserve a slot and to get suggestions about the format of the poster.

Other teams should prepare slides for an in-class presentation about what they did.

We will discuss in class how long you have for the presentation. Every person should present part of the project. You will get a separate grade for the presentation. Please refer to the syllabus for the grading rubrics.

Some suggestions for the slides

1. Include a clear outline/agenda/schedule

2. Avoid using more than six lines of text and minimize the number of words on each visual aid

3. Simple is better; avoid unnecessary formatting. We are interested in your technical content, not your PowerPoint skills.

4. Put your company or university logo on your title slide only; this is a technical presentation to your peers, not a marketing pitch to a customer

5. Use spell check.

6. Avoid flashy, Christmas-light-style color schemes and other distractions

Some general suggestions for presentations:

• A fast presentation is one slide per minute. A more relaxed pace would be two minutes per slide.

• Practice the presentation. The grading rubrics in the syllabus describe what is expected of an outstanding presentation.

• Time your practice sessions to ensure you keep within your allotted time. Remember a team has a time limit and points will be deducted if the presentation is too long.

• Never read the slides verbatim.
