Reference no: EM133770271
Assessment topic
Data understanding and knowledge-based schemes
Task details:
Advertisements on internet pages can be an inconvenient distraction from the actual content being conveyed by a website. Moreover, some advertisements can lure you to fake websites that carry malicious viruses.
In this assessment you will be working on a dataset that represents a set of possible advertisements on internet pages. The original dataset has been sourced from the UC Irvine Machine Learning Repository (Kushmerick, 1998). The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The goal is to predict whether an image is an advertisement ("ad") or not ("nonad"). The data has missing values that need to be handled.
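One common way to handle the missing values is median imputation. The sketch below is illustrative only: it assumes missing entries are marked with "?" and that rows have been read as dictionaries of strings, both of which should be checked against the actual file. The helper name `impute_median` and the toy rows are hypothetical.

```python
from statistics import median

MISSING = "?"  # assumed missing-value marker; verify against the real file


def impute_median(rows, column):
    """Replace missing entries in `column` with the median of the observed
    values, convert the column to float, and return the median used."""
    observed = [float(r[column]) for r in rows if r[column].strip() != MISSING]
    fill = median(observed)
    for r in rows:
        value = r[column].strip()
        r[column] = fill if value == MISSING else float(value)
    return fill


# Toy rows (made-up values) using the three continuous feature names.
rows = [
    {"height": "60", "width": "468", "aratio": "7.8"},
    {"height": "?", "width": "100", "aratio": "?"},
    {"height": "125", "width": "?", "aratio": "1.0"},
    {"height": "90", "width": "90", "aratio": "1.0"},
]
fills = {col: impute_median(rows, col) for col in ("height", "width", "aratio")}
```

When a training/test split is used, the median would normally be computed on the training rows only and then applied to the test rows, so that no information leaks from the test set.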
You will need to prepare the data for mining and perform an exploratory data analysis. The data mining task is to predict whether an image is an advertisement ("ad") or not ("nonad"). An explicit training/test split is not provided so you need to determine a reasonable way of assessing performance. The dataset also has several hundred features (attributes). You need to perform feature reduction to significantly reduce the number of features. Implement at least two different classifiers.
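Because no split is provided, one reasonable option is a stratified hold-out split, which keeps the "ad"/"nonad" proportions the same in both parts. The following is a minimal standard-library sketch; the function name `stratified_split` and the toy class counts are hypothetical, and k-fold cross-validation would be an equally valid choice.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class contributes roughly
    the same fraction of its examples to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Keep at least one test example per class.
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with an imbalance similar in spirit to the brief (made-up counts).
labels = ["ad"] * 20 + ["nonad"] * 80
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)
```

Libraries such as scikit-learn offer the same idea through `train_test_split` with its `stratify` argument.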
Challenges: The classes are imbalanced, with unequal numbers of examples per class. The number of attributes is also very high compared to the size of the dataset, hence feature reduction is required. One or more of the three continuous features have missing data.
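With imbalanced classes, accuracy alone can be misleading, so it may help to report precision, recall and F1 for the "ad" class as well. The sketch below uses made-up predictions from two hypothetical models to show that equal accuracy can hide very different behaviour on the minority class; the function name `ad_class_metrics` is an assumption for illustration.

```python
def ad_class_metrics(y_true, y_pred, positive="ad"):
    """Compute accuracy plus precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Made-up ground truth: 2 ads among 10 images.
y_true = ["ad", "ad"] + ["nonad"] * 8
# Hypothetical model A always predicts the majority class.
pred_a = ["nonad"] * 10
# Hypothetical model B finds one ad but also raises one false alarm.
pred_b = ["ad", "nonad", "ad"] + ["nonad"] * 7

metrics_a = ad_class_metrics(y_true, pred_a)
metrics_b = ad_class_metrics(y_true, pred_b)
```

Both toy models reach the same accuracy, but only the second one detects any advertisement, which shows up in its recall and F1.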
The main goal of this project is to build a machine learning model that, given a set of suitable features, will predict whether the image is an advertisement or not. You may limit the initial set of features to features that encode the geometry of the image (if available) as well as phrases occurring in the URL and the image's URL. A truncated file is available on Moodle. You will still need to perform feature reduction on this dataset.
Dataset Description
There are four image features:
height: continuous
width: continuous
aratio: aspect ratio. continuous.
local: image location 0,1.
There are 457 features for url terms and the values are 0 or 1.
There are 495 features for origurl terms and the values are 0 or 1.
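With several hundred 0/1 term features, one simple first step in feature reduction is to drop columns that are almost constant, since they carry little information. The sketch below is one possible filter, not a prescribed method; the function names, the threshold and the toy matrix are hypothetical. Techniques such as chi-squared feature selection or PCA are alternatives worth comparing.

```python
def low_variance_columns(rows, min_fraction=0.01):
    """Return indices of binary columns whose less frequent value appears
    in fewer than `min_fraction` of the rows."""
    n = len(rows)
    drop = []
    for j in range(len(rows[0])):
        ones = sum(r[j] for r in rows)
        minority = min(ones, n - ones)
        if minority / n < min_fraction:
            drop.append(j)
    return drop


def drop_columns(rows, drop):
    """Return a copy of `rows` without the columns listed in `drop`."""
    dropped = set(drop)
    return [[v for j, v in enumerate(r) if j not in dropped] for r in rows]


# Toy 0/1 matrix: column 0 is all zeros and column 2 is all ones.
rows = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]
drop = low_variance_columns(rows, min_fraction=0.2)
reduced = drop_columns(rows, drop)
```

On the real data the threshold would need tuning, and the columns to drop should be chosen from the training portion only.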
Students are recommended to use the following report structure, which addresses the marking rubric criteria:
Introduction: introduces the case study and the objective
Data understanding: Visualize the data. Explain the data. What preparation methods are required? Which features play an important role in your prediction model?
Model implementation and evaluation: Show screenshots of your implemented models, followed by an explanation. How would you improve the performance of your models? Compare the models against each other.
Insights: Discuss the output of the implemented models. Explain what knowledge and insight they give about the examples in the dataset.
Conclusion: Summarise the model implementation and the results obtained. Highlight the key points from the insights.
References: Follow a consistent format and use the SISTC recommended referencing style (refer to the unit outline).