Assessment: Individual problem-solving task

Learning Outcomes

This assessment assesses the following Unit Learning Outcomes (ULO) and related Graduate Learning Outcomes (GLO):

ULO 1: Apply suitable clustering/dimensionality reduction techniques to perform unsupervised learning of data in a real-world

In this assignment, you need to demonstrate your skills in data clustering and dimensionality reduction. The assignment has two parts.

This is an individual assessment task of at most 20 pages, including all relevant material, graphs, images and tables. Students are required to respond to a series of problem situations related to their analysis techniques. They are also required to provide evidence through articulation of the scenario and application of programming skills and analysis techniques, and to provide a rationale for their responses.

Task A - Clustering
Download the BBC sports dataset from the Cloud. This dataset consists of 737 documents from the BBC Sport website, corresponding to sports news articles in five topical areas from 2004-2005. There are 5 class labels: athletics, cricket, football, rugby and tennis. The original dataset and raw text files can be downloaded from here.

1. There are 3 files in the dataset, corresponding to the feature matrix, the class labels and the term dictionary. You need to read these files in a Python notebook and store them in the variables X, trueLabels, and terms.
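
As a starting point for step 1, here is a minimal sketch of a loader. It rests on assumptions that the brief does not confirm and that you should check against the downloaded files: the feature matrix is taken to be in Matrix Market format with one row per term and one column per document, the labels file is taken to hold one "document index, class id" pair per line, and the term dictionary is taken to list one term per line. The function name load_bbcsport and its parameters are hypothetical.

```python
from pathlib import Path

import numpy as np
from scipy.io import mmread
from scipy.sparse import csr_matrix


def load_bbcsport(mtx_path, classes_path, terms_path):
    """Return (X, trueLabels, terms) read from the three dataset files."""
    # Assumption: the matrix file is terms x documents, so transpose it
    # to get the usual documents x terms layout.
    X = csr_matrix(mmread(str(mtx_path))).T.tocsr()

    # Assumption: each data line is "<doc_index> <class_id>";
    # blank lines and lines starting with "%" are skipped as comments.
    labels = []
    for line in Path(classes_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue
        labels.append(int(line.split()[-1]))
    trueLabels = np.array(labels)

    # Assumption: one term per line, in the same order as the matrix rows.
    terms = [t.strip() for t in Path(terms_path).read_text().splitlines() if t.strip()]

    # Sanity check: one label per document and one term per feature column.
    if X.shape != (len(trueLabels), len(terms)):
        raise ValueError(
            f"shape mismatch: X is {X.shape}, "
            f"{len(trueLabels)} labels, {len(terms)} terms"
        )
    return X, trueLabels, terms
```

Call it with the paths to your three downloaded files, for example X, trueLabels, terms = load_bbcsport(matrix_file, labels_file, terms_file). If the shape check fails, the orientation or file-format assumptions above probably do not match your copy of the data.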

2. Next, perform K-means clustering with 5 clusters using Euclidean distance as the similarity measure. Evaluate the clustering performance using the adjusted Rand index and adjusted mutual information. Report the clustering performance averaged over 50 random initializations of K-means.
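
One way to organise step 2 is sketched below with scikit-learn. This is an illustration under stated assumptions, not a prescribed solution: it treats "50 random initializations" as 50 separate single-start runs with different seeds, and the helper name evaluate_kmeans and its return format are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score


def evaluate_kmeans(X, true_labels, n_clusters=5, n_runs=50):
    """Average ARI and AMI over n_runs randomly initialised K-means runs."""
    ari, ami = [], []
    for seed in range(n_runs):
        # n_init=1 so that each seed corresponds to exactly one random start.
        km = KMeans(n_clusters=n_clusters, init="random", n_init=1,
                    random_state=seed)
        pred = km.fit_predict(X)
        ari.append(adjusted_rand_score(true_labels, pred))
        ami.append(adjusted_mutual_info_score(true_labels, pred))
    return {
        "ari_mean": float(np.mean(ari)),
        "ari_std": float(np.std(ari)),
        "ami_mean": float(np.mean(ami)),
        "ami_std": float(np.std(ami)),
    }
```

Reporting the standard deviation alongside the mean is optional, but it shows how sensitive K-means is to its starting point on this data.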

3. Repeat K-means clustering with 5 clusters using a similarity measure other than Euclidean distance. Evaluate the clustering performance over 50 random initializations of K-means using the adjusted Rand index and adjusted mutual information. Report the clustering performance and compare it with the results obtained in step 2.
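
For step 3 the brief leaves the choice of measure open. The sketch below assumes cosine similarity, which is only one possible choice. scikit-learn's KMeans supports Euclidean distance only, so the sketch L2-normalises each document vector first: for unit vectors, squared Euclidean distance equals 2 - 2·cos, so assignments then follow cosine similarity. The centroids are not re-normalised after each update, so this approximates spherical K-means rather than implementing it exactly. The helper name evaluate_cosine_kmeans is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score
from sklearn.preprocessing import normalize


def evaluate_cosine_kmeans(X, true_labels, n_clusters=5, n_runs=50):
    """Average ARI and AMI for K-means run on L2-normalised rows of X."""
    # Unit-length rows make Euclidean K-means rank points by cosine similarity.
    X_unit = normalize(X, norm="l2", axis=1)
    ari, ami = [], []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_clusters, init="random", n_init=1,
                    random_state=seed)
        pred = km.fit_predict(X_unit)
        ari.append(adjusted_rand_score(true_labels, pred))
        ami.append(adjusted_mutual_info_score(true_labels, pred))
    return {"ari_mean": float(np.mean(ari)), "ami_mean": float(np.mean(ami))}
```

Because document length strongly affects raw term counts, comparing these scores with the plain Euclidean results gives a concrete basis for the comparison the step asks for.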

4. For both clustering cases (Euclidean distance and the other similarity measure), visualize the cluster centres as tag clouds using the Python package WordCloud.
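
A possible approach to step 4 is to turn each cluster centre into a term-to-weight mapping and pass it to WordCloud. The sketch below assumes the centres come from a fitted model (for example the cluster_centers_ attribute of scikit-learn's KMeans) and that terms is in the same order as the feature columns. The helper names centre_frequencies and plot_centre_clouds, and the top_n cut-off, are hypothetical.

```python
import numpy as np


def centre_frequencies(centre, terms, top_n=50):
    """Map the top_n highest-weighted terms of one cluster centre to their weights."""
    centre = np.asarray(centre, dtype=float).ravel()
    top = np.argsort(centre)[::-1][:top_n]  # indices of the largest weights first
    # Keep only strictly positive weights, since WordCloud expects frequencies.
    return {terms[i]: float(centre[i]) for i in top if centre[i] > 0}


def plot_centre_clouds(centres, terms, top_n=50):
    """Draw one tag cloud per cluster centre, side by side."""
    # Imported here so centre_frequencies works without the plotting packages.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    fig, axes = plt.subplots(1, len(centres),
                             figsize=(4 * len(centres), 4), squeeze=False)
    for k, (ax, centre) in enumerate(zip(axes[0], centres)):
        freqs = centre_frequencies(centre, terms, top_n)
        cloud = WordCloud(width=400, height=400, background_color="white")
        cloud.generate_from_frequencies(freqs)
        ax.imshow(cloud.to_array(), interpolation="bilinear")
        ax.set_title(f"Cluster {k}")
        ax.axis("off")
    return fig
```

Calling plot_centre_clouds once with the Euclidean centres and once with the centres from the other similarity measure gives the two sets of tag clouds to compare.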

Task B - Dimensionality Reduction using PCA/SVD

For the provided BBC sports dataset, perform PCA and plot the captured variance with respect to increasing latent dimensionality. What is the minimum dimension that captures (a) at least 95% variance and (b) at least 98% variance?
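
The variance question can be answered from the cumulative explained-variance ratio. The sketch below is one way to do this with scikit-learn's PCA, assuming the feature matrix is small enough to convert to a dense array (PCA centres the data, which a sparse matrix does not support directly). The helper names cumulative_variance, min_components and plot_variance are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def cumulative_variance(X):
    """Cumulative fraction of variance captured by the first 1, 2, ... components."""
    dense = X.toarray() if hasattr(X, "toarray") else np.asarray(X, dtype=float)
    pca = PCA().fit(dense)  # keeps min(n_samples, n_features) components
    return np.cumsum(pca.explained_variance_ratio_)


def min_components(cumulative, threshold):
    """Smallest number of components whose cumulative variance reaches threshold."""
    # searchsorted returns the first index with cumulative[index] >= threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))


def plot_variance(cumulative, thresholds=(0.95, 0.98)):
    """Plot captured variance against the number of latent dimensions."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(np.arange(1, len(cumulative) + 1), cumulative)
    for t in thresholds:
        ax.axhline(t, linestyle="--", linewidth=0.8)
    ax.set_xlabel("Number of principal components")
    ax.set_ylabel("Cumulative explained variance")
    return fig
```

With cum = cumulative_variance(X), the two answers are min_components(cum, 0.95) and min_components(cum, 0.98).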

Assessment Task 1: Individual problem-solving rubric

Each criterion is marked at one of four levels: Excellent (5 marks), Good (3 marks), Fair (2 marks) or Unsatisfactory (0 marks).

Criteria 1:
* Read the files corresponding to the feature matrix, class labels and the term dictionary and store them in the variables X, trueLabels and terms using a Python notebook.
Excellent (5 marks): Successfully read all files and stored them in the corresponding variables using a Python notebook.
Good (3 marks): Partially achieved the goal, missing the reading or storing of one file or variable.
Fair (2 marks): Only able to either read files or create variables in Python to store any value.
Unsatisfactory (0 marks): Failed to read and store using a Python notebook.

Criteria 2:
* Perform K-means clustering with 5 clusters using Euclidean distance as the similarity measure.
* Evaluate the clustering performance using the adjusted Rand index and adjusted mutual information.
* Report the clustering performance averaged over 50 random initializations of K-means.
Excellent (5 marks): Successfully completed all three tasks.
Good (3 marks): Successfully completed any two of the three tasks.
Fair (2 marks): Successfully completed only one of the three tasks.
Unsatisfactory (0 marks): Failed to complete any given task.

Criteria 3:
* Repeat K-means clustering with 5 clusters using a similarity measure other than Euclidean distance.
* Evaluate the clustering performance over 50 random initializations of K-means using the adjusted Rand index and adjusted mutual information.
* Report the clustering performance and compare it with the results obtained in step 2.
Excellent (5 marks): Successfully completed all three tasks.
Good (3 marks): Successfully completed any two of the three tasks.
Fair (2 marks): Successfully completed any one of the three tasks.
Unsatisfactory (0 marks): Failed to complete any given task.

Criteria 4:
* For both clustering cases (Euclidean distance and the other similarity measure reported in the previous two tasks), visualise the cluster centres as tag clouds using the Python package WordCloud.
Excellent (5 marks): Successfully used the WordCloud package to visualise the cluster centres using at least two different similarity measures.
Good (3 marks): Successfully used the WordCloud package to visualise the cluster centres using at least one similarity measure.
Fair (2 marks): Demonstrated knowledge of the WordCloud package and visualisation, but could not use them successfully.
Unsatisfactory (0 marks): Failed to show any evidence of knowledge of the WordCloud package and visualisation.

Part 2:
For the provided BBC sports dataset:
* Perform PCA.
* Plot the captured variance with respect to increasing latent dimensionality.
* What is the minimum dimension that captures (a) at least 95% variance and (b) at least 98% variance?
Excellent (5 marks): Successfully completed all three tasks.
Good (3 marks): Successfully completed any two of the three tasks.
Fair (2 marks): Successfully completed any one of the three tasks.
Unsatisfactory (0 marks): Failed to complete any given task.


This document supplies detailed information on assessment tasks for this unit.

Key information
• Due: 22 by 11.30pm AEST
• Weighting: 25%
• Word count: max 20 pages including all relevant material, graphs, images and tables

ULO 1: Apply suitable clustering/dimensionality reduction techniques to perform unsupervised learning of data in a real-world

GLO 1: Discipline knowledge and capabilities
GLO 3: Digital literacy
GLO 4: Critical thinking
GLO 5: Problem solving

