Reference no: EM132361630
Exercise 1 - Write some code to create a list named 'categories' that lists unique categories sorted alphabetically from the 'category' column of the data.
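One way to build this list, as a minimal sketch only. It assumes 'data' is a pandas DataFrame whose 'category' column holds comma-separated strings. The three-row frame below is made-up sample data, not the real dataset, and stripping whitespace around each name is an assumption about how the entries are formatted.

```python
import pandas as pd

# Hypothetical sample standing in for the real 'data' DataFrame.
data = pd.DataFrame(
    {"category": ["Economic,Fantasy", "Card Game, Fantasy", "Economic"]}
)

# Split each entry on commas, strip stray whitespace, and collect unique names.
unique_categories = set()
for entry in data["category"].dropna():
    for name in entry.split(","):
        name = name.strip()
        if name:
            unique_categories.add(name)

# Sort alphabetically to get the final list.
categories = sorted(unique_categories)
print(categories)
```

Collecting names in a set first keeps the uniqueness step separate from the sort, so the final list is alphabetical regardless of the order in which games appear.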
Exercise 2 - Write a function
```python
def get_normalized_category_vector(game_categories, categories):
```
that takes two inputs
1. 'game_categories', a string of comma separated categories (think of this input as an entry from the 'category' column of the data); and
2. 'categories', a list of alphabetically sorted categories created in Exercise 1;
and returns a _normalized category vector_, defined below, as a 1-D numpy array.
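A _category vector_ is defined as a vector of 1's and 0's, with one entry per category in 'categories', where an entry is 1 if the board game has the corresponding category as one of its categories, or 0 otherwise. A _normalized category vector_ is a category vector scaled to unit length (divided by its Euclidean norm), as in the example vectors of Exercise 3.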
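The definition above can be sketched as follows. This is an illustration under two assumptions: "normalized" means scaled to unit Euclidean length (which matches the example vectors in Exercise 3), and category names may carry stray whitespace around the commas. The three-item 'categories' list is hypothetical sample input.

```python
import numpy as np

def get_normalized_category_vector(game_categories, categories):
    """Return the unit-length 0/1 category vector for one game."""
    # Names attached to this game; whitespace stripping is an assumption.
    present = {name.strip() for name in game_categories.split(",")}
    # 1.0 where the game has the category, 0.0 otherwise.
    vector = np.array([1.0 if c in present else 0.0 for c in categories])
    norm = np.linalg.norm(vector)
    # Guard against an all-zero vector (a game with no known category).
    return vector / norm if norm > 0 else vector

categories = ["Card Game", "Economic", "Fantasy"]  # hypothetical list
v = get_normalized_category_vector("Economic,Fantasy", categories)
print(v)
```

With two matching categories, each non-zero entry equals 1/sqrt(2), so the vector has length 1.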
Exercise 3 - Write a function
```python
def get_similarity_score(v1, v2):
```
that takes two normalized category vectors (as 1-D numpy arrays) as inputs and returns a _cosine similarity score_ as an output.
The cosine similarity of two normalized vectors is their dot product. As an example:
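```python
from math import sqrt
import numpy as np
v1 = np.array([0, 1/sqrt(2), 0, 0, 0, 1/sqrt(2), 0])
v2 = np.array([1/sqrt(3), 1/sqrt(3), 0, 0, 0, 0, 1/sqrt(3)])
# Compare with a tolerance: exact float equality can fail due to rounding.
assert np.isclose(get_similarity_score(v1, v2), 1/sqrt(6))
```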
If you feel you need more details, see: https://en.wikipedia.org/wiki/Cosine_similarity.
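A minimal sketch of such a function, relying on the statement above that the cosine similarity of two normalized vectors is their dot product. It assumes both inputs are already unit length and does not re-normalize them.

```python
import numpy as np

def get_similarity_score(v1, v2):
    """Cosine similarity of two vectors that are already unit length."""
    # For unit vectors the cosine similarity reduces to the dot product.
    return float(np.dot(v1, v2))

v1 = np.array([0, 1 / np.sqrt(2), 0, 0, 0, 1 / np.sqrt(2), 0])
v2 = np.array([1 / np.sqrt(3), 1 / np.sqrt(3), 0, 0, 0, 0, 1 / np.sqrt(3)])
score = get_similarity_score(v1, v2)
# Compare with a tolerance, since floating-point products are rarely exact.
print(np.isclose(score, 1 / np.sqrt(6)))
```

Only the second entry is non-zero in both vectors, so the score is (1/sqrt(2)) * (1/sqrt(3)) = 1/sqrt(6).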
Exercise 4 - Write some code to create a sparse CSR matrix named 'game_graph' that represents a game graph as described previously.
A few points to note:
1. The input dataset, named 'data', has 4999 games.
2. Take the index of a game in the input dataframe to be the game's index. Index 0 of the input dataframe should also correspond to row 0 and column 0 of the output sparse matrix.
3. You will need to calculate the normalized category vector for each game.
4. You will then need to find the similarity between each pair of games.
5. The final output **game_graph** should be a 4999x4999 CSR sparse matrix.
A few more points to note:
1. A 4999x4999 matrix is fairly large (about 25 million entries).
2. The 4999 normalized category vectors, each of size 1x84, also form a large matrix (4999x84) when stacked.
3. Be cautious when using Python for loops over dense numpy arrays, as they will take a considerable amount of time to run.
4. Storing these large matrices in a sparse matrix format would improve performance significantly.
5. For efficiency's sake, sparse matrix operations like 'vstack()', 'transpose()', and 'dot()' may prove convenient.
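The points above can be sketched with `scipy.sparse` as follows. This is an illustration only: the three-game input, the 'categories' list, and the helper name `to_sparse_row` are hypothetical, and normalization is assumed to mean unit Euclidean length. The earlier description of the game graph is not reproduced here, so whether self-similarities on the diagonal should be kept, or a similarity threshold applied, is left open.

```python
import numpy as np
from scipy import sparse

# Hypothetical inputs standing in for the real data and the Exercise 1 output.
categories = ["Card Game", "Economic", "Fantasy"]
game_entries = ["Economic,Fantasy", "Card Game,Fantasy", "Economic"]

def to_sparse_row(game_categories, categories):
    """Normalized category vector as a 1 x len(categories) CSR row."""
    present = {name.strip() for name in game_categories.split(",")}
    row = np.array([[1.0 if c in present else 0.0 for c in categories]])
    norm = np.linalg.norm(row)
    return sparse.csr_matrix(row / norm if norm > 0 else row)

# Stack one sparse row per game: shape (n_games, n_categories).
vectors = sparse.vstack(
    [to_sparse_row(g, categories) for g in game_entries], format="csr"
)

# All pairwise dot products in one sparse product: shape (n_games, n_games).
game_graph = vectors.dot(vectors.transpose()).tocsr()

print(game_graph.shape)
print(game_graph.toarray().round(3))
```

Because row i of `vectors` comes from entry i of the input, row and column i of `game_graph` refer to the same game, and the single matrix product replaces a nested Python loop over all pairs.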