Reference no: EM132009142
Part A - Introduction To Python For Data Analysis Homework
Instructions - In the cell below, please complete the function. The function takes two arguments. The first argument is the names data frame that we have used in class; the second argument is a name. The function should return a new data frame that contains only the rows where the 'Name' column equals the name argument.
You would not normally wrap this code inside a function, because it is too simple to need one, but putting the code inside a function assists the grader.
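As a rough illustration of the kind of function being asked for, the following is a minimal sketch, assuming the data frame is a pandas DataFrame with a 'Name' column. The function name `filter_by_name` and the parameter names are hypothetical; the notebook's own cell may use different ones.

```python
import pandas as pd


def filter_by_name(names, name):
    """Return a new data frame holding only the rows whose 'Name' equals `name`.

    `names` is assumed to be a pandas DataFrame with a 'Name' column;
    the function and parameter names are placeholders, not the grader's.
    """
    # The comparison builds a boolean mask; indexing with it keeps matching rows.
    # .copy() makes the result independent of the original frame.
    return names[names["Name"] == name].copy()
```

Calling it with a name that does not occur in the column returns an empty data frame with the same columns, rather than raising an error.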
Part B - Assignment
Overview - In this notebook, we will go over examples of running Spark and work through some exercises:
1. KDnuggets Tutorial
2. WordCount Exercise
3. K-means Example (Optional)
This notebook was tested on the AWS EC2 Jupyter interface using the UCI BIG DATA AMI.
Question 1 - Sort the tally by year.
Question 2 - Get all professions and counts.
Question 3 - Using Spark, get the 20 most common "lowercased" words (don't count stopwords).
Hint: Suggested pseudocode (you are welcome to write your own):
1. Define a function "findWord" that takes a line as input and returns the words and their counts (if a word is a stopword, don't count it). Use the regex code and stopwords from the Assignment 4 solution.
2. Define a count RDD as: a. flatMap(findWord) b. aggregate by key, adding the counts c. switch each (key, value) pair to a (value, key) pair d. use the transformation sortByKey
3. Collect the first 20 elements of the count RDD
4. Collect the last 20 elements of the count RDD
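The suggested pseudocode could be sketched in PySpark roughly as follows. This is only one possible reading of the hint: the stopword set and the regex below are stand-ins for the Assignment 4 versions, and `count_by_frequency` and `most_and_least_common` are hypothetical helper names. The `lines` argument is assumed to be an RDD of text lines, for example one created with `sc.textFile(...)`.

```python
import re
from operator import add

# Placeholder stopword list; substitute the list from the Assignment 4 solution.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

# Assumed tokenizer: runs of ASCII letters. Substitute the Assignment 4 regex.
WORD_RE = re.compile(r"[a-z]+")


def find_word(line):
    """Return a (word, 1) pair for every lowercased non-stopword in the line."""
    return [(w, 1) for w in WORD_RE.findall(line.lower()) if w not in STOPWORDS]


def count_by_frequency(lines):
    """Turn an RDD of text lines into an RDD of (count, word) pairs."""
    return (lines.flatMap(find_word)                   # step 2a: emit (word, 1)
                 .reduceByKey(add)                     # step 2b: sum counts per word
                 .map(lambda kv: (kv[1], kv[0])))      # step 2c: swap to (count, word)


def most_and_least_common(lines, n=20):
    """Return the n most common and n least common (count, word) pairs."""
    swapped = count_by_frequency(lines).cache()        # reused for both sorts
    most = swapped.sortByKey(ascending=False).take(n)  # step 2d, highest counts first
    least = swapped.sortByKey(ascending=True).take(n)  # same RDD, lowest counts first
    return most, least
```

An RDD has no direct "last n elements" operation, so this sketch sorts in ascending order and takes the first n to get the least common words. Many words are likely to tie at a count of 1, and which of them are returned is not determined by the count alone.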
Question 4 - Using Spark, get the 20 least common "lowercased" words (don't count stopwords).
Question 5 - Using Spark, count only the words that start with an uppercase letter and print out the top 10 of those words.
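One way to approach Question 5 is sketched below, under a few assumptions that the question leaves open: words keep their original case, a "word" is a run of ASCII letters, and stopwords are not filtered out. The names `find_capitalized` and `top_capitalized` are hypothetical, and `lines` is again assumed to be an RDD of text lines.

```python
import re
from operator import add

# Assumed definition of a capitalized word: starts at a word boundary with A-Z,
# followed by any number of ASCII letters.
CAPITALIZED_RE = re.compile(r"\b[A-Z][A-Za-z]*")


def find_capitalized(line):
    """Return a (word, 1) pair for every word that starts with an uppercase letter."""
    return [(w, 1) for w in CAPITALIZED_RE.findall(line)]


def top_capitalized(lines, n=10):
    """Return the n most frequent capitalized words as (word, count) pairs."""
    return (lines.flatMap(find_capitalized)
                 .reduceByKey(add)
                 # takeOrdered sorts ascending by key, so negate the count
                 # to get the largest counts first.
                 .takeOrdered(n, key=lambda kv: -kv[1]))
```

Using `takeOrdered` avoids swapping the pairs and sorting the whole RDD; the swap-and-`sortByKey` approach from the Question 3 hint would work equally well here.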