Setting up a pipeline to ingest data from Twitter

Reference no: EM133367353

Data Collection and Curation

Introduction

Arguably the most challenging task in a Big Data setting is acquiring the data that will later be used for analysis and prediction. To that end, in this assignment you will set up a pipeline that ingests data from Twitter, cleans and processes it, and loads it into a Hive table for analysis. You will use Apache Kafka and Apache Flume to ingest data into HDFS, Spark SQL for data analysis, and Spark ML for prediction.

Instructions

Step 1: Set up a Kafka producer to ingest tweets

Set up a Kafka producer in Python that retrieves data from Twitter for a specific set of keywords related to a subject of your choice (both the subject and the keywords are up to you) and sends it to a topic on a Kafka broker. You will need to sign up for a developer account with Twitter, which is free. Format the data so that the other components of the pipeline can ingest it easily. Twitter limits the number of calls a client can make within a given time window. Check the current limits and adjust your code so that tweets are received continuously without exceeding them. Some sample code is provided for setting up the producer, as well as online videos.
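The pacing and formatting requirements above can be sketched in plain Python. This is a minimal illustration, not the provided sample code: `fetch_batch` and `send` are hypothetical callables standing in for the Twitter API call and the Kafka producer's send method, and the field names `id`, `created_at`, and `text` are assumed rather than taken from the assignment. The actual call limits must be looked up in Twitter's documentation.

```python
import json
import time


def to_record(tweet: dict) -> bytes:
    """Flatten a tweet into one compact JSON line, encoded as UTF-8 bytes."""
    record = {
        "id": str(tweet.get("id", "")),
        "created_at": tweet.get("created_at", ""),
        # Collapse newlines and repeated spaces so one tweet stays on one line.
        "text": " ".join(str(tweet.get("text", "")).split()),
    }
    return json.dumps(record, ensure_ascii=False).encode("utf-8")


class RateLimiter:
    """Space calls evenly so at most max_calls happen per window_seconds."""

    def __init__(self, max_calls, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = window_seconds / max_calls
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self.clock()
        delay = 0.0
        if self.next_allowed is not None and now < self.next_allowed:
            delay = self.next_allowed - now
            self.sleep(delay)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
        return delay


def publish(fetch_batch, send, limiter, rounds):
    """Fetch tweets `rounds` times, pacing each fetch, and send every record."""
    sent = 0
    for _ in range(rounds):
        limiter.wait()  # pace the (hypothetical) Twitter API call
        for tweet in fetch_batch():
            send(to_record(tweet))
            sent += 1
    return sent
```

With a client library such as kafka-python, `send` could wrap the producer, for example `lambda value: producer.send("tweets", value)`, where the topic name `tweets` is only a placeholder. Injecting the clock and sleep functions keeps the limiter easy to test without real waiting.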

Step 2: Set up a Kafka consumer

Set up a Kafka consumer that reads from the Kafka topic and saves the data to HDFS. The consumer should be designed to handle large volumes of data and should be fault-tolerant. Some sample Kafka consumers are available as well.
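One common way to approach the volume and fault-tolerance requirements is to write records in batches and commit consumer offsets only after a batch has been written, which gives at-least-once delivery. The sketch below illustrates that idea under assumptions: `write_batch` and `commit` are hypothetical callables standing in for an HDFS client write and the consumer's offset commit (with automatic commits disabled), and the batch size is arbitrary.

```python
import json


class BatchWriter:
    """Buffer records and flush them in batches.

    Offsets are committed only after a batch is written, so a crash
    before the commit causes a re-read rather than data loss.
    """

    def __init__(self, write_batch, commit, batch_size=500):
        self.write_batch = write_batch  # hypothetical: appends text to HDFS
        self.commit = commit            # hypothetical: commits consumer offsets
        self.batch_size = batch_size
        self.buffer = []

    def add(self, raw: bytes) -> None:
        """Parse one message and flush when the buffer is full."""
        try:
            record = json.loads(raw.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            return  # skip malformed messages instead of stopping the consumer
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Write buffered records as JSON lines, then commit offsets."""
        if not self.buffer:
            return
        lines = "".join(json.dumps(r) + "\n" for r in self.buffer)
        # If the write raises, the buffer is kept and nothing is committed.
        self.write_batch(lines)
        self.commit()
        self.buffer = []
```

Because a batch can be written again after a failure, downstream steps may see duplicate tweets; deduplicating on the tweet id during cleaning is one way to handle that.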

Step 3: Set up a Flume agent

Apache Flume is a streaming tool typically used for text data. Compared with Apache Kafka, it is more lightweight to install and configure. Review the videos posted on Apache Flume and set up a Flume agent that gets data from Twitter and saves it to HDFS.

Step 4: Clean and Process Data

The data saved to HDFS needs to be cleaned and split into multiple columns. Where you clean the data is up to you: in the Kafka producer, in the Kafka consumer, in Flume, or at the end of the pipeline. Ensure that the data is formatted so that it can be easily loaded into Spark for later processing (see below).
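As an illustration of what cleaning into columns might look like, the sketch below strips URLs and mentions from the tweet text, extracts hashtags into their own column, and writes the result as CSV with a header row. The column set (`id`, `created_at`, `text`, `hashtags`) and the cleaning rules are assumptions chosen for the example; the assignment leaves both to you.

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
COLUMNS = ["id", "created_at", "text", "hashtags"]  # assumed column layout


def clean_tweet(tweet: dict) -> dict:
    """Return one row with cleaned text and a separate hashtags column."""
    text = URL_RE.sub(" ", str(tweet.get("text", "")))
    # Collect hashtags after URLs are gone so URL fragments are not matched.
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    return {
        "id": str(tweet.get("id", "")),
        "created_at": tweet.get("created_at", ""),
        "text": text,
        "hashtags": ";".join(hashtags),
    }


def to_csv(rows) -> str:
    """Serialize cleaned rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return out.getvalue()
```

A CSV file with a header row is one format that Spark can read directly into a DataFrame; JSON lines would work equally well if you prefer to keep nested fields.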

Step 5: Load Data into Spark SQL

The data must then be loaded into a Scala DataFrame for analysis. Use the DataFrame to run queries on the data you have read. The queries will depend on the subject you have chosen and the keywords used to retrieve tweets from Twitter.

Step 6: Train a Spark ML algorithm

Using the data in HDFS, train a machine learning algorithm with Spark ML to predict whether the tweets you have ingested have positive or negative sentiment. You may also choose a different prediction task depending on your subject.

Course: BDAT 1008 Data Collection and Curation, Georgian College