Reference no: EM131142323
I need help with my big data project.
It is up to us to define the specific design and to limit or expand the scope, but the project should involve substantial design, analysis, programming, and validation. I have listed the topic and technology that should be used for the project. You can make changes and consider other attributes for the system. Let me know the price and the process for moving forward.
Topic: Recommendation or clustering system for IMDB movie dataset.
Technology: Spark will be used for solving the problem.
Data: https://www.imdb.com/interfaces
The movie recommendation system will consider the following attributes from the movie database:
1) Title
2) Genre
3) Actors
4) Actresses
5) Directors
6) Rating
Hypothesis: We wish to build a movie recommendation system that suggests movies a user might be interested in, based on the user's tastes, interests, and connections to other people. I also need a report.
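The posting does not fix an algorithm, so the following is only a minimal sketch of one possible content-based approach, assuming each movie is reduced to sets of genres, cast members, and directors. The `Movie` record, the weights, and the function names are hypothetical, and the code is plain Python rather than Spark; in a Spark job the same scoring function could be applied to candidate pairs in parallel.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Movie:
    # Hypothetical record layout built from the attributes listed above.
    title: str
    genres: frozenset = field(default_factory=frozenset)
    cast: frozenset = field(default_factory=frozenset)  # actors and actresses together
    directors: frozenset = field(default_factory=frozenset)
    rating: float = 0.0


def jaccard(a, b):
    """Intersection size over union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(m1, m2, weights=(0.5, 0.3, 0.2)):
    """Weighted Jaccard overlap of genres, cast, and directors.

    The weights are illustrative assumptions, not tuned values.
    """
    w_genre, w_cast, w_director = weights
    return (w_genre * jaccard(m1.genres, m2.genres)
            + w_cast * jaccard(m1.cast, m2.cast)
            + w_director * jaccard(m1.directors, m2.directors))


def recommend(liked, catalog, k=3):
    """Return up to k titles most similar to `liked`; ties go to the higher rating."""
    candidates = [m for m in catalog if m.title != liked.title]
    ranked = sorted(candidates,
                    key=lambda m: (similarity(liked, m), m.rating),
                    reverse=True)
    return [m.title for m in ranked[:k]]
```

Using set overlap keeps the sketch independent of any rating history, which suits a dataset that describes movies but not individual users. If user ratings become available, a collaborative filtering method would be a natural alternative.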
Parts of the report:
I. Design:
The design document should contain your proposed design of the solution.
- Summary of problem definition - Focus on explaining what you want to do in the project, any assumptions, and limitations.
- Description of input data - In the design document, you have to provide a summary of your data - for example, data format, attributes, and metadata. Please do not include data inside the report.
- Explanation of your algorithm and pseudocode
- Explanation of your Big Data strategy - Which Big Data strategy did you use, and how does it make sense in the overall picture? If you used multiple technologies, list the project phases where you used each.
- Create a data flow diagram for your application. An example of a DFD for MapReduce is here: https://creately.com/diagram/example/h21wfdxq2/MapReduce Similarly, you can create a data flow diagram for your machine learning/data analysis strategy.
- Details of how your application handles bad or missing data, and whether your strategy is robust, i.e. whether it can recover from errors. Similarly, if you use machine learning, explain how you handle overfitting.
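As an illustration of handling bad or missing data, here is a minimal sketch that assumes the input is tab-separated with a header row and that missing values are marked with `\N`. The column names `title`, `genre`, and `rating`, and the 0 to 10 rating range, are assumptions for the example; adjust them to the actual files. Unusable rows are counted and skipped instead of aborting the run.

```python
import csv

NULL_MARKER = r"\N"  # assumed marker for a missing value


def clean_rows(lines, required=("title", "rating")):
    """Return (kept_rows, dropped_count) for tab-separated lines with a header."""
    kept, dropped = [], 0
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Short rows are padded with None values; extra fields land under the None key.
        if None in row or any(value is None for value in row.values()):
            dropped += 1
            continue
        # Normalise empty strings and the null marker to None.
        row = {key: (None if value.strip() in ("", NULL_MARKER) else value.strip())
               for key, value in row.items()}
        if any(row.get(column) is None for column in required):
            dropped += 1
            continue
        try:
            row["rating"] = float(row["rating"])
        except ValueError:
            dropped += 1  # rating is not a number
            continue
        if not 0.0 <= row["rating"] <= 10.0:
            dropped += 1  # rating outside the assumed valid range
            continue
        # A missing genre is tolerated and becomes an empty list.
        row["genre"] = row["genre"].split(",") if row.get("genre") else []
        kept.append(row)
    return kept, dropped
```

Reporting the dropped count alongside the kept rows makes it easy to state in the report how much of the input was discarded and why.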
II. Analysis of Results:
In this section, you will present your final results and analyze them. The following are key points:
• Summarize your results well. This could be in the form of tables, graphs, plots, or other visualization tools.
• Validate your results.
For example, you can compute the accuracy of your model on the test dataset, or use cross-validation on the training dataset, or you can show that there is a correlation between positive review and star rating, or that there is a correlation between positive sentiment and stock price. Try to come up with numerical results.
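To make the validation step concrete, the following minimal sketch shows a reproducible train/test split, a root-mean-square error for predicted versus actual ratings, and a Pearson correlation coefficient. The function names, the 20% test fraction, and the seed are illustrative assumptions; a Spark-based project would more likely use the equivalent facilities of its machine learning library.

```python
import math
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and split into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def rmse(predicted, actual):
    """Root-mean-square error between two equally long, non-empty sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)
```

Fixing the seed keeps the split identical between runs, so numbers quoted in the report can be regenerated.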
• If your results are below expectation, explain the probable reasons.
III. Conclusion
The following are key points:
• Explain how using Big Data helped you with this project, for example by helping you arrive at a better, faster, or more efficient solution.
• Describe what you learned in this project.
• Describe how your technique/strategy can be improved.
PROJECT SOURCE CODE - Source Code and Sample data files:
Some of the coding requirements are as follows:
- Please include a README file indicating which language and technology you used and how to compile your code.
- You need to use HDFS for at least some part of the project.
- This could mean that you use HDFS for data extraction, pre-processing, or actual classification or clustering. The key is you have to use HDFS somewhere.
- Your code should be well documented.
- Ideally, you should create a UNIX script such that the entire workflow - data extraction, parsing, pre-processing, analysis, MapReduce, machine learning task - can be run using that script. The script can accept parameters from the command line.
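One way to picture the single-entry-point workflow, including the HDFS step, is the sketch below. It is written in Python rather than as a shell script, and the default HDFS directory `/user/project/imdb` and the job script name `recommend.py` are hypothetical placeholders. The `--dry-run` flag prints the commands without executing them, which is useful on a machine without Hadoop or Spark installed.

```python
import argparse
import subprocess


def build_plan(local_data, hdfs_dir, job_script):
    """Return the ordered commands for one end-to-end run."""
    hdfs_input = f"{hdfs_dir}/input"
    hdfs_output = f"{hdfs_dir}/output"
    return [
        # Create the input directory in HDFS (no error if it already exists).
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_input],
        # Upload the local data file, overwriting an earlier copy.
        ["hdfs", "dfs", "-put", "-f", local_data, hdfs_input],
        # Run the (hypothetical) Spark job on the uploaded data.
        ["spark-submit", job_script, hdfs_input, hdfs_output],
    ]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run the whole workflow.")
    parser.add_argument("local_data", help="path to the local input file")
    parser.add_argument("--hdfs-dir", default="/user/project/imdb")
    parser.add_argument("--job-script", default="recommend.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the commands without running them")
    args = parser.parse_args(argv)

    plan = build_plan(args.local_data, args.hdfs_dir, args.job_script)
    for command in plan:
        print(" ".join(command))
        if not args.dry_run:
            # check=True stops the workflow at the first failing stage.
            subprocess.run(command, check=True)
    return plan
```

Calling `main()` from a script entry point lets the file be run directly from the command line, for example with a data path followed by `--dry-run`.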
Please attach a sample of your data. This should not be the entire dataset, but just the top few lines, so the TA can run your code. About 1000 lines/records should be fine.
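A sample of this kind can be produced with a few lines of code. The sketch below copies the first lines of a plain or gzip-compressed text file into a new file; the function name and the UTF-8 encoding are assumptions, and a header row, if present, counts as one of the copied lines.

```python
import gzip
from itertools import islice


def write_sample(src_path, dst_path, n_lines=1000):
    """Copy the first n_lines lines of src_path to dst_path; return lines written."""
    # Pick the gzip reader for .gz files, the built-in open otherwise.
    opener = gzip.open if src_path.endswith(".gz") else open
    written = 0
    with opener(src_path, "rt", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines, so the full file is never loaded.
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

Because the file is read lazily, this works on inputs much larger than memory.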