Project
Task
You have been provided with a dataset, which you are to use to predict crop yield given climate data, i.e. a regression problem. The task is to perform a "typical" machine learning project:
• Prepare the data
• Implement multiple algorithms, specifically:
  ◦ Linear regression (as the baseline; you may reuse your earlier code)
  ◦ Regression forest (use the technique from the lecture; do not use a histogram)
  ◦ Gaussian process
  ◦ One of your own choosing (one we taught or something else)
• Evaluate the algorithms on the dataset and tune them to maximise performance
• Critically analyse the results
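As an illustration of the baseline step, the following is a minimal sketch of least-squares linear regression in NumPy, checked on a small synthetic problem with known weights. The function names (`fit_linear`, `predict_linear`, `rmse`) and the synthetic data are assumptions made for illustration; they are not part of the provided material.

```python
import numpy as np

def fit_linear(X, y):
    """Fit least-squares weights; the bias is the last entry of the result."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(X, w):
    """Predict targets for X using weights returned by fit_linear."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic check (assumed toy data): a linear target with small noise,
# so a correct implementation should recover the known weights.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5, 4.0])  # three slopes, then the bias
y_toy = X_toy @ true_w[:3] + true_w[3] + rng.normal(scale=0.01, size=200)

w_hat = fit_linear(X_toy, y_toy)
toy_error = rmse(y_toy, predict_linear(X_toy, w_hat))
```

The same `rmse` helper could be reused to compare the other algorithms on held-out data, so that all of them are scored in the same way.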
1. What exploration of the data set was conducted?
2. How was the dataset prepared?
3. How does a regression forest work?
4. How does a Gaussian process work?
5. Which additional algorithm did you choose and why?
6. What are the pros and cons of the algorithms?
7. Describe the toy problem used to validate the algorithms and explain its design.
8. What evidence of correct, or incorrect, implementation did the toy problem provide?
9. How were the hyperparameters optimised?
10. What results are obtained by the algorithms?
11. How fast do the algorithms run and how fast could they run?
12. Which algorithm would you deploy and why?
13. How could the best algorithm be improved further?
14. If you were to try another algorithm then which one and why?
15. Are the results good enough for real world use?
16. How could this solution fail?
17. What improvements could be made to the data set?
Data set
The task is to predict crop yield (tonnes per hectare) for the major maize crop of farms distributed across the entire planet (the major crop is the main crop of the year; farms may also grow a second crop with a lower yield). For each farm, climate data is provided, specifically three features for each month:
• Total rainfall, in mm.
• Mean minimum temperature, in degrees Celsius (minimum temperature for each day, averaged over the entire month).
• Mean maximum temperature, in degrees Celsius (computed in the same way).
Year has also been included to account for improvements in farming techniques; the data covers three decades.
The data is provided as one CSV file (it includes a header); you are responsible for splitting it for training, testing and hyperparameter learning. Each row is an exemplar, and each column a feature.
Your task is to predict the last column (#38) given the first 37 columns. There are 31744 exemplars.
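One possible way to handle the split is sketched below: shuffle the rows, then cut them into training, validation (for hyperparameter learning) and test portions, separating the first 37 columns from the last. The function name `split_dataset`, the 70/15/15 fractions and the file name in the comment are assumptions for illustration; a random stand-in array with the documented shape is used in place of the real file.

```python
import numpy as np

def split_dataset(data, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle rows, then split into train / validation / test as (X, y) pairs.

    The last column is treated as the target and all others as features.
    The third fraction is implicit: the test set takes the remaining rows.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(data.shape[0])
    n_train = int(fractions[0] * len(order))
    n_val = int(fractions[1] * len(order))
    parts = np.split(data[order], [n_train, n_train + n_val])
    return [(p[:, :-1], p[:, -1]) for p in parts]

# Stand-in array with the documented shape (31744 rows, 38 columns).
# The real data might instead be loaded with, for example:
#   data = np.loadtxt("data.csv", delimiter=",", skiprows=1)  # hypothetical file name
data = np.random.default_rng(1).normal(size=(31744, 38))

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_dataset(data)
```

A purely random split is only one option. Because year is one of the features, holding out whole years is an alternative worth considering, and any feature scaling should be computed from the training portion alone.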
This data set was created by merging and subsampling:
• "The global dataset of historical yields for major crops 1981-2016" by Iizumi & Sakai.
• USA NOAA World Weather Records (WWR).
Attachment: Project.rar