Describing and Visualising Statistical Data
Exercises
Question 1: In a new, empty .py file, write a small program that calculates and prints out the mean, median, and mode of the following set of values:
1978, 1936, 1941, 1999, 2000, 2001, 2020, 2049, 2000, 1801, 1664
When calculating the mode, it's easiest to use 'from collections import Counter' to get access to a Counter object, which will do the instance counting for you (see slides 10-12).
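If you haven't used Counter before, here is a minimal sketch of the three calculations on a small made-up list (the name scores and its values are hypothetical, not the exercise data). It is one possible approach rather than the approach from the slides, and it assumes the list is not empty.

```python
from collections import Counter

# Hypothetical toy data, NOT the values from Question 1
scores = [7, 3, 7, 9, 3, 7, 5]

# Mean: the sum of the values divided by how many values there are
mean = sum(scores) / len(scores)

# Median: the middle value of the sorted list
# (or the average of the two middle values when the count is even)
ordered = sorted(scores)
mid = len(ordered) // 2
if len(ordered) % 2 == 1:
    median = ordered[mid]
else:
    median = (ordered[mid - 1] + ordered[mid]) / 2

# Mode: most_common(1) returns a list holding one (value, count) pair
counts = Counter(scores)
mode, mode_count = counts.most_common(1)[0]

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode, "which occurs", mode_count, "times")
```

Note that most_common(1) only ever reports a single value, so it is not enough on its own when two values tie for the highest count.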
Question 2: Add the value 2001 to the above list of values. Now, both 2000 and 2001 occur twice - so the data has multiple modes. Modify your program to cater for this, i.e. if you ask it to calculate the mode of the list of values it will return a list containing both 2000 and 2001. See slide 13 if you need help.
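One way to handle ties, sketched on made-up data: find the highest count, then keep every value that reaches it. The function name find_modes is hypothetical, and the sketch assumes a non-empty list (max() raises a ValueError on an empty one). Slide 13 may do this differently.

```python
from collections import Counter

def find_modes(numbers):
    """Return a sorted list of every value that shares the highest count."""
    counts = Counter(numbers)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)

# Hypothetical toy data: 1 and 4 both occur twice
print(find_modes([4, 1, 4, 2, 1, 3]))
```

Returning a list even when there is only one mode keeps the return type the same in every case, so the calling code never has to check what it got back.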
Question 3: Modify your code to print out the highest and lowest values in the list, and from these values calculate and print the range of the values (i.e. the difference between the highest and lowest values).
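Python's built-in max() and min() functions do most of the work here. A minimal sketch on a hypothetical list of temperatures (not the exercise data):

```python
# Hypothetical toy data, NOT the values from Question 1
temperatures = [12.5, 9.0, 15.5, 11.0]

highest = max(temperatures)
lowest = min(temperatures)

# The range is the difference between the highest and lowest values
value_range = highest - lowest

print("Highest:", highest)
print("Lowest:", lowest)
print("Range:", value_range)
```

The result is stored in value_range rather than range so that it does not hide Python's built-in range() function.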
Question 4: The following functions can be used to calculate the variance of a series of values. From the variance you can calculate the standard deviation as the square root of the variance (in Python you can do this by raising a value to the power of 0.5, for example: value = 9, sqr_root = value ** 0.5).
def calculate_mean(numbers):
    s = sum(numbers)
    N = len(numbers)
    mean = s / N
    return mean

def find_differences(numbers):
    # Difference between each value and the mean
    mean = calculate_mean(numbers)
    differences = []
    for num in numbers:
        differences.append(num - mean)
    return differences

def calculate_variance(numbers):
    differences = find_differences(numbers)
    # Square each difference, then average the squares
    squared_diff = []
    for d in differences:
        squared_diff.append(d ** 2)
    sum_squared_diff = sum(squared_diff)
    variance = sum_squared_diff / len(numbers)
    return variance
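Because the code above divides by len(numbers), it computes the population variance, not the sample variance (which would divide by N - 1). As a rough sanity check, the sketch below repeats the same calculation in compact form on hypothetical toy data and compares it with statistics.pvariance and statistics.pstdev from the standard library.

```python
import statistics

# Hypothetical toy data, chosen so the mean is exactly 5
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Same steps as calculate_variance: mean, squared differences, average
mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)

# Standard deviation is the square root of the variance
std_dev = variance ** 0.5

print("Variance:", variance)
print("Standard deviation:", std_dev)

# Cross-check against the standard library's population versions
print("statistics.pvariance:", statistics.pvariance(values))
print("statistics.pstdev:", statistics.pstdev(values))
```

If you compared against statistics.variance instead, the numbers would not match, because that function calculates the sample variance.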
There is a file on your Moodle shell under this week's materials called: pokemon_num_name_height_metres_weight_kgs.csv
This file contains the numbers, names, heights and weights of over 800 Pokemon (which I found here: https://pokemondb.net/pokedex/stats/height-weight). Take a look at this file in a text editor or Excel to see the kind of data we're working with.
To open the file and split up each line into a list of four strings we can use code like this:
import csv

# Make sure the filename matches the file you downloaded from Moodle exactly
with open('pokemon_num_name_height_metres_weight_kg.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        # 0 is number, 1 is name, 2 is height (m), 3 is weight (kg)
        print(row[0], row[1], row[2], row[3])
Now, using the above calculate_variance function, load the file and calculate the variance and standard deviation of the Pokemon heights and weights, and print them to the screen.
REMEMBER: Each row value (row[0], row[1] etc.) will be a string - so if we want to do any math with any of the numerical fields (which we do) then we'll need to cast them to be a float!
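Here is a minimal sketch of the casting step. So that it runs without the Moodle file, it reads three illustrative rows from an in-memory string (io.StringIO) laid out in the same four-column format. It assumes there is no header row. If your copy of the file does have one, skip it first (for example with next(reader)), otherwise float() will fail on the column titles.

```python
import csv
import io

# Illustrative stand-in for the real file: number, name, height (m), weight (kg)
sample = io.StringIO(
    "1,Bulbasaur,0.7,6.9\n"
    "4,Charmander,0.6,8.5\n"
    "7,Squirtle,0.5,9.0\n"
)

heights = []
weights = []

reader = csv.reader(sample, delimiter=',')
for row in reader:
    # Every field arrives as a string, so convert before doing any maths
    heights.append(float(row[2]))
    weights.append(float(row[3]))

print(heights)
print(weights)
```

With the real file you would swap the StringIO object for the open(...) call shown earlier and then pass the two lists to calculate_variance.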
If you've done this right, you should see output like this:
Height variance: 1.2243628057924427
Height standard deviation: 1.1065092886155283
-----
Weight variance: 15580.033463417874
Weight standard deviation: 124.82000425980554
Question 5: As we have the heights and weights for our Pokemon - let's create a quick scatterplot of the data:
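from pylab import plot, show, title, xlabel, ylabel

# Have access to the Pokemon heights and weights here!
# Plot weight on the x-axis against height on the y-axis, one 'x' marker per Pokemon
plot(weights, heights, 'x')
title('Pokemon Height Vs. Weight')
xlabel('Weight in Kilograms')
ylabel('Height in Metres')
# show() takes no plot argument; it displays the current figure
show()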
If everything's going as planned you should see a plot like this:
Question 6: Our final task for the day will be to determine if there is a statistically significant correlation between the height and the weight of a Pokemon - that is, are bigger Pokemon usually heavier? From looking at the plot, what do you think? Is there any correlation? Or maybe a weak positive, or weak negative correlation?
Here's some code we can use to determine the correlation coefficient of two sets of values:
def find_correlation(x, y):
    # Find the length of the lists
    n = len(x)
    # Find the sum of the products
    products = []
    for xi, yi in zip(x, y):
        products.append(xi * yi)
    sum_products = sum(products)
    # Find the sum of each list
    sum_x = sum(x)
    sum_y = sum(y)
    # Find the square of the sum of each list
    squared_sum_x = sum_x ** 2
    squared_sum_y = sum_y ** 2
    # Find the sum of the squared lists
    x_square = []
    for xi in x:
        x_square.append(xi ** 2)
    x_square_sum = sum(x_square)
    y_square = []
    for yi in y:
        y_square.append(yi ** 2)
    y_square_sum = sum(y_square)
    # Use formula to calculate correlation
    numerator = n * sum_products - sum_x * sum_y
    denominator1 = n * x_square_sum - squared_sum_x
    denominator2 = n * y_square_sum - squared_sum_y
    denominator = (denominator1 * denominator2) ** 0.5
    correlation = numerator / denominator
    return correlation
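The formula used above is r = (n * sum(xy) - sum(x) * sum(y)) / sqrt((n * sum(x^2) - sum(x)^2) * (n * sum(y^2) - sum(y)^2)). Before running it on the Pokemon data, it can help to check it against data where you already know the answer. The sketch below uses a compact version of the same formula (the function name pearson and the toy lists are hypothetical) on a perfectly increasing and a perfectly decreasing relationship, which should give values of 1 and -1.

```python
def pearson(x, y):
    """Compact form of the same correlation formula, for checking only."""
    n = len(x)
    sum_products = sum(xi * yi for xi, yi in zip(x, y))
    sum_x, sum_y = sum(x), sum(y)
    numerator = n * sum_products - sum_x * sum_y
    denominator1 = n * sum(xi ** 2 for xi in x) - sum_x ** 2
    denominator2 = n * sum(yi ** 2 for yi in y) - sum_y ** 2
    return numerator / (denominator1 * denominator2) ** 0.5

# Hypothetical toy data
x = [1, 2, 3, 4]
rising = [2, 4, 6, 8]     # y doubles x exactly: perfect positive correlation
falling = [8, 6, 4, 2]    # y falls as x rises: perfect negative correlation

print("Rising:", pearson(x, rising))
print("Falling:", pearson(x, falling))
```

Note that both versions divide by zero if every value in one of the lists is the same, because there is then no variation to correlate.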
Use the above function to calculate and print the correlation between the height and weight of Pokemon - if you've done it correctly, you should get output similar to the following:
Correlation between height and weight is: 0.6424145098518806
Looking at the below, this means that there IS a positive correlation: the height of a Pokemon is an indicator that allows us to estimate its weight. But the correlation is weak, so any estimate that we come up with may be a long way from the actual weight of the Pokemon!