What does it tell us about the accuracy

Reference no: EM132723182

With this, we want "Student" and "Tutor" to become the column names (not the first row values), and we want everything else to be turned into numbers (they look like numbers, but they are not numbers at this point).

We get rid of the first row - like this:
dfTsc <- dfTsc[-1, ]
Now let's manually set the names of the columns:
names(dfTsc)[1] <- "Student"
names(dfTsc)[2] <- "Tutor"
Now let's convert all the values in each of those columns to numeric:
dfTsc$Student <- as.numeric(dfTsc$Student)
dfTsc$Tutor <- as.numeric(dfTsc$Tutor)
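To see what the conversion does, here is a minimal sketch on made-up marks (the `marks` data frame and its values are hypothetical, not taken from the course data). It assumes the columns were read in as text, which is what happens when a header row ends up among the values.

```r
# Hypothetical marks stored as text, with one value that is not a number
marks <- data.frame(Student = c("6.5", "7", "n/a"),
                    Tutor   = c("6", "7.5", "8"),
                    stringsAsFactors = FALSE)

# as.numeric() parses the text; anything it cannot parse becomes NA
# (R warns about this, so the warning is silenced here on purpose)
marks$Student <- suppressWarnings(as.numeric(marks$Student))
marks$Tutor   <- as.numeric(marks$Tutor)

str(marks)                 # both columns are now numeric
sum(is.na(marks$Student))  # how many values could not be converted
```

Counting the NA values after converting is a quick way to check that nothing unexpected was lost.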
Now, let's find the correlation of the two values before we go on:
cor(dfTsc$Student,dfTsc$Tutor,use="pairwise.complete.obs")
What you should find is that it is quite high. Yes, students do mark similarly to tutors!
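The use="pairwise.complete.obs" argument matters when some marks are missing. A small sketch with invented vectors (`student` and `tutor` here are hypothetical) shows that it simply drops any pair where either value is NA:

```r
# Hypothetical marks with one missing value on each side
student <- c(6, 7, NA, 8, 5)
tutor   <- c(6.5, 7, 8, NA, 5.5)

# Without the 'use' argument, a single NA makes the whole result NA
cor(student, tutor)

# With it, only the pairs where both values are present are used
r <- cor(student, tutor, use = "pairwise.complete.obs")

# The same number, computed by filtering the complete pairs by hand
ok <- complete.cases(student, tutor)
r_manual <- cor(student[ok], tutor[ok])
```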
Let's plot that:
plot(dfTsc$Student,dfTsc$Tutor,xlab="Student",ylab="Tutor", main="Student vs Teacher marks")
It's a little disappointing: because there are only two tutors, their marks must be either a whole number or end in .5, which makes the data look a bit strange (the strongest visual impression is of straight lines). There is a little trick in R called "jitter", which slightly varies the position of each point to make the graphic more readable. Let's vary the tutor points:
plot(dfTsc$Student,jitter(dfTsc$Tutor,1),xlab="Student",ylab="Tutor", main="Student vs Teacher marks")
Here I have varied the position of the tutor dots. This makes the relationship between tutor and student marks easier to see. However, to get the full meaning of the data, let's put a trendline in:
abline(lm(dfTsc$Tutor ~ dfTsc$Student))
When you use that and look at it, what kind of meaning can you take from it?
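One way to read the trendline is to look at the numbers behind it. The sketch below uses invented marks (`student` and `tutor` are hypothetical vectors) and puts the tutor mark on the left of the formula, assuming the tutor mark is on the y-axis of the plot.

```r
# Hypothetical paired marks
student <- c(5, 6, 7, 8, 9)
tutor   <- c(5.5, 6, 7, 7.5, 8.5)

# Fit tutor mark as a linear function of student mark
fit <- lm(tutor ~ student)

# Intercept and slope of the line that abline() would draw
coef(fit)
```

A slope close to 1 with an intercept close to 0 would suggest students and tutors give much the same marks. In this toy data the slope is below 1, so the tutor marks change less than the student marks do across the same pieces of work.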

Try both out. Which is preferable? But ultimately, what does it tell us about the data? What does it tell us about the accuracy?
That's the bar plot - which is good for comparing quantities and means (though percentages are often better displayed with a pie chart).
However, in some cases we want to see the typical value plus the spread. This is particularly important for things like share prices: you want to know the typical price, but also get a sense of how much it varies. To do that we use the box plot, which shows the median and quartiles for each item, with whiskers reaching out to the most extreme values that are not flagged as outliers.
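To make concrete what a box plot draws, the sketch below asks R for the numbers behind it using boxplot.stats() from base R. The `prices` vector is invented for illustration.

```r
# Hypothetical share prices, with one unusually high value
prices <- c(10, 12, 11, 13, 12, 14, 11, 30)

stats <- boxplot.stats(prices)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Points drawn individually as outliers, beyond the whiskers
stats$out

# The mean is pulled up by the outlier; the median is not
mean(prices)
median(prices)
```

The thick line in the box is the median rather than the mean, which is why a single extreme value moves it so little.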
To do that, let's open the rainfall data again. In this case, I am going to use the tidyverse library to manipulate the data so that it can be plotted as a series of boxplots.
Just as shares have an average but also a variance, rainfall by month does also. There will be an average rainfall, but also, what to expect in a bad month, or in a good month.
Run these two commands
library(tidyverse)
rainfall <- read_csv("rainfall.csv")
At the moment, the month is a factor in the data, appearing in the column of months. However, for us to make boxplots out of them, we will need to radically reshape the data such that the months become columns. Reshaping data is one of the most important skills when preparing data for visualization, so please look up some of the techniques for doing that.

Now have a look at the data - it should be something like:

   Year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  1910 111.  126.   49.9  95.3  71.8  70.2  97.1 140.
2  1911  59.2  99.7  62.1  69    52.2  77    43.3  69.3
3  1912 112.   79.5 128.   36.1  58.2 124.   92.3 168.
4  1913 123.   57.1 131.  103.   81.5  63.8  33.7  44.5
5  1914  78.8 115.  124.   52.3  59.6  52.5  94.4  80.1
6  1915 119.  141.   55    65.0  53.2  37.7 125.   81.6

And with that we can make our boxplot, though we don't want the first column. So instead of plotting the whole thing, we will plot rfWide2[-c(1)], meaning the table minus the first column:
boxplot(rfWide2[-c(1)])
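The negative index is worth a quick look on its own. In this sketch, `wide` is a hypothetical three-column stand-in for rfWide2:

```r
# Hypothetical wide table: a Year column followed by month columns
wide <- data.frame(Year = c(1910, 1911),
                   Jan  = c(111, 59.2),
                   Feb  = c(126, 99.7))

# A negative index drops columns by position: here, column 1 (Year)
months_only <- wide[-c(1)]

names(months_only)
```

Without this step the Year values would be drawn as one more box, on a completely different scale from the rainfall figures.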
The next plot is the scatterplot, where we look at the relationship between two different variables (for instance, amount of rainfall vs height of plants) to see if there is a relationship between the two.
The data comes from a series of courses I taught nearly 10 years ago, in which I used peer assessment - each group of students had to present their work, and then it was marked by all the students and also 2 tutors.
I am showing you this because it also involves a little data preparation - namely to transpose -

In R there is the notion of "wide" vs "long" data. Wide data is data with lots of columns (which is what we want); here, however, we have data with few columns but many rows.
In the tidyverse there is the function spread, which can turn a single column (with different values in it) into a set of columns - so instead of saying
Feb 22
March 24
we would say
Feb March
22 24
To do this in R, run:
rfWide<-spread(rainfall,Month,Rainfallmm)
However, when we do that, the columns are not in month order, but in alphabetical order.
To put them in month order we will create a simple vector of what the column names should be
col_order <- c("Year","Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul","Aug", "Sep", "Oct", "Nov", "Dec")
And using that, we can re-order the columns (and just to be safe in case we mess something up we will put it into a new dataframe rfWide2)
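The reordering command itself is not shown here. One plausible form, offered as a sketch rather than as the original command, is to index the data frame by the vector of names. The small `rfWide` below is a hypothetical stand-in with only four months.

```r
# Hypothetical stand-in for the spread data, columns in alphabetical order
rfWide <- data.frame(Apr  = c(95.3, 69),
                     Feb  = c(126, 99.7),
                     Jan  = c(111, 59.2),
                     Mar  = c(49.9, 62.1),
                     Year = c(1910, 1911))

# Desired order (shortened to match the toy data)
col_order <- c("Year", "Jan", "Feb", "Mar", "Apr")

# Indexing by a character vector returns the columns in that order
rfWide2 <- rfWide[, col_order]

names(rfWide2)
```

With the real data, the full twelve-month col_order vector would be used in the same way.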

Firstly clone the following repository:

(if in R Studio - easiest way is "New Project -> Version Control")
What we will do, is attempt a number of plots in R Base Graphics. But also, I will show you some of the methods necessary to prepare your data in order that it is possible to barplot. I will cover the basics here - but for more sophisticated options - the best place in my experience

Firstly import the file: fruitdata_resolved_clean_jan.csv
This is basically the results of an experiment I did with students 2 years ago. I showed a sequence of vegetables, and students had to select which vegetable was shown. However, students were randomly allocated either a text interface with the names of the vegetables, or a visual interface with thumbnails of the vegetables. The time at which the images were projected was stored, in order to be able to work out how long it took for the student to select the vegetable being projected.

     ID pagemode   ipadd veg      iDdOk seconds
  <dbl> <chr>      <chr> <chr>    <lgl>   <dbl>
1   977 filter.php IP7   cucumber TRUE        1
2   983 filter.php IP11  cucumber TRUE        2
3   987 filter.php IP1   cucumber TRUE        3
4   988 filter.php IP3   cucumber TRUE        3
5   989 filter.php IP8   cucumber

Let's go back to the original data (df) and run the command:
table(df$pagemode,df$veg)

We get back:

           asparagus broccoli cabbage carrot cucumber lettuce onion pea pepper potato spinach tomato
filter.php        22       32      16     30       29       6    28  27     14     29      23     28
image.php         29       32      30     28       29      26    31  31     28     29      26     30

The great thing about the table command is that its output can be passed straight to a bar plot.
So create a new variable from the table command:
veggiesByMode <- table(df$pagemode,df$veg)
Then try to barplot it:
barplot(veggiesByMode)
Because of the nature of the data (two readings per category) what you will actually see at this point is a "stacked" bar chart - where both values are put in the same bar. However, you could show them side by side if you prefer:
barplot(veggiesByMode,beside=TRUE)
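To see why the table output suits barplot(), here is a sketch on a tiny invented data set (`toy` and its rows are hypothetical). The table is a two-way grid of counts: one row per interface and one column per vegetable.

```r
# Hypothetical observations: which interface, which vegetable
toy <- data.frame(
  pagemode = c("filter.php", "filter.php", "image.php", "image.php", "image.php"),
  veg      = c("carrot", "pea", "carrot", "carrot", "pea")
)

# Cross-tabulate: rows are interfaces, columns are vegetables
counts <- table(toy$pagemode, toy$veg)
counts

# Individual cells can be read by name
counts["image.php", "carrot"]

# barplot() draws one bar (or group of bars) per column;
# legend.text = TRUE labels the bars using the row names
# barplot(counts, beside = TRUE, legend.text = TRUE)
```

The barplot() call is left commented out so the sketch runs without opening a graphics device; uncomment it to draw the chart.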

Print that to see what it looks like. We can clearly see that the image interface was answered more quickly. But how to plot this?
Essentially, when making bar plots, you need to remember we are plotting numbers against names (e.g. a person's height vs a person's name).
dfAvs at this point looks like:

df$pagemode df$seconds
1 filter.php 4.552817

So essentially we need the first column to be the names, and the second column to be the values.
To do that let's rename the columns to something simpler:
colnames(dfAvs)<-c("Mode","Seconds")
Then we can make a barplot command using those column names:
barplot(dfAvs$Seconds,names.arg=dfAvs$Mode)
So in order to get a reasonable bar plot we had to mess with the data a bit.
Ok - so that is the users - what about the vegetables - were they all equally recognized or were some easier than others?
In this case, rather than averages, we want to do counts. And there is a very handy function in R which can help, called "table()".
In our case, we want to know how many correct identifications were made per vegetable, by input (text/image) modality.

The important variables are pagemode (two possibilities: filter.php (text) or image.php (thumbnails)), ipadd (anonymized IP address), veg (the name of the vegetable shown and correctly identified) and seconds (the time it took to identify). There is a weakness in this dataset in that it is cleaned to only have correctly identified vegetables (not mistakes); the full (but messy) data is in the Excel file in the repo.
The first thing we will do is give the imported dataframe a shorter name, because it's tedious to continually type fruitdata_resolved_clean. So instead just run this command:
df <- read_csv("fruitdata_resolved_clean.csv")
Essentially df now means the same as fruitdata_resolved_clean
The first thing we are clearly interested in is how long, on average, it took for students to identify the vegetables shown.
The easiest way to group data in order to get the means (though we could also use it to get other things like the standard deviation or the total) is to use the aggregate function
dfAvs <- aggregate(df$seconds ~ df$pagemode, FUN=mean)
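As a sketch of what aggregate does, here it is on a tiny invented data set (`toy` and its timings are hypothetical). This variant passes the data frame through the data argument, so the result gets plain column names instead of names like df$pagemode.

```r
# Hypothetical timings for the two interfaces
toy <- data.frame(
  pagemode = c("filter.php", "filter.php", "image.php", "image.php"),
  seconds  = c(4, 6, 2, 3)
)

# Mean of 'seconds' within each 'pagemode' group
avs <- aggregate(seconds ~ pagemode, data = toy, FUN = mean)
avs

# Swapping FUN gives other summaries, e.g. the standard deviation
sds <- aggregate(seconds ~ pagemode, data = toy, FUN = sd)
```

The formula reads as "seconds, grouped by pagemode": the value to summarise goes on the left of the ~ and the grouping variable on the right.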
