# Smart Alex solutions Chapter 7

This document contains abridged sections from Discovering Statistics Using R and RStudio by Andy Field so there are some copyright considerations. You can use this material for teaching and non-profit activities but please do not meddle with it or claim it as your own work. See the full license terms at the bottom of the page.

A student was interested in whether there was a positive relationship between the time spent doing an essay and the mark received. He got 45 of his friends and timed how long they spent writing an essay (hours) and the percentage they got in the essay (essay). He also translated these grades into their degree classifications (grade): in the UK, a student can get a first-class mark (the best), an upper-second-class mark, a lower second, a third, a pass or a fail (the worst). Using the data in the file essay_marks.csv find out what the relationship was between the time spent doing an essay and the eventual mark in terms of percentage and degree class (draw a scatterplot too).

dplyr::mutate(
)

Alternative, load the data directly from the discovr package:

essay_tib <- discovr::essay_marks

### Visualise the data

We’re interested in looking at the relationship between hours spent on an essay and the grade obtained. We could create a scatterplot of hours spent on the essay (x-axis) and essay mark (y-axis). I’ve chosen to highlight the degree classification grades using different colours.

ggplot2::ggplot(essay_tib, aes(hours, essay, colour = grade)) +
geom_point() +
geom_smooth(method = "lm", alpha = 0.1) +
coord_cartesian(xlim = c(0, 15), ylim = c(20, 80)) +
scale_x_continuous(breaks = seq(0, 15, 1)) +
scale_y_continuous(breaks = seq(20, 80, 5)) +
labs(x = "Time spent on essay (hours)", y = "Essay mark (%)", colour = "Grade", fill = "Grade") +
theme_minimal()

### Correlations

essay_tib %>%
dplyr::select(hours, essay) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs
hoursessay0.267-0.0290.521.814430.077Pearson45

The results in the table above indicate that the relationship between time spent writing an essay and grade awarded was not significant, Pearson’s r = 0.27, p = 0.077.

The second part of the question asks us to do the same analysis but when the percentages are recoded into degree classifications. The degree classifications are ordinal data (not interval): they are ordered categories. So we shouldn’t use Pearson’s test statistic, but Spearman’s and Kendall’s ones instead. Grade is a factor:

essay_tib$grade ## [1] Upper second class First class Third class First class ## [5] Lower second class Upper second class Upper second class Upper second class ## [9] Upper second class Third class Upper second class First class ## [13] First class Lower second class Lower second class Upper second class ## [17] Upper second class Upper second class Upper second class Lower second class ## [21] Lower second class First class Upper second class Upper second class ## [25] First class Upper second class Lower second class Lower second class ## [29] Lower second class First class Upper second class Upper second class ## [33] Upper second class First class First class Upper second class ## [37] Upper second class Upper second class Upper second class Upper second class ## [41] Upper second class Upper second class Lower second class Lower second class ## [45] First class ## Levels: First class Upper second class Lower second class Third class We need to convert it into numbers using as.numeric: essay_tib <- essay_tib %>% dplyr::mutate( grade_num = as.numeric(grade) ) Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.): idessayhoursgradegrade_num qxbcjg61.6810.63Upper second class2 csykgu69.557.29First class1 npxgpl48.235.05Third class4 gomiwl70.682.89First class1 qdywaa59.909.55Lower second class3 cmaprb61.1611.31Upper second class2 nrvhix67.627.47Upper second class2 ybytlf64.788.47Upper second class2 talwvg63.218.72Upper second class2 ceylea49.696.20Third class4 berjky63.724.77Upper second class2 cjxgpl72.2410.98First class1 gufenh70.595.22First class1 uaxxdm58.646.86Lower second class3 unnhpl58.729.80Lower second class3 Now the correlation: essay_tib %>% dplyr::select(hours, grade_num) %>% correlation::correlation(method = "spearman") %>% knitr::kable(digits = 3) Parameter1Parameter2rhoCI_lowCI_highSpMethodn_Obs hoursgrade_num-0.193-0.4610.10618110.930.204Spearman45 There was no significant relationship between degree grade classification for an essay and the time spent doing it,$ \rho_s = -0.19 $, p = 0.204. Note that the direction of the relationship has reversed. This has happened because the essay marks were recoded as 1 (first), 2 (upper second), 3 (lower second), and 4 (third), so high grades were represented by low numbers. ## Task 7.2 Using the notebook.csv data from Chapter 3, find out if the size of relationship between the participant’s sex and arousal. ### Load the data library(tidyverse) notebook_tib <- readr::read_csv("../data/essay_marks.csv") %>% dplyr::mutate( sex = forcats::as_factor(sex), film = forcats::as_factor(film) ) Alternative, load the data directly from the discovr package: notebook_tib <- discovr::notebook ### Estimate the correlation Sex is a categorical variable with two categories, therefore, we need to quantify this relationship using a point-biserial correlation. notebook_tib %>% dplyr::mutate( sex_bin = ifelse(sex == "Male", 0, 1), ) %>% dplyr::select(arousal, sex_bin) %>% correlation::correlation() %>% knitr::kable(digits = 3) Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs arousalsex_bin-0.196-0.4780.123-1.232380.226Pearson40 I used a two-tailed test because one-tailed tests should never really be used. There was no significant relationship between biological sex and arousal because the p-value is larger than 0.05, $$r_\text{pb}$$ = –0.20, p = 0.226. ## Task 7.3 Using the notebook data again, quantify the relationship between the film watched and arousal. notebook_tib %>% dplyr::mutate( film_bin = ifelse(film == "The notebook", 0, 1), ) %>% dplyr::select(arousal, film_bin) %>% correlation::correlation() %>% knitr::kable(digits = 3) Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs arousalfilm_bin-0.865-0.927-0.757-10.606380Pearson40 There was a significant relationship between the film watched and arousal, $$r_\text{pb}$$ = –0.86, p < 0.001. Looking at how the groups were coded, you should see that The Notebook had a code of 0, and the documentary about notebooks had a code of 1, therefore the negative coefficient reflects the fact that as film goes up (changes from 0 to 1) arousal goes down. Put another way, as the film changes from The Notebook to a documentary about notebooks, arousal decreases. So The Notebook gave rise to the greater arousal levels. ## Task 7.4 As a statistics lecturer I am interested in the factors that determine whether a student will do well on a statistics course. Imagine I took 25 students and looked at their grades for my statistics course at the end of their first year at university: first, upper second, lower second and third class (see Task 1). I also asked these students what grade they got in their high school maths exams. In the UK GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F (an A grade is the best). The data for this study are in the file grades.csv. To what degree does GCSE maths grade correlate with first-year statistics grade? ### Load the data The categories in the data file have a meaningful order, so after importing the data we transform stats and gcse to factors but also order the factor levels in the correct order (this is important!). grade_tib <- readr::read_csv("../data/grades.csv") %>% dplyr::mutate( stats = forcats::as_factor(stats) %>% forcats::fct_relevel("First class", "Upper second class", "Lower second class", "Third class", "Pass", "Fail"), gcse = forcats::as_factor(gcse) %>% forcats::fct_relevel("A", "B", "C", "D", "E", "F") ) Alternative, load the data directly from the discovr package: grade_tib <- discovr::grades ### Estimate the correlation Let’s look at these variables. In the UK, GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F. These grades are categories that have an order of importance (an A grade is better than all of the lower grades). In the UK, a university student can get a first-class mark, an upper second, a lower second, a third, a pass or a fail. These grades are categories, but they have an order to them (an upper second is better than a lower second). When you have categories like these that can be ordered in a meaningful way, the data are said to be ordinal. The data are not interval, because a first-class degree encompasses a 30% range (70–100%), whereas an upper second only covers a 10% range (60–70%). When data have been measured at only the ordinal level they are said to be non-parametric and Pearson’s correlation is not appropriate. Therefore, the Spearman correlation coefficient is used. In the file, the scores are in two columns: one labelled stats and one labelled gcse. Table: Table 1: First 10 rows of grade_tib idstatsgcse wbmewFirst classA ypusyFirst classA sbumvFirst classC wqicpUpper second classC xnobiUpper second classC glxdwUpper second classD eomcnUpper second classE bjhclLower second classA wcypeLower second classB grkvuLower second classC The first thing we need to do is to convert stats and gcse to numbers using as.numeric, which will return the numeric value attached to each factor level: grade_tib <- grade_tib %>% dplyr::mutate( stat_num = as.numeric(stats), gcse_num = as.numeric(gcse) ) Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.): Table: Table 2: First 10 rows of grade_tib idstatsgcsestat_numgcse_num wbmewFirst classA11 ypusyFirst classA11 sbumvFirst classC13 wqicpUpper second classC23 xnobiUpper second classC23 glxdwUpper second classD24 eomcnUpper second classE25 bjhclLower second classA31 wcypeLower second classB32 grkvuLower second classC33 Now the correlation: grade_tib %>% dplyr::select(stat_num, gcse_num) %>% correlation::correlation(method = "spearman") %>% knitr::kable(digits = 3) Parameter1Parameter2rhoCI_lowCI_highSpMethodn_Obs stat_numgcse_num0.4550.0720.721418.0350.022Spearman25 In the question I predicted that better grades in GCSE maths would correlate with better degree grades for my statistics course. This hypothesis is directional and so a one-tailed test could be selected; however, in the chapter I advised against one-tailed tests so I have done two-tailed. We could report these results as follows: • There was a significant positive relationship between a person’s statistics grade and their GCSE maths grade,$ r_\text{s} \$ = 0.45, *p* = 0.022.

In Figure 3.4 we saw some data relating to people’s ratings of dishonest acts and the likeableness of the perpetrator (for a full description see Jane Superbrain Box 3.1). Compute the Spearman correlation between ratings of dishonesty and likeableness of the perpetrator. The data are in honesty_lab.csv.

dplyr::mutate(
sex = forcats::as_factor(sex),
film = forcats::as_factor(film)
)

Alternative, load the data directly from the discovr package:

honesty_tib <- discovr::honesty_lab

### Estimate the correlation

honesty_tib %>%
dplyr::select(deed, likeableness) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1Parameter2rhoCI_lowCI_highSpMethodn_Obs
deedlikeableness0.8440.7760.89226054.240Spearman100

The relationship between ratings of dishonesty and likeableness of the perpetrator was significant because the p-value is less than 0.05 (p = 0.000). The value of Spearman’s correlation coefficient is quite large and positive (0.84), indicating a large positive effect: the more likeable the perpetrator was, the more positively their dishonest acts were viewed.

We could report the results as follows:

• There was a large and significant positive relationship between the likeableness of a perpetrator and how positively their dishonest acts were viewed, $$r_\text{s}$$ = 0.84, p < 0.001.

In Chapter 1 (Task 6) we looked at data from people who had been forced to marry goats and dogs and measured their life satisfaction and, also, how much they like animals (goat_or_dog.csv). Is there a significant correlation between life satisfaction and the type of animal to which a person was married

dplyr::mutate(
wife = forcats::as_factor(wife)
)

Alternative, load the data directly from the discovr package:

goat_tib <- discovr::animal_bride

### Estimate the correlation

Wife is a categorical variable with two categories (goat or dog). Therefore, we need to look at this relationship using a point-biserial correlation.

goat_tib <- goat_tib %>%
dplyr::mutate(
wife_bin = ifelse(wife == "Goat", 0, 1),
)

goat_tib %>%
dplyr::select(life_satisfaction, wife_bin) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs
life_satisfactionwife_bin0.630.2610.8393.446180.003Pearson20

I used a two-tailed test because one-tailed tests should never really be used (see book chapter for more explanation). As you can see there, was a significant relationship between type of animal wife and life satisfaction because our p-value is less than 0.05, $$r_\text{pb}$$ = 0.63, p = 0.003. Looking at how the groups were coded, you should see that goat had a code of 0 and dog had a code of 1, therefore this result reflects the fact that as wife goes up (changes from 0 to 1) life satisfaction goes up. Put another way, as wife changes from goat to dog, life satisfaction increases. So, being married to a dog was associated with greater life satisfaction.

Repeat the analysis above taking account of animal liking when computing the correlation between life satisfaction and the animal to which a person was married.

We can conduct a partial correlation between life satisfaction and the animal to which a person was married while ‘adjusting’ for the effect of liking animals. Remember that we added the variable wife_bin to goat_tib in the previous solution.

goat_tib %>%
dplyr::select(-wife) %>%
correlation::correlation(partial = TRUE) %>%
knitr::kable(digits = 3)
Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs
animallife_satisfaction0.6150.2360.8313.306180.008Pearson20
animalwife_bin-0.399-0.7150.053-1.846180.081Pearson20
life_satisfactionwife_bin0.7010.3750.8734.173180.002Pearson20

Notice that the partial correlation between wife and life satisfaction is 0.70, which is greater than the correlation when the effect of animal liking is not adjusted for (r = 0.630 from the previous task). The correlation has become more statistically significant (its p-value has decreased from 0.003 to 0.002..)

Running this analysis has shown us that type of wife alone explains a large portion of the variation in life satisfaction. In other words, the relationship between wife and life satisfaction is not due to animal liking.

In Chapter 1 (Task 7) we looked at data based on findings that the number of cups of tea drunk was related to cognitive functioning (Feng et al., 2010). The data are in the file tea_makes_you_brainy_15.csv. What is the correlation between tea drinking and cognitive functioning? Is there a significant effect?

Alternative, load the data directly from the discovr package:

tea15_tib <- discovr::tea_15

### Estimate the correlation

Because the numbers of cups of tea and cognitive function are both interval variables, we can conduct a Pearson’s correlation coefficient.

tea15_tib %>%
dplyr::select(tea, cog_fun) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs
teacog_fun0.078-0.4530.5670.281130.783Pearson15

I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter). The results in the table above indicate that the relationship between number of cups of tea drunk per day and cognitive function was not significant. We can tell this because our p-value is greater than 0.05. Pearson’s r = 0.08, p = 0.783.

The research in the previous task was replicated but in a larger sample (N = 716), which is the same as the sample size in Feng et al.’s research (tea_makes_you_brainy_716.csv). Conduct a correlation between tea drinking and cognitive functioning. Compare the correlation coefficient and significance in this large sample, with the previous task. What statistical point do the results illustrate?

Alternative, load the data directly from the discovr package:

tea716_tib <- discovr::tea_716

### Estimate the correlation

tea716_tib %>%
dplyr::select(tea, cog_fun) %>%
correlation::correlation() %>%
knitr::kable(digits = 3)
Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs
teacog_fun0.0780.0040.152.0817140.038Pearson716

We can see that although the value of Pearson’s r has not changed, it is still very small (0.08), the relationship between the number of cups of tea drunk per day and cognitive function is now just significant (p = 0.038). This example indicates one of the downfalls of significance testing; you can get significant results when you have large sample sizes even if the effect is very small. Basically, whether you get a significant result or not is entirely subject to the sample size.

In Chapter 6 we looked at hygiene scores over three days of a rock music festival (download_festival.csv). Using Spearman’s correlation, were hygiene scores on day 1 of the festival significantly correlated with those on day 3?

dplyr::mutate(
gender = forcats::as_factor(gender)
)

Alternative, load the data directly from the discovr package:

### Estimate the correlation

First, we need to make the data messy so that scores for different days are in different columns (rather than different rows):

dplyr::select(day_1, day_3) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1Parameter2rhoCI_lowCI_highSpMethodn_Obs
day_1day_30.3440.1780.491203307.30Spearman123

The hygiene scores on day 1 of the festival correlated significantly with hygiene scores on day 3. The value of Spearman’s correlation coefficient is 0.344, which is a positive value suggesting that the smellier you are on day 1, the smellier you will be on day 3, $$r_\text{s}$$ = 0.34, p < 0.001.

Using the data in shopping_exercise.csv (Chapter 1, Task 5), find out if there is a significant relationship between the time spent shopping and the distance covered.

dplyr::mutate(
sex = forcats::as_factor(sex)
)

Alternative, load the data directly from the discovr package:

shopping_tib <- discovr::shopping

### Estimate the correlation

The variables Time and Distance are both interval. Therefore, we can conduct a Pearson’s correlation. I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter).

shopping_tib %>%
dplyr::select(distance, time) %>%
correlation::correlation(method = "spearman") %>%
knitr::kable(digits = 3)
Parameter1Parameter2rhoCI_lowCI_highSpMethodn_Obs
distancetime0.830.4210.959280.003Spearman10

The output indicates that there was a significant positive relationship between time spent shopping and distance covered. We can tell that the relationship was significant because the p-value is smaller than 0.05. Also, our value for Pearson’s r is very large (0.83) indicating a large effect. Pearson’s r = 0.83, p = 0.003.

What effect does accounting for the participant’s biological sex have on the relationship between the time spent shopping and the distance covered?

To answer this question, we need to conduct a partial correlation between the time spent shopping (interval variable) and the distance covered (interval variable) while ‘adjusting’ for the effect of sex (dicotomous variable).

First, let’s turn sex into a numeric variable:

shopping_tib <- shopping_tib %>%
dplyr::mutate(
sex_bin = ifelse(sex == "Male", 0, 1),
)

The data look like this:

Table: Table 3: First 10 rows of shopping_tib

sexdistancetimesex_bin
Male0.16150
Male0.40300
Male1.36370
Male1.99650
Male3.611030
Female1.40221
Female1.811401
Female1.961601
Female3.021831
Female4.822451

Now lets run the partial correlation:

shopping_tib %>%
dplyr::select(-sex) %>%
correlation::correlation(partial = TRUE) %>%
knitr::kable(digits = 3)
Parameter1Parameter2rCI_lowCI_hightdfpMethodn_Obs
distancetime0.8200.3950.9564.0680.011Pearson10
distancesex_bin-0.351-0.8030.358-1.0680.320Pearson10
timesex_bin0.6440.0240.9062.3880.089Pearson10

The partial correlation between time and distance is 0.82, which is slightly smaller than the correlation when the effect of sex is not adjusted for (r = 0.830). The correlation has become slightly less statistically significant (its p-value has increased from 0.003 to 0.011). However, not a lot has changed, so even after adjusting for biological sex, the relationship between distance covered and time spent shopping is very strong.