Smart Alex

Note: Who is Smart Alex?

Alex was aptly named because she’s, like, super smart. She likes teaching people, and her hobby is posing people questions so that she can explain the answers to them. Alex appears at the end of each chapter of Discovering Statistics Using R and RStudio (2nd edition) to pose you some questions and give you tasks to help you to practice your data analysis skills. This page contains her answers to those questions.

Warning: Usage

This document contains abridged sections from Discovering Statistics Using R and RStudio by Andy Field so there are some copyright considerations. You can use this material for teaching and non-profit activities but please do not meddle with it or claim it as your own work. See the full license terms at the bottom of the page.

General tips

Tip 1: Load packages

When attempting these tasks, remember to use a setup code chunk at the start of your Quarto document (below the YAML header) and load any packages you need using library(). In general, most tasks are likely to use the tidyverse and easystats packages, so as a bare minimum your setup code chunk should load both of these.
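For example, a minimal setup chunk might contain just these two lines (a sketch, assuming no other packages are needed for the task at hand):

```r
library(tidyverse)  # wrangling and plotting: attaches dplyr, ggplot2, readr, forcats, etc.
library(easystats)  # attaches the easystats ecosystem (e.g. parameters, effectsize)
```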

Tip 2: Load data

Many tasks will require you to load data. You can do this in one of two ways. The easiest is to grab the data directly from the discovr package using the general code:

my_tib <- discovr::name_of_data

in which you replace my_tib with the name you wish to give the data once loaded and name_of_data is the data you wish to load. In RStudio if you type discovr:: it will list the datasets available. For example, to load the data called students into a tibble called students_tib we’d execute:

students_tib <- discovr::students

Alternatively, download the CSV file from https://www.discovr.rocks/csv/name_of_file replacing name_of_file with the name of the CSV file. For example, to download students.csv use this URL: https://www.discovr.rocks/csv/students.csv

Set up an RStudio project (exactly as described in Chapter 4), place the CSV file in your data folder and use this general code:

my_tib <- here::here("data/name_of_file.csv") |> 
  read_csv()

in which you replace my_tib with the name you wish to give the data once loaded and name_of_file.csv is the name of the file you wish to read into R. We use here to locate the file within the project and read_csv() to read it into R. Using this method you will need to explicitly set any factor variables to factors. For example, for students.csv you would execute:

students_tib <- here::here("data/students.csv") |> 
  read_csv() |>
  mutate(
    group = as_factor(group)
  )

Chapter 1

Task 1.1

What are (broadly speaking) the five stages of the research process?

  1. Generating a research question: through an initial observation (hopefully backed up by some data).
  2. Generate a theory to explain your initial observation.
  3. Generate hypotheses: break your theory down into a set of testable predictions.
  4. Collect data to test the theory: decide on what variables you need to measure to test your predictions and how best to measure or manipulate those variables.
  5. Analyse the data: look at the data visually and by fitting a statistical model to see if it supports your predictions (and therefore your theory). At this point you should return to your theory and revise it if necessary.

Task 1.2

What is the fundamental difference between experimental and correlational research?

In a word, causality. In experimental research we manipulate a variable (predictor, independent variable) to see what effect it has on another variable (outcome, dependent variable). This manipulation, if done properly, allows us to compare situations where the causal factor is present to situations where it is absent. Therefore, if there are differences between these situations, we can attribute cause to the variable that we manipulated. In correlational research, we measure things that naturally occur and so we cannot attribute cause but instead look at natural covariation between variables.

Task 1.3

What is the level of measurement of the following variables?

  • The number of downloads of different bands’ songs on iTunes:
    • This is a discrete ratio measure. It is discrete because you can download only whole songs, and it is ratio because it has a true and meaningful zero (no downloads at all).
  • The names of the bands downloaded.
    • This is a nominal variable. Bands can be identified by their name, but the names have no meaningful order. The fact that Norwegian black metal band 1349 called themselves 1349 does not make them better than British boy-band has-beens 911; the fact that 911 were a bunch of talentless idiots does, though.
  • Their positions in the download chart.
    • This is an ordinal variable. We know that the band at number 1 sold more than the band at number 2 or 3 (and so on) but we don’t know how many more downloads they had. So, this variable tells us the order of magnitude of downloads, but doesn’t tell us how many downloads there actually were.
  • The money earned by the bands from the downloads.
    • This variable is continuous and ratio. It is continuous because money (pounds, dollars, euros or whatever) can be broken down into very small amounts (you can earn fractions of euros even though there may not be an actual coin to represent these fractions).
  • The weight of drugs bought by the band with their royalties.
    • This variable is continuous and ratio. If the drummer buys 100 g of cocaine and the singer buys 1 kg, then the singer has 10 times as much.
  • The type of drugs bought by the band with their royalties.
    • This variable is categorical and nominal: the name of the drug tells us something meaningful (crack, cannabis, amphetamine, etc.) but has no meaningful order.
  • The phone numbers that the bands obtained because of their fame.
    • This variable is categorical and nominal too: the phone numbers have no meaningful order; they might as well be letters. A bigger phone number did not mean that it was given by a better person.
  • The gender of the people giving the bands their phone numbers.
    • This variable is categorical: the people dishing out their phone numbers could fall into one of several categories based on how they self-identify when asked about their gender (their gender identity could be fluid). Taking a very simplistic view of gender, the variable might contain categories of male, female, and non-binary.
  • The instruments played by the band members.
    • This variable is categorical and nominal too: the instruments have no meaningful order but their names tell us something useful (guitar, bass, drums, etc.).
  • The time they had spent learning to play their instruments.
    • This is a continuous and ratio variable. The amount of time could be split into infinitely small divisions (nanoseconds even) and there is a meaningful true zero (no time spent learning your instrument means that, like 911, you can’t play at all).

Task 1.4

Say I own 857 CDs. My friend has written a computer program that uses a webcam to scan my shelves in my house where I keep my CDs and measure how many I have. His program says that I have 863 CDs. Define measurement error. What is the measurement error in my friend’s CD counting device?

Measurement error is the difference between the true value of something and the numbers used to represent that value. In this trivial example, the measurement error is 6 CDs. In this example we know the true value of what we’re measuring; usually we don’t have this information, so we have to estimate this error rather than knowing its actual value.

Task 1.5

Sketch the shape of a normal distribution, a positively skewed distribution and a negatively skewed distribution.

Normal

Positive skew

Negative skew

Task 1.6

In 2011 I got married and we went to Disney Florida for our honeymoon. We bought some bride and groom Mickey Mouse hats and wore them around the parks. The staff at Disney are really nice and upon seeing our hats would say ‘congratulations’ to us. We counted how many times people said congratulations over 7 days of the honeymoon: 5, 13, 7, 14, 11, 9, 17. Calculate the mean, median, sum of squares, variance and standard deviation of these data.

First compute the mean:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{5+13+7+14+11+9+17}{7} \\ &= \frac{76}{7} \\ &= 10.86 \end{aligned} \]

To calculate the median, first let’s arrange the scores in ascending order: 5, 7, 9, 11, 13, 14, 17. The median will be the (n + 1)/2th score. There are 7 scores, so this will be the 8/2 = 4th. The 4th score in our ordered list is 11.

To calculate the sum of squares, first subtract the mean from each score, then square these differences and, finally, add up the squared values:

Table 1: Calculating sums of squares
Score Error (score - mean) Error squared
5 -5.86 34.34
13 2.14 4.58
7 -3.86 14.90
14 3.14 9.86
11 0.14 0.02
9 -1.86 3.46
17 6.14 37.70
Total 104.86

So, the sum of squared errors is (Table 1):

\[ \begin{aligned} \text{SS} &= 34.34 + 4.58 + 14.90 + 9.86 + 0.02 + 3.46 + 37.70 \\ &= 104.86 \\ \end{aligned} \]

The variance is the sum of squared errors divided by the degrees of freedom:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{104.86}{6} \\ &= 17.48 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{17.48} \\ &= 4.18 \end{aligned} \]
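These hand calculations can be checked in R with base functions (a quick sketch; `congrats` is simply a name chosen for the vector):

```r
# Number of times staff said 'congratulations' over the 7 days
congrats <- c(5, 13, 7, 14, 11, 9, 17)

mean(congrats)                      # mean: 10.86 (2 dp)
median(congrats)                    # median: 11
sum((congrats - mean(congrats))^2)  # sum of squares: 104.86 (2 dp)
var(congrats)                       # variance: 17.48 (2 dp)
sd(congrats)                        # standard deviation: 4.18 (2 dp)
```

Note that var() and sd() use the N − 1 denominator, matching the calculations above.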

Task 1.7

In this chapter we used an example of the time taken for 21 heavy smokers to fall off a treadmill at the fastest setting (18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32, 34, 34, 36, 36, 43, 42, 49, 46, 46, 57). Calculate the sums of squares, variance and standard deviation of these data.

To calculate the sum of squares, subtract the mean from each value, then square these differences and add up the squared values (shown in the final column). The sum of squared errors is a massive 2685.24 (Table 2).

Table 2: Calculating sums of squares
Score Mean Difference Difference squared
18 32.19 -14.19 201.356
16 32.19 -16.19 262.116
18 32.19 -14.19 201.356
24 32.19 -8.19 67.076
23 32.19 -9.19 84.456
22 32.19 -10.19 103.836
22 32.19 -10.19 103.836
23 32.19 -9.19 84.456
26 32.19 -6.19 38.316
29 32.19 -3.19 10.176
32 32.19 -0.19 0.036
34 32.19 1.81 3.276
34 32.19 1.81 3.276
36 32.19 3.81 14.516
36 32.19 3.81 14.516
43 32.19 10.81 116.856
42 32.19 9.81 96.236
49 32.19 16.81 282.576
46 32.19 13.81 190.716
46 32.19 13.81 190.716
57 32.19 24.81 615.536
Total 2685.236

The variance is the sum of squared errors divided by the degrees of freedom (\(N-1\)). There were 21 scores and so the degrees of freedom were 20. The variance is, therefore:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{2685.24}{20} \\ &= 134.26 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{134.26} \\ &= 11.59 \end{aligned} \]
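Again, R will confirm these values (a sketch; `treadmill` is just a chosen name):

```r
# Seconds before each of 21 heavy smokers fell off the treadmill
treadmill <- c(18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32,
               34, 34, 36, 36, 43, 42, 49, 46, 46, 57)

sum((treadmill - mean(treadmill))^2)  # sum of squares: 2685.24 (2 dp)
var(treadmill)                        # variance: 134.26 (2 dp)
sd(treadmill)                         # standard deviation: 11.59 (2 dp)
```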

Task 1.8

Sports scientists sometimes talk of a ‘red zone’, which is a period during which players in a team are more likely to pick up injuries because they are fatigued. When a player hits the red zone it is a good idea to rest them for a game or two. At a prominent London football club that I support, they measured how many consecutive games the 11 first team players could manage before hitting the red zone: 10, 16, 8, 9, 6, 8, 9, 11, 12, 19, 5. Calculate the mean, standard deviation, median, range and interquartile range.

First we need to compute the mean:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{10+16+8+9+6+8+9+11+12+19+5}{11} \\ &= \frac{113}{11} \\ &= 10.27 \end{aligned} \]

Then the standard deviation, which we do as follows:

Table 3: Calculating sums of squares
Score Error (score - mean) Error squared
10 -0.27 0.07
16 5.73 32.83
8 -2.27 5.15
9 -1.27 1.61
6 -4.27 18.23
8 -2.27 5.15
9 -1.27 1.61
11 0.73 0.53
12 1.73 2.99
19 8.73 76.21
5 -5.27 27.77
Total 172.15

So, the sum of squared errors is (Table 3):

\[ \begin{aligned} \text{SS} &= 0.07 + 32.83 + 5.15 + 1.61 + 18.23 + 5.15 + 1.61 + 0.53 + 2.99 + 76.21 + 27.77 \\ &= 172.15 \\ \end{aligned} \]

The variance is the sum of squared errors divided by the degrees of freedom:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{172.15}{10} \\ &= 17.22 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{17.22} \\ &= 4.15 \end{aligned} \]

  • To calculate the median, range and interquartile range, first let’s arrange the scores in ascending order: 5, 6, 8, 8, 9, 9, 10, 11, 12, 16, 19. The median: The median will be the (\(n + 1\))/2th score. There are 11 scores, so this will be the 12/2 = 6th. The 6th score in our ordered list is 9 games. Therefore, the median number of games is 9.
  • The lower quartile: This is the median of the lower half of scores. If we split the data at 9 (the 6th score), there are 5 scores below this value. The median of these 5 scores is the (5 + 1)/2 = 3rd score. The 3rd score is 8; the lower quartile is therefore 8 games.
  • The upper quartile: This is the median of the upper half of scores. If we split the data at 9 again (not including this score), there are 5 scores above this value. The median of these 5 scores is the 3rd score above the median. The 3rd score above the median is 12; the upper quartile is therefore 12 games.
  • The range: This is the highest score (19) minus the lowest (5), i.e. 14 games.
  • The interquartile range: This is the difference between the upper and lower quartile: 12−8 = 4 games.
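A quick check in R (a sketch; `red_zone` is just a chosen name). One caveat: R's quantile() interpolates by default, so it won't necessarily reproduce the median-of-halves quartiles used above:

```r
red_zone <- c(10, 16, 8, 9, 6, 8, 9, 11, 12, 19, 5)

mean(red_zone)          # 10.27 (2 dp)
sd(red_zone)            # 4.15 (2 dp)
median(red_zone)        # 9
diff(range(red_zone))   # range: 14

# quantile()'s default (type = 7) interpolates, giving quartiles of 8 and 11.5;
# type = 1 reproduces the hand-calculated values for these data
quantile(red_zone, c(0.25, 0.75), type = 1)  # 8 and 12, so IQR = 4
```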

Task 1.9

Celebrities always seem to be getting divorced. The (approximate) length of some celebrity marriages in days are: 240 (J-Lo and Cris Judd), 144 (Charlie Sheen and Donna Peele), 143 (Pamela Anderson and Kid Rock), 72 (Kim Kardashian, if you can call her a celebrity), 30 (Drew Barrymore and Jeremy Thomas), 26 (Axl Rose and Erin Everly), 2 (Britney Spears and Jason Alexander), 150 (Drew Barrymore again, but this time with Tom Green), 14 (Eddie Murphy and Tracy Edmonds), 150 (Renee Zellweger and Kenny Chesney), 1657 (Jennifer Aniston and Brad Pitt). Compute the mean, median, standard deviation, range and interquartile range for these lengths of celebrity marriages.

First we need to compute the mean:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{240+144+143+72+30+26+2+150+14+150+1657}{11} \\ &= \frac{2628}{11} \\ &= 238.91 \end{aligned} \]

Then the standard deviation, which we do as follows:

Table 4: Calculating sums of squares
Score Error (score - mean) Error squared
240 1.09 1.19
144 -94.91 9007.91
143 -95.91 9198.73
72 -166.91 27858.95
30 -208.91 43643.39
26 -212.91 45330.67
2 -236.91 56126.35
150 -88.91 7904.99
14 -224.91 50584.51
150 -88.91 7904.99
1657 1418.09 2010979.25
Total 2268541

So, the sum of squared errors is the sum of the final column (Table 4). The variance is the sum of squared errors divided by the degrees of freedom:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{2268541}{10} \\ &= 226854.1 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{226854.1} \\ &= 476.29 \end{aligned} \]

  • To calculate the median, range and interquartile range, first let’s arrange the scores in ascending order: 2, 14, 26, 30, 72, 143, 144, 150, 150, 240, 1657. The median: The median will be the (n + 1)/2th score. There are 11 scores, so this will be the 12/2 = 6th. The 6th score in our ordered list is 143. The median length of these celebrity marriages is therefore 143 days.
  • The lower quartile: This is the median of the lower half of scores. If we split the data at 143 (the 6th score), there are 5 scores below this value. The median of these 5 scores is the (5 + 1)/2 = 3rd score. The 3rd score is 26; the lower quartile is therefore 26 days.
  • The upper quartile: This is the median of the upper half of scores. If we split the data at 143 again (not including this score), there are 5 scores above this value. The median of these 5 scores is the 3rd score above the median. The 3rd score above the median is 150; the upper quartile is therefore 150 days.
  • The range: This is the highest score (1657) minus the lowest (2), i.e. 1655 days.
  • The interquartile range: This is the difference between the upper and lower quartile: 150−26 = 124 days.
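In R these summary statistics come out as follows (a sketch; `marriages` is just a chosen name):

```r
# Lengths of celebrity marriages in days
marriages <- c(240, 144, 143, 72, 30, 26, 2, 150, 14, 150, 1657)

mean(marriages)          # 238.91 (2 dp)
sd(marriages)            # 476.29 (2 dp)
median(marriages)        # 143
diff(range(marriages))   # range: 1655
```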

Task 1.10

Repeat Task 9 but excluding Jennifer Aniston and Brad Pitt’s marriage. How does this affect the mean, median, range, interquartile range, and standard deviation? What do the differences in values between Tasks 9 and 10 tell us about the influence of unusual scores on these measures?

First let’s compute the new mean:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{240+144+143+72+30+26+2+150+14+150}{10} \\ &= \frac{971}{10} \\ &= 97.1 \end{aligned} \]

The mean length of celebrity marriages is now 97.1 days compared to 238.91 days when Jennifer Aniston and Brad Pitt’s marriage was included. This demonstrates that the mean is greatly influenced by extreme scores.

Let’s now calculate the standard deviation excluding Jennifer Aniston and Brad Pitt’s marriage (Table 5):

Table 5: Calculating sums of squares
Score Error (score - mean) Error squared
240 142.9 20420.41
144 46.9 2199.61
143 45.9 2106.81
72 -25.1 630.01
30 -67.1 4502.41
26 -71.1 5055.21
2 -95.1 9044.01
150 52.9 2798.41
14 -83.1 6905.61
150 52.9 2798.41
Total 56460.9

So, the sum of squared errors is:

\[ \begin{aligned} \text{SS} &= 20420.41 + 2199.61 + 2106.81 + 630.01 + 4502.41 + 5055.21 + 9044.01 + 2798.41 + 6905.61 + 2798.41 \\ &= 56460.90 \\ \end{aligned} \]

The variance is the sum of squared errors divided by the degrees of freedom:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{56460.90}{9} \\ &= 6273.43 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{6273.43} \\ &= 79.21 \end{aligned} \]

From these calculations we can see that the variance and standard deviation, like the mean, are both greatly influenced by extreme scores. When Jennifer Aniston and Brad Pitt’s marriage was included in the calculations (see Smart Alex Task 9), the variance and standard deviation were much larger, i.e. 226854.09 and 476.29 respectively.

  • To calculate the median, range and interquartile range, first, let’s again arrange the scores in ascending order but this time excluding Jennifer Aniston and Brad Pitt’s marriage: 2, 14, 26, 30, 72, 143, 144, 150, 150, 240.
  • The median: The median will be the (n + 1)/2th score. There are now 10 scores, so this will be the 11/2 = 5.5th. Therefore, we take the average of the 5th and 6th scores. The 5th score is 72, and the 6th is 143; the median is therefore 107.5 days.
  • The lower quartile: This is the median of the lower half of scores. If we split the data at 107.5 (a value not present in the data set), there are 5 scores below this value. The median of these 5 scores is the (5 + 1)/2 = 3rd score. The 3rd score is 26; the lower quartile is therefore 26 days.
  • The upper quartile: This is the median of the upper half of scores. If we split the data at 107.5 again, there are 5 scores above this value. The median of these 5 scores is the 3rd score above the median. The 3rd score above the median is 150; the upper quartile is therefore 150 days.
  • The range: This is the highest score (240) minus the lowest (2), i.e. 238 days. You’ll notice that without the extreme score the range drops dramatically from 1655 to 238 – less than half the size.
  • The interquartile range: This is the difference between the upper and lower quartile: 150 − 26 = 124 days of marriage. This is the same as the value we got when Jennifer Aniston and Brad Pitt’s marriage was included, which demonstrates the advantage of the interquartile range over the range: it isn’t affected by extreme scores at either end of the distribution.
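The effect of dropping the extreme score is easy to see in R (a sketch; `no_outlier` is just a name made up for the reduced data):

```r
marriages <- c(240, 144, 143, 72, 30, 26, 2, 150, 14, 150, 1657)
no_outlier <- marriages[marriages != 1657]  # drop Aniston and Pitt's marriage

mean(no_outlier)          # 97.1   (was 238.91 with the outlier)
sd(no_outlier)            # 79.21  (was 476.29)
median(no_outlier)        # 107.5  (was 143)
diff(range(no_outlier))   # 238    (was 1655)
```

The mean, standard deviation and range all shrink dramatically, while the median barely moves: a compact illustration of which statistics resist extreme scores.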

Chapter 2

Task 2.1

Why do we use samples?

We are usually interested in populations, but because we cannot collect data from every human being (or whatever) in the population, we collect data from a small subset of the population (known as a sample) and use these data to infer things about the population as a whole.

Task 2.2

What is the mean and how do we tell if it’s representative of our data?

The mean is a simple statistical model of the centre of a distribution of scores: a hypothetical estimate of the ‘typical’ score. We use the variance, or standard deviation, to tell us whether the mean is representative of our data. The standard deviation is a measure of how much error there is associated with the mean: a small standard deviation indicates that the mean is a good representation of our data.

Task 2.3

What’s the difference between the standard deviation and the standard error?

The standard deviation tells us how much observations in our sample differ from the mean value within our sample. The standard error tells us not about how the sample mean represents the sample itself, but how well the sample mean represents the population mean. The standard error is the standard deviation of the sampling distribution of a statistic. For a given statistic (e.g. the mean) it tells us how much variability there is in this statistic across samples from the same population. Large values, therefore, indicate that a statistic from a given sample may not be an accurate reflection of the population from which the sample came.

Task 2.4

In Chapter 1 we used an example of the time in seconds taken for 21 heavy smokers to fall off a treadmill at the fastest setting (18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32, 34, 34, 36, 36, 43, 42, 49, 46, 46, 57). Calculate the standard error and 95% confidence interval for these data.

If you did the tasks in Chapter 1, you’ll know that the mean is 32.19 seconds:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{16+(2\times18)+(2\times22)+(2\times23)+24+26+29+32+(2\times34)+(2\times36)+42+43+(2\times46)+49+57}{21} \\ &= \frac{676}{21} \\ &= 32.19 \end{aligned} \]

We also worked out that the sum of squared errors was 2685.24; the variance was 2685.24/20 = 134.26; the standard deviation is the square root of the variance, so was \(\sqrt{134.26}\) = 11.59. The standard error will be:

\[ SE = \frac{s}{\sqrt{N}} = \frac{11.59}{\sqrt{21}} = 2.53 \]

The sample is small, so to calculate the confidence interval we need to find the appropriate value of t. First we need to calculate the degrees of freedom, \(N − 1\). With 21 data points, the degrees of freedom are 20. For a 95% confidence interval we can look up the value in the column labelled ‘Two-Tailed Test’, ‘0.05’ in the table of critical values of the t-distribution (Appendix). The corresponding value is 2.09. The confidence interval is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(2.09 \times SE) \\ &= 32.19 - (2.09 \times 2.53) \\ &= 26.90 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(2.09 \times SE) \\ &= 32.19 + (2.09 \times 2.53) \\ &= 37.48 \end{aligned} \]
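The same interval can be computed in R using qt() to get the exact critical value of t (a sketch; using the unrounded t and SE shifts the limits very slightly relative to the hand calculation):

```r
treadmill <- c(18, 16, 18, 24, 23, 22, 22, 23, 26, 29, 32,
               34, 34, 36, 36, 43, 42, 49, 46, 46, 57)
n <- length(treadmill)
se <- sd(treadmill) / sqrt(n)     # standard error: 2.53 (2 dp)
t_crit <- qt(0.975, df = n - 1)   # critical t for 95% CI, df = 20: 2.09 (2 dp)

mean(treadmill) + c(-1, 1) * t_crit * se
# approx. 26.92 and 37.46 (the hand calculation's 26.90 and 37.48 used rounded values)
```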

Task 2.5

What do the sum of squares, variance and standard deviation represent? How do they differ?

All of these measures tell us something about how well the mean fits the observed sample data. Large values (relative to the scale of measurement) suggest the mean is a poor fit of the observed scores, and small values suggest a good fit. They are also, therefore, measures of dispersion, with large values indicating a spread-out distribution of scores and small values showing a more tightly packed distribution. These measures all represent the same thing, but differ in how they express it. The sum of squared errors is a ‘total’ and is, therefore, affected by the number of data points. The variance is the ‘average’ variability but in units squared. The standard deviation is the average variation but converted back to the original units of measurement. As such, the size of the standard deviation can be compared to the mean (because they are in the same units of measurement).

Task 2.6

What is a test statistic and what does it tell us?

A test statistic is a statistic for which we know how frequently different values occur. The observed value of such a statistic is typically used to test hypotheses, or to establish whether a model is a reasonable representation of what’s happening in the population.

Task 2.7

What are Type I and Type II errors?

A Type I error occurs when we believe that there is a genuine effect in our population, when in fact there isn’t. A Type II error occurs when we believe that there is no effect in the population when, in reality, there is.

Task 2.8

What is statistical power?

Power is the ability of a test to detect an effect of a particular size (a value of 0.8 is a good level to aim for).

Task 2.9

Figure 2.21 (in the book) shows two experiments that looked at the effect of singing versus conversation on how much time a woman would spend with a man. In both experiments the means were 10 (singing) and 12 (conversation), the standard deviations in all groups were 3, but the group sizes were 10 per group in the first experiment and 100 per group in the second. Compute the values of the confidence intervals displayed in the Figure.

Experiment 1:

In both groups, because they have a standard deviation of 3 and a sample size of 10, the standard error will be:

\[ SE = \frac{s}{\sqrt{N}} = \frac{3}{\sqrt{10}} = 0.95 \]

The sample is small, so to calculate the confidence interval we need to find the appropriate value of t. First we need to calculate the degrees of freedom, \(N − 1\). With 10 data points, the degrees of freedom are 9. For a 95% confidence interval we can look up the value in the column labelled ‘Two-Tailed Test’, ‘0.05’ in the table of critical values of the t-distribution (Appendix). The corresponding value is 2.26. The confidence interval for the singing group is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(2.26 \times SE) \\ &= 10 - (2.26 \times 0.95) \\ &= 7.85 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(2.26 \times SE) \\ &= 10 + (2.26 \times 0.95) \\ &= 12.15 \end{aligned} \]

For the conversation group:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(2.26 \times SE) \\ &= 12 - (2.26 \times 0.95) \\ &= 9.85 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(2.26 \times SE) \\ &= 12 + (2.26 \times 0.95) \\ &= 14.15 \end{aligned} \]

Experiment 2

In both groups, because they have a standard deviation of 3 and a sample size of 100, the standard error will be:

\[ SE = \frac{s}{\sqrt{N}} = \frac{3}{\sqrt{100}} = 0.3 \]

The sample is large, so to calculate the confidence interval we need to find the appropriate value of z. For a 95% confidence interval we should look up the value of 0.025 in the column labelled Smaller Portion in the table of the standard normal distribution (Appendix). The corresponding value is 1.96. The confidence interval for the singing group is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(1.96 \times SE) \\ &= 10 - (1.96 \times 0.3) \\ &= 9.41 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(1.96 \times SE) \\ &= 10 + (1.96 \times 0.3) \\ &= 10.59 \end{aligned} \]

For the conversation group:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(1.96 \times SE) \\ &= 12 - (1.96 \times 0.3) \\ &= 11.41 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(1.96 \times SE) \\ &= 12 + (1.96 \times 0.3) \\ &= 12.59 \end{aligned} \]
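All four intervals can be reproduced from the summary statistics alone using qt() and qnorm() (a sketch of the same calculations):

```r
# Experiment 1: n = 10 per group, sd = 3, so use the t-distribution
se_small <- 3 / sqrt(10)            # 0.95 (2 dp)
t_crit <- qt(0.975, df = 9)         # 2.26 (2 dp)
10 + c(-1, 1) * t_crit * se_small   # singing: approx. 7.85 to 12.15
12 + c(-1, 1) * t_crit * se_small   # conversation: approx. 9.85 to 14.15

# Experiment 2: n = 100 per group; the sample is large, so use z
se_large <- 3 / sqrt(100)           # 0.3
z_crit <- qnorm(0.975)              # 1.96 (2 dp)
10 + c(-1, 1) * z_crit * se_large   # singing: approx. 9.41 to 10.59
12 + c(-1, 1) * z_crit * se_large   # conversation: approx. 11.41 to 12.59
```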

Task 2.10

Figure 2.22 shows a similar study to above, but the means were 10 (singing) and 10.01 (conversation), the standard deviations in both groups were 3, and each group contained 1 million people. Compute the values of the confidence intervals displayed in the figure.

In both groups, because they have a standard deviation of 3 and a sample size of 1,000,000, the standard error will be:

\[ SE = \frac{s}{\sqrt{N}} = \frac{3}{\sqrt{1000000}} = 0.003 \]

The sample is large, so to calculate the confidence interval we need to find the appropriate value of z. For a 95% confidence interval we should look up the value of 0.025 in the column labelled Smaller Portion in the table of the standard normal distribution (Appendix). The corresponding value is 1.96. The confidence interval for the singing group is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(1.96 \times SE) \\ &= 10 - (1.96 \times 0.003) \\ &= 9.99412 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(1.96 \times SE) \\ &= 10 + (1.96 \times 0.003) \\ &= 10.00588 \end{aligned} \]

For the conversation group:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(1.96 \times SE) \\ &= 10.01 - (1.96 \times 0.003) \\ &= 10.00412 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(1.96 \times SE) \\ &= 10.01 + (1.96 \times 0.003) \\ &= 10.01588 \end{aligned} \]

Note: these values will look slightly different from those in the plot because the exact means were 10.00147 and 10.01006, but we rounded them off to 10 and 10.01 to make life a bit easier. If you use these exact values you’d get, for the singing group:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(1.96 \times SE) \\ &= 10.00147 - (1.96 \times 0.003) \\ &= 9.99559 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(1.96 \times SE) \\ &= 10.00147 + (1.96 \times 0.003) \\ &= 10.00735 \end{aligned} \]

For the conversation group:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(1.96 \times SE) \\ &= 10.01006 - (1.96 \times 0.003) \\ &= 10.00418 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(1.96 \times SE) \\ &= 10.01006 + (1.96 \times 0.003) \\ &= 10.01594 \end{aligned} \]

Task 2.11

In Chapter 1 (Task 8) we looked at an example of how many games it took a sportsperson before they hit the ‘red zone’. Calculate the standard error and confidence interval for those data.

We worked out in Chapter 1 that the mean was 10.27, the standard deviation 4.15, and there were 11 sportspeople in the sample. The standard error will be:

\[ SE = \frac{s}{\sqrt{N}} = \frac{4.15}{\sqrt{11}} = 1.25 \]

The sample is small, so to calculate the confidence interval we need to find the appropriate value of t. First we need to calculate the degrees of freedom, \(N − 1\). With 11 data points, the degrees of freedom are 10. For a 95% confidence interval we can look up the value in the column labelled ‘Two-Tailed Test’, ‘.05’ in the table of critical values of the t-distribution (Appendix). The corresponding value is 2.23. The confidence interval is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(2.23 \times SE) \\ &= 10.27 - (2.23 \times 1.25) \\ &= 7.48 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(2.23 \times SE) \\ &= 10.27 + (2.23 \times 1.25) \\ &= 13.06 \end{aligned} \]

Task 2.12

At a rival club to the one I support, they similarly measured the number of consecutive games it took their players before they reached the red zone. The data are: 6, 17, 7, 3, 8, 9, 4, 13, 11, 14, 7. Calculate the mean, standard deviation, and confidence interval for these data.

First we need to compute the mean:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{6+17+7+3+8+9+4+13+11+14+7}{11} \\ &= \frac{99}{11} \\ &= 9.00 \end{aligned} \]

Then the standard deviation, which we do as follows:

Table 6: Calculating sums of squares
Score Error (score - mean) Error squared
6 -3 9
17 8 64
7 -2 4
3 -6 36
8 -1 1
9 0 0
4 -5 25
13 4 16
11 2 4
14 5 25
7 -2 4
Total 188

The sum of squared errors is (Table 6):

\[ \begin{aligned} \text{SS} &= 9 + 64 + 4 + 36 + 1 + 0 + 25 + 16 + 4 + 25 + 4 \\ &= 188 \\ \end{aligned} \]

The variance is the sum of squared errors divided by the degrees of freedom:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{188}{10} \\ &= 18.8 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{18.8} \\ &= 4.34 \end{aligned} \]

There were 11 sportspeople in the sample, so the standard error will be: \[ SE = \frac{s}{\sqrt{N}} = \frac{4.34}{\sqrt{11}} = 1.31\]

The sample is small, so to calculate the confidence interval we need to find the appropriate value of t. First we need to calculate the degrees of freedom, \(N − 1\). With 11 data points, the degrees of freedom are 10. For a 95% confidence interval we can look up the value in the column labelled ‘Two-Tailed Test’, ‘0.05’ in the table of critical values of the t-distribution (Appendix). The corresponding value is 2.23. The confidence interval is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(2.23\times SE) \\ &= 9 - (2.23 \times 1.31) \\ & = 6.08 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(2.23\times SE) \\ &= 9 + (2.23 \times 1.31) \\ &= 11.92 \end{aligned} \]
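As a check, the same values can be computed from the raw scores in R. Because qt() gives a more precise critical value than the rounded 2.23, the limits come out very slightly different from the hand calculation (about 6.09 and 11.91).

```r
# Task 2.12 from the raw data
rival <- c(6, 17, 7, 3, 8, 9, 4, 13, 11, 14, 7)

mean(rival)                           # 9
sd(rival)                             # approx. 4.34
se <- sd(rival) / sqrt(length(rival)) # approx. 1.31
mean(rival) + c(-1, 1) * qt(0.975, df = length(rival) - 1) * se
```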

Task 2.13

In Chapter 1 (Task 9) we looked at the length in days of 11 celebrity marriages. Here are the approximate lengths in months of nine marriages, one being mine and the others being those of some of my friends and family. In all but two cases the lengths are calculated up to the day I’m writing this, which is 20 June 2023, but the 3- and 111-month durations are marriages that have ended – neither of these is mine, in case you’re wondering: 3, 144, 267, 182, 159, 152, 693, 50, and 111. Calculate the mean, standard deviation and confidence interval for these data.

First we need to compute the mean:

\[ \begin{aligned} \overline{X} &= \frac{\sum_{i=1}^{n} x_i}{n} \\ &= \frac{3 + 144 + 267 + 182 + 159 + 152 + 693 + 50 + 111}{9} \\ &= \frac{1761}{9} \\ &= 195.67 \end{aligned} \]

Compute the standard deviation as follows:

Table 7: Calculating sums of squares

| Score | Error (score - mean) | Error squared |
|------:|---------------------:|--------------:|
| 3     | -192.67              | 37121.73      |
| 144   | -51.67               | 2669.79       |
| 267   | 71.33                | 5087.97       |
| 182   | -13.67               | 186.87        |
| 159   | -36.67               | 1344.69       |
| 152   | -43.67               | 1907.07       |
| 693   | 497.33               | 247337.13     |
| 50    | -145.67              | 21219.75      |
| 111   | -84.67               | 7169.01       |
| Total |                      | 324044        |

The sum of squared errors is (Table 7):

\[ \begin{aligned} \text{SS} &= 37121.73 + 2669.79 + 5087.97 + 186.87 + 1344.69 + 1907.07 + 247337.13 + 21219.75 + 7169.01 \\ &= 324044 \\ \end{aligned} \]

The variance is the sum of squared errors divided by the degrees of freedom:

\[ \begin{aligned} s^2 &= \frac{SS}{N - 1} \\ &= \frac{324044}{8} \\ &= 40505.5 \end{aligned} \]

The standard deviation is the square root of the variance:

\[ \begin{aligned} s &= \sqrt{s^2} \\ &= \sqrt{40505.5} \\ &= 201.2598 \end{aligned} \]

The standard error is:

\[ \begin{aligned} SE &= \frac{s}{\sqrt{N}} \\ &= \frac{201.2598}{\sqrt{9}} \\ &= 67.0866 \end{aligned} \]

The sample is small, so to calculate the confidence interval we need to find the appropriate value of t. First we need to calculate the degrees of freedom, \(N − 1\). With 9 data points, the degrees of freedom are 8. For a 95% confidence interval we can look up the value in the column labelled ‘Two-Tailed Test’, ‘0.05’ in the table of critical values of the t-distribution (Appendix). The corresponding value is 2.31. The confidence interval is, therefore, given by:

\[ \begin{aligned} \text{95% CI}_\text{lower boundary} &= \overline{X}-(2.31 \times SE) \\ &= 195.67 - (2.31 \times 67.0866) \\ & = 40.70 \\ \text{95% CI}_\text{upper boundary} &= \overline{X}+(2.31 \times SE) \\ &= 195.67 + (2.31 \times 67.0866) \\ &= 350.64 \end{aligned} \]
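Again we can check the arithmetic from the raw scores in R. Here qt(0.975, df = 8) is about 2.306 rather than the rounded table value of 2.31, so the limits come out at roughly 40.96 and 350.37 instead of 40.70 and 350.64; the difference is purely rounding.

```r
# Task 2.13 from the raw data (durations in months)
marriages <- c(3, 144, 267, 182, 159, 152, 693, 50, 111)

mean(marriages)                               # approx. 195.67
sd(marriages)                                 # approx. 201.26
se <- sd(marriages) / sqrt(length(marriages)) # approx. 67.09
mean(marriages) + c(-1, 1) * qt(0.975, df = length(marriages) - 1) * se
```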

Chapter 3

Task 3.1

What is an effect size and how is it measured?

An effect size is an objective and standardized measure of the magnitude of an observed effect. Measures include Cohen’s d, the odds ratio and Pearson’s correlation coefficient, r. Cohen’s d, for example, is the difference between two means divided by either the standard deviation of the control group, or by a pooled standard deviation.

Task 3.2

In Chapter 1 (Task 8) we looked at an example of how many games it took a sportsperson before they hit the ‘red zone’, then in Chapter 2 we looked at data from a rival club. Compute and interpret Cohen’s \(\hat{d}\) for the difference in the mean number of games it took players to become fatigued in the two teams mentioned in those tasks.

Cohen’s d is defined as:

\[ \hat{d} = \frac{\bar{X_1}-\bar{X_2}}{s} \]

There isn’t an obvious control group, so let’s use a pooled estimate of the standard deviation:

\[ \begin{aligned} s_p &= \sqrt{\frac{(N_1-1) s_1^2+(N_2-1) s_2^2}{N_1+N_2-2}} \\ &= \sqrt{\frac{(11-1)4.15^2+(11-1)4.34^2}{11+11-2}} \\ &= \sqrt{\frac{360.23}{20}} \\ &= 4.24 \end{aligned} \]

Therefore, Cohen’s \(\hat{d}\) is:

\[ \hat{d} = \frac{10.27-9}{4.24} = 0.30 \]

Therefore, the second team fatigued in fewer matches than the first team by about 1/3 standard deviation. By the benchmarks that we probably shouldn’t use, this is a small to medium effect, but I guess if you’re managing a top-flight sports team, fatiguing 1/3 of a standard deviation faster than one of your opponents could make quite a substantial difference to your performance and team rotation over the season.
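The pooled standard deviation and \(\hat{d}\) can be checked in R from the summary statistics above (a sketch, not from raw data; because the standard deviations are rounded to two decimals here, the pooled value comes out at about 4.25 rather than the 4.24 in the text, but \(\hat{d}\) still rounds to 0.30).

```r
# Pooled standard deviation and Cohen's d-hat from the summary statistics
n1 <- 11; m1 <- 10.27; s1 <- 4.15   # first team (Chapter 1, Task 8)
n2 <- 11; m2 <- 9;     s2 <- 4.34   # rival team (Chapter 2, Task 12)

s_p <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d <- (m1 - m2) / s_p
s_p   # approx. 4.25
d     # approx. 0.30
```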

Task 3.3

Calculate and interpret Cohen’s \(\hat{d}\) for the difference in the mean duration of the celebrity marriages in Chapter 1 (Task 9) and me and my friends’ marriages (Chapter 2, Task 13).

Cohen’s \(\hat{d}\) is defined as:

\[ \hat{d} = \frac{\bar{X_1}-\bar{X_2}}{s} \]

There isn’t an obvious control group, so let’s use a pooled estimate of the standard deviation:

\[ \begin{aligned} s_p &= \sqrt{\frac{(N_1-1) s_1^2+(N_2-1) s_2^2}{N_1+N_2-2}} \\ &= \sqrt{\frac{(11-1)476.29^2+(9-1)8275.91^2}{11+9-2}} \\ &= \sqrt{\frac{550194093}{18}} \\ &= 5528.68 \end{aligned} \]

Therefore, Cohen’s \(\hat{d}\) is: \[\hat{d} = \frac{5057-238.91}{5528.68} = 0.87\] In other words, my friends’ (and my) marriages are 0.87 standard deviations longer than those in the sample of celebrities. By the benchmarks that we probably shouldn’t use, this is a large effect.
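The same check in R, using the summary statistics quoted above (celebrity marriages: n = 11, mean 238.91, sd 476.29; my and my friends’ marriages: n = 9, mean 5057, sd 8275.91, all in days):

```r
# Pooled standard deviation and Cohen's d-hat for Task 3.3
n_celeb <- 11; m_celeb <- 238.91; s_celeb <- 476.29
n_friend <- 9; m_friend <- 5057;  s_friend <- 8275.91

s_p <- sqrt(((n_celeb - 1) * s_celeb^2 + (n_friend - 1) * s_friend^2) /
              (n_celeb + n_friend - 2))
d <- (m_friend - m_celeb) / s_p
s_p   # approx. 5528.68
d     # approx. 0.87
```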

Task 3.4

What are the problems with null hypothesis significance testing?

  • We can’t conclude that an effect is important because the p-value from which we determine significance is affected by sample size. Therefore, the word ‘significant’ is meaningless when referring to a p-value.
  • The null hypothesis is never true. If the p-value is greater than .05 then we can decide to reject the alternative hypothesis, but this is not the same thing as the null hypothesis being true: a non-significant result tells us only that the effect is not big enough to be found; it doesn’t tell us that the effect is zero.
  • A significant result does not tell us that the null hypothesis is false (see text for details).
  • It encourages all or nothing thinking: if p < 0.05 then an effect is significant, but if p > 0.05 it is not. So, a p = 0.0499 is significant but a p = 0.0501 is not, even though these ps differ by only 0.0002.

Task 3.5

What is the difference between a confidence interval and a credible interval?

A 95% confidence interval is set so that before the data are collected there is a long-run probability of 0.95 (or 95%) that the interval will contain the true value of the parameter. This means that in 100 random samples, the intervals will contain the true value in 95 of them but won’t in 5. Once the data are collected, your sample is either one of the 95% that produces an interval containing the true value, or one of the 5% that does not. In other words, having collected the data, the probability of the interval containing the true value of the parameter is either 0 (it does not contain it) or 1 (it does contain it), but you do not know which. A credible interval is different in that it reflects the plausible probability that the interval contains the true value. For example, a 95% credible interval has a plausible 0.95 probability of containing the true value.

Task 3.6

What is a meta-analysis?

Meta-analysis is where effect sizes from different studies testing the same hypothesis are combined to get a better estimate of the size of the effect in the population.

Task 3.7

Describe what you understand by the term Bayes factor.

The Bayes factor is the ratio of the probability of the data given the alternative hypothesis to that of the data given the null hypothesis. A Bayes factor less than 1 supports the null hypothesis (it suggests the data are more likely given the null hypothesis than the alternative hypothesis); conversely, a Bayes factor greater than 1 suggests that the observed data are more likely given the alternative hypothesis than the null. Values between 1 and 3 are considered evidence for the alternative hypothesis that is ‘barely worth mentioning’, values between 3 and 10 are considered to indicate evidence for the alternative hypothesis that ‘has substance’, and values greater than 10 are strong evidence for the alternative hypothesis.

Task 3.8

Various studies have shown that students who use laptops in class often do worse on their modules (Payne-Carter et al., 2016). Table 3.3 (reproduced in Table 8) shows some fabricated data that mimics what has been found. What is the odds ratio for passing the exam if the student uses a laptop in class compared to if they don’t?

Table 8: Number of people who passed or failed an exam classified by whether they take their laptop to class
Laptop No Laptop Sum
Pass 24 49 73
Fail 16 11 27
Sum 40 60 100

First we compute the odds of passing when a laptop is used in class:

\[ \begin{aligned} \text{Odds}_{\text{pass when laptop is used}} &= \frac{\text{Number of laptop users passing exam}}{\text{Number of laptop users failing exam}} \\ &= \frac{24}{16} \\ &= 1.5 \end{aligned} \]

Next we compute the odds of passing when a laptop is not used in class:

\[ \begin{aligned} \text{Odds}_{\text{pass when laptop is not used}} &= \frac{\text{Number of students without laptops passing exam}}{\text{Number of students without laptops failing exam}} \\ &= \frac{49}{11} \\ &= 4.45 \end{aligned} \]

The odds ratio is the ratio of the two odds that we have just computed:

\[ \begin{aligned} \text{Odds Ratio} &= \frac{\text{Odds}_{\text{pass when laptop is used}}}{\text{Odds}_{\text{pass when laptop is not used}}} \\ &= \frac{1.5}{4.45} \\ &= 0.34 \end{aligned} \]

The odds of passing when using a laptop are 0.34 times those when a laptop is not used. If we take the reciprocal of this, we could say that the odds of passing when not using a laptop are 2.97 times those when a laptop is used.
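These calculations are trivial to reproduce in R directly from the cell counts in Table 8:

```r
# Odds ratio for passing the exam (laptop vs no laptop), from Table 8
odds_laptop    <- 24 / 16   # 1.5
odds_no_laptop <- 49 / 11   # approx. 4.45

odds_ratio <- odds_laptop / odds_no_laptop
odds_ratio       # approx. 0.34
1 / odds_ratio   # approx. 2.97
```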

Task 3.9

From the data in Table 3.1 (reproduced in Table 8), what is the conditional probability that someone used a laptop given that they passed the exam, p(laptop|pass)? What is the conditional probability that someone didn’t use a laptop in class given that they passed the exam, p(no laptop|pass)?

The conditional probability that someone used a laptop given they passed the exam is 0.33, or a 33% chance:

\[ p(\text{laptop|pass})=\frac{p(\text{laptop} \cap \text{pass})}{p(\text{pass})}=\frac{24/100}{73/100}=\frac{0.24}{0.73}=0.33 \]

The conditional probability that someone didn’t use a laptop in class given they passed the exam is 0.67 or a 67% chance.

\[ p(\text{no laptop|pass})=\frac{p(\text{no laptop} \cap \text{pass})}{p(\text{pass})}=\frac{49/100}{73/100}=\frac{0.49}{0.73}=0.67 \]

Task 3.10

Using the data in Table 3.1 (reproduced in Table 8), what are the posterior odds of someone using a laptop in class (compared to not using one) given that they passed the exam?

The posterior odds are the ratio of the posterior probability for one hypothesis to another. In this example it would be the ratio of the probability that a person used a laptop given that they passed (which we have already calculated above to be 0.33) to the probability that they did not use a laptop in class given that they passed (which we have already calculated above to be 0.67). The value turns out to be 0.49, which means that the probability that someone used a laptop in class if they passed the exam is about half of the probability that someone didn’t use a laptop in class given that they passed the exam.

\[ \text{posterior odds}= \frac{p(\text{hypothesis 1|data})}{p(\text{hypothesis 2|data})} = \frac{p(\text{laptop|pass})}{p(\text{no laptop| pass})} = \frac{0.33}{0.67} = 0.49 \]
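The conditional probabilities and posterior odds can be verified in R from Table 8 (note that R keeps full precision, so the posterior odds come out as 24/49 ≈ 0.49 rather than 0.33/0.67):

```r
# Conditional probabilities and posterior odds from Table 8
p_pass <- 73 / 100

p_laptop_given_pass    <- (24 / 100) / p_pass   # approx. 0.33
p_no_laptop_given_pass <- (49 / 100) / p_pass   # approx. 0.67

posterior_odds <- p_laptop_given_pass / p_no_laptop_given_pass
posterior_odds   # approx. 0.49
```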

Chapter 4

Task 4.1

Smart Alex’s first task for this chapter is to set up an RStudio project (you should have done this already if you’ve been following the chapter) and within the main project folder create folders called ‘data’, and ‘quarto’. Then create a Quarto document and follow the self-tests in the chapter to create a document containing some code and text.

This is a general video on workflow (note that I name the quarto folder as r_docs in this tutorial, but otherwise it follows the book chapter):

You can download a complete version of this file (chapter_4_self_tests.qmd) and its rendered output (chapter_4_self_tests.html) from www.discovr.rocks/repository/chapter_4_self_tests.zip.

Task 4.2

Render the document in Task 1, then locate the rendered file within your project folder.

A general tutorial on rendering is here:

You can download a complete version of this file (chapter_4_self_tests.qmd) and its rendered output (chapter_4_self_tests.html) from www.discovr.rocks/repository/chapter_4_self_tests.zip.

Task 4.3

The data below show the score (out of 20) for 20 different students, some of whom are male and some female, and some of whom were taught using positive reinforcement (being nice) and others who were taught using punishment (electric shock). Enter these data into a tibble. (Hint: the data below are in messy format, you need to enter them in tidy format.)

We have three variables here: sex (was the person male or female), the method of teaching they underwent and their mark on an assignment. Therefore, the tidy format is to arrange the data in three columns. There are several ways to do this, here’s one of them:

teaching_tib <- tibble::tibble(.rows = 20) |>
  mutate(
        method = c(rep("Electric shock", 10), rep("Being nice", 10)) |> as_factor(),
        sex = c(rep("Female", 5), rep("Male", 5)) |> rep(2) |> as_factor(),
        mark = c(6, 7, 5, 4, 8, 15, 14, 20, 13, 13, 12, 10, 7, 8, 13, 10, 9, 8, 6, 7)
        )

The data can be found in discovr::teaching (this version has an extra variable with the participant id).

Task 4.4

Thinking back to Labcoat Leni’s Real Research 4.1, Oxoby also measured the minimum acceptable offer; these MAOs (in dollars) are below (again, they are approximations based on the graphs in the paper). Enter these data into a tibble.

  • Bon Scott group: 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5
  • Brian Johnson group: 0, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 1

mao_tib <- tibble(.rows = 36) |> 
  mutate(
    singer = c(rep("Bon Scott", 18), rep("Brian Johnson", 18)) |> as_factor(),
    mao = c(2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 0, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 1)
    )

The full data can be found in discovr::acdc.

Task 4.5

According to some highly unscientific research done by a UK department store chain and reported in Marie Claire magazine, shopping is good for you. They found that the average woman spends 150 minutes and walks 2.6 miles when she shops, burning off around 385 calories. In contrast, men spend only about 50 minutes shopping, covering 1.5 miles. This was based on strapping a pedometer on a mere 10 participants. Although I don’t have the actual data, some simulated data based on these means are below. Enter these data into a tibble (again, you need to think about making the data tidy format).

We have three variables here: sex (was the person male or female), the distance they walked and the time they spent shopping. Therefore, we arrange the data in three columns:

shopping_tib <- tibble(.rows = 10) |>
  mutate(
    sex = c(rep("Male", 5), rep("Female", 5)) |> as_factor(),
    distance = c(0.16, 0.40, 1.36, 1.99, 3.61, 1.40, 1.81, 1.96, 3.02, 4.82),
    time = c(15, 30, 37, 65, 103, 22, 140, 160, 183, 245)
    )

The data can be found in discovr::shopping.

Task 4.6

I wondered whether a fish or cat made a better pet. I found some people who had either fish or cats as pets and measured their life satisfaction and how much they like animals. Enter these data into a tibble.

We have three variables here: pet (did the person have a fish or a cat as a pet?), their love of animals (animal) and their life_satisfaction score. The tidy format is to arrange the data in three columns. There are several ways to do this, here’s one of them:

pet_tib <- tibble(.rows = 20) |>
  mutate(
        pet = c(rep("Fish", 12), rep("Cat", 8)) |> as_factor(),
        animal = c(69, 25, 31, 29, 12, 49, 25, 35, 51, 40, 23, 37, 16, 65, 39, 35, 19, 53, 27, 44),
        life_satisfaction = c(47, 6, 47, 33, 13, 56, 42, 51, 42, 46, 27, 48, 52, 66, 65, 61, 60, 68, 37, 72)
        )

The full data can be found in discovr::pets.

Task 4.7

One of my favourite activities, especially when trying to do brain-melting things like writing statistics books, is drinking tea. I am English, after all. Fortunately, tea improves your cognitive function – well, it does in old Chinese people at any rate (Feng et al., 2010). I may not be Chinese and I’m not that old, but I nevertheless enjoy the idea that tea might help me think. Here are some data based on Feng et al.’s study that measured the number of cups of tea drunk per day and cognitive functioning (out of 80) in 15 people. Enter these data into a tibble.

We have two variables here: the number of cups of tea a person drinks per day (tea) and their cognitive functioning out of 80 (cog_fun). The tidy format is to arrange the data in two columns. There are several ways to do this, here’s one of them:

tea_tib <- tibble(.rows = 15) |>
  mutate(
        tea = c(2, 4, 3, 4, 2, 3, 5, 5, 2, 5, 1, 3, 3, 4, 1),
        cog_fun = c(60, 47, 31, 62, 44, 41, 49, 56, 45, 56, 57, 40, 54, 34, 46)
        )

The data (with an additional variable containing the participant id) can be found in discovr::tea_15.

Task 4.8

Statistics and maths anxiety are common and affect people’s performance on maths and stats assignments; women in particular can lack confidence in mathematics (Field, 2010). Zhang et al. (2013) did an intriguing study in which students completed a maths test in which some put their own name on the test booklet, whereas others were given a booklet that already had either a male or female name on. Participants in the latter two conditions were told that they would use this other person’s name for the purpose of the test. Women who completed the test using a different name performed significantly better than those who completed the test using their own name. (There were no such significant effects for men.) The data below are a random subsample of Zhang et al.’s data. Enter them into a tibble.

The design of this study is such that different people were put in one of three different conditions (female fake name, male fake name, own name), but the groups are not equal. In the female fake name group there were 10 females but only 7 males, in the fake male name there were 10 females and 9 males, and in the own name condition 7 females and 9 males. With a bit of adding we can see that there were 27 females in total, and 25 males. We’re going to have to use a lot of rep() statements and keep our wits about us!

We have three variables: whether the participant was male or female (sex), which booklet condition they were in (name_type) and their test score (accuracy). I’ve also included a participant id, just so that your data matches the file zhang_2013_subsample.csv that I provide for you. We can enter the data as follows (there are other ways too …):

zhang_tib <- tibble(.rows = 52) |>
  mutate(
    id = c(171, 35, 57, 36, 53, 176, 76, 184, 64, 166, 14, 100, 30, 49, 157, 14, 68, 71, 4, 40, 66, 27, 61, 27, 36, 33, 120, 113, 95, 99, 78, 32, 43, 183, 103, 31, 86, 54, 5, 20, 13, 59, 58, 188, 187, 15, 50, 9, 45, 60, 73, 189) |> as_factor(),
    sex = c(rep("Female", 27), rep("Male", 25)) |> as_factor(),
    name_type = c(rep("Female fake name", 10), rep("Male fake name", 10), rep("Own name", 7), rep("Female fake name", 7), rep("Male fake name", 9), rep("Own name", 9)) |> as_factor(),
        accuracy = c(33, 22, 46, 53, 14, 27, 64, 62, 75, 50, 69, 60, 82, 78, 38, 63, 46, 27, 61, 29, 75, 33, 83, 42, 10, 44, 27, 53, 47, 87, 41, 62, 67, 57, 31, 63, 34, 40, 22, 17, 60, 47, 57, 70, 57, 33, 83, 86, 65, 64, 37, 80)
        )

The data are in discovr::zhang_sample.

Task 4.9

What is a factor variable?

A variable in which numbers are used to represent group or category membership. An example would be a variable in which a score of 1 represents a person being female, and a 0 represents them being male.

Task 4.10

What is the difference between messy and tidy format data?

Tidy format data are arranged such that scores on an outcome variable appear in a single column and rows represent a combination of the attributes of those scores (for example, the entity from which the scores came, when the score was recorded, etc.). In tidy format data, scores from a single entity can appear over multiple rows, where each row represents a combination of the attributes of the score (e.g., levels of an independent variable or the time point at which the score was recorded).

In contrast, messy format data are arranged such that scores from a single entity appear in a single row and levels of independent or predictor variables are arranged over different columns. As such, in designs with multiple measurements of an outcome variable within a case, the outcome variable scores will be contained in multiple columns, each representing a level of an independent variable or a time point at which the score was observed. Columns can also represent attributes of the score or entity that are fixed over the duration of data collection (e.g., participant sex, employment status, etc.).
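To make this concrete, here is a minimal sketch of reshaping messy data into tidy format with tidyr::pivot_longer(). The tibble messy_tib is a made-up example (not one of the book’s datasets): one row per participant with a column per time point, reshaped so each score gets its own row.

```r
library(tidyverse)

# A messy tibble: one row per participant, one column per time point
messy_tib <- tibble(
  id = c("p1", "p2"),
  score_time_1 = c(10, 12),
  score_time_2 = c(14, 11)
)

# Tidy version: one row per score, with a column identifying the time point
tidy_tib <- messy_tib |>
  pivot_longer(
    cols = starts_with("score_time"),
    names_to = "time",
    names_prefix = "score_time_",
    values_to = "score"
  )
tidy_tib   # 4 rows: id, time, score
```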

Chapter 5

Task 5.1

The file students.csv contains some fictional data about students and lecturers including their number of friends, income (per annum), alcohol consumption per week (in units) and neuroticism (neurotic). Plot and interpret an error bar chart showing the mean number of friends for students and lecturers.

First, load the data (see Tip 2) using

student_tib <- discovr::students

To find out more about the data execute ?students. The code for the plot (Figure 1) is:

ggplot(student_tib, aes(group, friends)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange") +
  coord_cartesian(ylim = c(0, 20)) +
  scale_y_continuous(breaks = seq(0, 20, 2)) +
  labs(x = "Group", y = "Number of friends") +
  theme_minimal()
Figure 1

We can conclude that, on average, students had more friends than lecturers.

Task 5.2

Using the same data, plot and interpret an error bar chart showing the mean alcohol consumption for students and lecturers.

The code for the plot in Figure 2 is:

ggplot(student_tib, aes(group, alcohol)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange") +
  coord_cartesian(ylim = c(0, 30)) +
  scale_y_continuous(breaks = seq(0, 30, 5)) +
  labs(x = "Group", y = "Alcohol consumption per week (units)") +
  theme_minimal()
Figure 2

We can conclude that, on average, students and lecturers drank similar amounts, but the error bars tell us that the mean is a better representation of the population for students than for lecturers (there is more variability in lecturers’ drinking habits compared to students’).

Task 5.3

Using the same data, plot and interpret an error line chart showing the mean income for students and lecturers.

The code for the plot in Figure 3 is:

ggplot(student_tib, aes(group, income)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange") +
  coord_cartesian(ylim = c(0, 50000)) +
  scale_y_continuous(breaks = seq(0, 50000, 5000)) +
  labs(x = "Group", y = "Income (p.a.)") +
  theme_minimal()
Figure 3

We can conclude that, on average, students earn less than lecturers, but the error bars tell us that the mean is a better representation of the population for students than for lecturers (there is more variability in lecturers’ income compared to students’).

Task 5.4

Using the same data, plot and interpret an error line chart showing the mean neuroticism for students and lecturers.

The code for the plot in Figure 4 is:

ggplot(student_tib, aes(group, neurotic)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange") +
  coord_cartesian(ylim = c(0, 20)) +
  scale_y_continuous(breaks = seq(0, 20, 1)) +
  labs(x = "Group", y = "Neuroticism") +
  theme_minimal()
Figure 4

We can conclude that, on average, students are slightly less neurotic than lecturers.

Task 5.5

Using the same data, plot and interpret a scatterplot with regression lines of alcohol consumption and neuroticism grouped by lecturer/student.

The code for the plot in Figure 5 is:

ggplot(student_tib, aes(alcohol, neurotic, colour = group)) +
  geom_point() +
  geom_smooth(method = "lm", aes(fill = group), alpha = 0.1) +
  coord_cartesian(xlim = c(0, 30), ylim = c(0, 30)) +
  scale_x_continuous(breaks = seq(0, 30, 5)) +
  scale_y_continuous(breaks = seq(0, 30, 5)) +
  labs(x = "Alcohol consumption per week (units)", y = "Neuroticism", colour = "Group", fill = "Group") +
  theme_minimal()
Figure 5

We can conclude that for lecturers, as neuroticism increases so does alcohol consumption (a positive relationship), but for students the opposite is true, as neuroticism increases alcohol consumption decreases.

Task 5.6

Using the same data, plot and interpret a scatterplot matrix with regression lines of alcohol consumption, neuroticism and number of friends.

The code for the plot in Figure 6 is:

GGally::ggpairs(student_tib, columns = c("alcohol", "neurotic", "friends"))
Figure 6

We can conclude that there is no relationship (flat line) between the number of friends and alcohol consumption; there was a negative relationship between how neurotic a person was and their number of friends; and there was a slight positive relationship between how neurotic a person was and how much alcohol they drank.

Task 5.7

Using the zhang_2013_subsample.csv data from Chapter 4 (see Smart Alex’s task) plot a clustered error bar chart of the mean test accuracy as a function of the type of name participants completed the test under (x-axis) and whether they were male or female (different coloured bars).

First, let’s load the data (see Tip 2)

zhang_tib <- discovr::zhang_sample

To find out more about the data execute ?zhang_sample.

To get the plot in Figure 7 execute:

ggplot(zhang_tib, aes(name_type, accuracy, colour = sex)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.5)) +
  coord_cartesian(ylim = c(0, 80)) +
  scale_y_continuous(breaks = seq(0, 80, 10)) +
  labs(x = "Type of name", y = "Accuracy (%)", colour = "Biological sex") +
  theme_minimal()
Figure 7

The plot shows that, on average, males did better on the test than females when using their own name (the control) but also when using a fake female name. However, for participants who did the test under a fake male name, the women did better than males.

Task 5.8

Using the teaching.csv data from Chapter 4 (Task 3), plot a clustered error line chart of the mean score when electric shocks were used compared to being nice, and plot males and females as different coloured lines.

First, let’s load the data (see Tip 2):

teaching_tib <- discovr::teaching

To find out more about the data execute ?teaching. To get the plot in Figure 8 execute:

ggplot(teaching_tib, aes(method, mark, colour = sex)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.5)) +
  coord_cartesian(ylim = c(0, 20)) +
  scale_y_continuous(breaks = seq(0, 20, 1)) +
  labs(x = "Method of teaching", y = "Mark (out of 20)", colour = "Biological sex") +
  theme_minimal()
Figure 8

We can see that when the being nice method of teaching is used, males and females have comparable scores on their coursework, with females scoring slightly higher than males on average, although their scores are also more variable than the males’ scores (as indicated by the longer error bar). However, when an electric shock is used, males score higher than females but there is more variability in the males’ scores than the females’ for this method (as seen by the longer error bar for males than for females). Additionally, the graph shows that females score higher when the being nice method is used compared to when an electric shock is used, but the opposite is true for males. This suggests that there may be an interaction effect of sex.

Task 5.9

Using the shopping_exercise.csv data from Chapter 4 (Task 5), plot two error bar graphs comparing men and women (x-axis): one for the distance walked, and the other for the time spent shopping.

First, let’s load the data (see Tip 2):

shopping_tib <- discovr::shopping

To find out more about the data execute ?shopping. To get the plot in Figure 9 execute:

ggplot(shopping_tib, aes(sex, distance)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.5)) +
  coord_cartesian(ylim = c(0, 5)) +
  scale_y_continuous(breaks = seq(0, 5, 1)) +
  labs(x = "Biological sex", y = "Distance walked (Miles)") +
  theme_minimal()
Figure 9

Looking at the plot in Figure 9, we can see that, on average, females walk longer distances (probably not significantly so) while shopping than males.

To get the plot in Figure 10 execute:

ggplot(shopping_tib, aes(sex, time)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.5)) +
  coord_cartesian(ylim = c(0, 260)) +
  scale_y_continuous(breaks = seq(0, 260, 20)) +
  labs(x = "Biological sex", y = "Time spent shopping (minutes)") +
  theme_minimal()
Figure 10

Figure 10 shows that, on average, females spend more time shopping than males (but again, probably not significantly so). The females’ scores are more variable than the males’ scores (longer error bar).

Task 5.10

Using the pets.csv data from Chapter 4 (Task 6), plot two error bar charts comparing scores when having a fish or cat as a pet (x-axis): one for the animal liking variable, and the other for life satisfaction.

First, let’s load the data (see Tip 2):

pet_tib <- discovr::pets

To find out more about the data execute ?pets. To get the first plot execute:

ggplot(pet_tib, aes(pet, animal)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.5)) +
  coord_cartesian(ylim = c(0, 60)) +
  scale_y_continuous(breaks = seq(0, 60, 5)) +
  labs(x = "Species of pet", y = "Liking of animals") +
  theme_minimal()
Figure 11

Figure 11 shows that the mean love of animals was the same for people with cats and fish as pets. To get the second plot execute:

ggplot(pet_tib, aes(pet, life_satisfaction)) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.5)) +
  coord_cartesian(ylim = c(0, 80)) +
  scale_y_continuous(breaks = seq(0, 80, 10)) +
  labs(x = "Species of pet", y = "Life satisfaction") +
  theme_minimal()
Figure 12

Figure 12 shows that, on average, life satisfaction was higher for people with a cat as a pet than for those with a fish.

Task 5.11

Using the same data as above, plot a scatterplot of animal liking scores against life satisfaction (plot scores for those with fish and cats in different colours).

The code will be something like:

ggplot(pet_tib, aes(animal, life_satisfaction, colour = pet)) +
  geom_point() +
  geom_smooth(method = "lm", aes(fill = pet), alpha = 0.1) +
  coord_cartesian(xlim = c(0, 75), ylim = c(0, 100)) +
  scale_x_continuous(breaks = seq(0, 75, 5)) +
  scale_y_continuous(breaks = seq(0, 100, 5)) +
  labs(x = "Animal liking", y = "Life satisfaction", colour = "Species of pet",  fill = "Species of pet") +
  theme_minimal()
Figure 13

We can conclude from Figure 13 that, regardless of whether someone had a cat or a fish as a pet, as love of animals increases so does life satisfaction (a positive relationship).

Task 5.12

Using the tea_15.csv data from Chapter 4, plot a scatterplot showing the number of cups of tea drunk (tea) on the x-axis against cognitive functioning (cog_fun) on the y-axis.

First, let’s load the data (see Tip 2):

tea15_tib <- discovr::tea_15

To find out more about the data execute ?tea_15. The code will be something like:

ggplot(tea15_tib, aes(tea, cog_fun)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(0, 6), ylim = c(0, 70)) +
  scale_x_continuous(breaks = seq(0, 6, 1)) +
  scale_y_continuous(breaks = seq(0, 70, 5)) +
  labs(x = "Cups of tea drunk (per day)", y = "Cognitive functioning") +
  theme_minimal()
Figure 14

The scatterplot (and near-flat line especially) in Figure 14 tells us that there is a tiny relationship (practically zero) between the number of cups of tea drunk per day and cognitive function.

Chapter 6

Task 6.1

Using the notebook.csv data from Chapter 5, create and interpret a Q-Q plot for the two films (ignoring sex).

Load the data directly from the discovr package (see Tip 2).

notebook_tib <- discovr::notebook

Figure 15 shows the plot.

ggplot(data = notebook_tib, aes(sample = arousal)) +
  qqplotr::stat_qq_band(fill = "#C4F134FF", alpha = 0.3) +
  qqplotr::stat_qq_line(colour = "#C4F134FF") +
  qqplotr::stat_qq_point(colour = "#3E9BFEFF", alpha = 0.9, size = 1) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles") +
  facet_wrap(~film, ncol = 1, scales = "free") +
  theme_minimal()
Figure 15

For both films the observed quantiles are, on the whole, close to those that would be expected from a normal distribution (i.e. the dots fall within the confidence band around the line).

Task 6.2

The file r_exam.csv contains data on students’ performance on an R exam. Four variables were measured: exam (first-year R exam scores as a percentage), computer (measure of computer literacy in percent), lectures (percentage of statistics lectures attended) and numeracy (a measure of numerical ability out of 15). There is a variable called uni indicating whether the student attended Sussex University (where I work) or Duncetown University. Compute and interpret summary statistics for exam, computer, lectures and numeracy for the sample as a whole.

Load the data (see Tip 2).

rexam_tib <- discovr::r_exam

Table 9 shows the table of descriptive statistics for the four variables in this example. From this table, we can see that, on average, students attended nearly 60% of lectures, obtained 58% in their exam, scored only 51% on the computer literacy test, and only 5 out of 15 on the numeracy test. In addition, the confidence interval for computer literacy was relatively narrow compared to that of the percentage of lectures attended and exam scores.

describe_distribution(x = rexam_tib,
                      select = c(exam, computer, lectures, numeracy),
                      ci = 0.95) |> 
  display()
Table 9
Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
exam 58.10 (53.94, 61.82) 21.32 37.00 (15.00, 99.00) -0.11 -1.11 100 0
computer 50.71 (49.11, 52.09) 8.26 10.75 (27.00, 73.00) -0.17 0.36 100 0
lectures 59.77 (55.21, 63.84) 21.68 28.75 (8.00, 100.00) -0.42 -0.18 100 0
numeracy 4.85 (4.24, 5.37) 2.71 4.00 (1.00, 14.00) 0.96 0.95 100 0

Task 6.3

Produce histograms for each of the four measures in the previous task and interpret them

The data are in messy format, so first let’s make the data tidy:

rexam_tidy_tib <- rexam_tib |> 
  pivot_longer(
    cols = exam:numeracy,
    values_to = "score",
    names_to = "measure"
  )

The data now look like Table 10

DT::datatable(rexam_tidy_tib)
Table 10

To create the plot in Figure 16 use the code below.

ggplot(rexam_tidy_tib, aes(score)) +
  geom_histogram(bins = 20, fill = "#C42503FF", colour = "#C42503FF", alpha = 0.2) +
  labs(y = "Frequency", x = "Score") +
  facet_wrap(~measure, scale = "free") +
  theme_minimal()
Figure 16

The histograms show us several things. The exam scores are very interesting because this distribution is quite clearly not normal; in fact, it looks suspiciously bimodal (there are two peaks, indicative of two modes). It looks as though computer literacy is somewhat normally distributed (a few people are very good with computers and a few are very bad, but the majority of people have a similar degree of knowledge) as is the lecture attendance. Both are a little negatively skewed. Finally, the numeracy test has produced very positively skewed data (the majority of people did very badly on this test and only a few did well).

Descriptive statistics and histograms are a good way of getting an instant picture of the distribution of your data. This snapshot can be very useful: for example, the bimodal distribution of exam scores instantly indicates a trend that students are typically either very good at statistics or struggle with it (there are relatively few who fall in between these extremes). Intuitively, this finding fits with the nature of the subject: statistics is very easy once everything falls into place, but before that enlightenment occurs it all seems hopelessly difficult!

Task 6.4

Produce Q-Q plots for the four measures in Task 2.

Figure 17 was created using the code below.

ggplot(rexam_tidy_tib, aes(sample = score)) +
  qqplotr::stat_qq_band(fill = "#C4F134FF", alpha = 0.3) +
  qqplotr::stat_qq_line(colour = "#C4F134FF") +
  qqplotr::stat_qq_point(colour = "#3E9BFEFF", alpha = 0.5, size = 1) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles") +
  facet_wrap(~measure, scales = "free") +
  theme_minimal()
Figure 17

Task 6.5

Use the results of Tasks 2 and 4 to interpret the skew and kurtosis of each of the four measures.

The skewness values are in the table for Task 2. For all of the measures, skewness is fairly close to the expected value of 0. The exception is the numeracy scores, for which skew is about 1; this is still within acceptable limits. Consistent with this, none of the Q-Q plots show a marked upward or downward bend characteristic of skew.

For kurtosis, note that the table reports excess kurtosis, for which a normal distribution scores 0; adding 3 gives the raw kurtosis values discussed here. Computer literacy and lecture attendance have raw kurtosis around the expected value of 3, and their Q-Q plots show the characteristic patterns of normality.

Exam scores have raw kurtosis below 2, suggesting lighter tails than you’d expect in a normal distribution, which is also shown in the Q-Q plot: observed quantiles at the low end are higher than expected and observed quantiles at the high end are lower than expected.

Numeracy scores have raw kurtosis close to 4, indicating heavier tails than expected in a normal distribution, but the Q-Q plot looks relatively normal (although observed quantiles rise above the normal line at the low end).
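If you want the skew and kurtosis values for a single variable rather than the whole table, you can call the datawizard functions that describe_distribution() relies on directly. A minimal sketch (assumes rexam_tib is loaded; note that kurtosis() returns excess kurtosis, for which a normal distribution scores 0):

```r
# Skewness and excess kurtosis for the numeracy scores only
datawizard::skewness(rexam_tib$numeracy)
datawizard::kurtosis(rexam_tib$numeracy)
```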

Task 6.6

Produce summary statistics for the four measures separately for each university.

Table 11 shows the summary statistics split by university. From this table it is clear that Sussex students scored higher on both their exam and the numeracy test than their Duncetown counterparts. In fact, looking at the means reveals that, on average, Sussex students scored an amazing 36% more on the exam than Duncetown students, and had higher numeracy scores too (what can I say, my students are the best). Skew statistics are all close to the expected value of 0, and most of the kurtosis values are unremarkable (the table reports excess kurtosis, so values near 0 correspond to raw kurtosis near the expected value of 3). The notable exception is the computer literacy scores of the Sussex students, which have a raw kurtosis value above 4, suggesting heavy tails.

rexam_tib |> 
  group_by(uni) |> 
  describe_distribution(select = c(exam, computer, lectures, numeracy),
                      ci = 0.95) |> 
  display()
Table 11
uni Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Duncetown University exam 40.18 (37.34, 42.82) 12.59 17.50 (15.00, 66.00) 0.31 -0.57 50 0
Duncetown University computer 50.26 (48.15, 52.70) 8.07 12.25 (35.00, 67.00) 0.23 -0.51 50 0
Duncetown University lectures 56.26 (48.80, 61.99) 23.77 28.75 (8.00, 100.00) -0.31 -0.38 50 0
Duncetown University numeracy 4.12 (3.63, 4.66) 2.07 3.25 (1.00, 9.00) 0.51 -0.48 50 0
Sussex University exam 76.02 (73.62, 78.46) 10.21 12.25 (56.00, 99.00) 0.27 -0.26 50 0
Sussex University computer 51.16 (49.01, 53.13) 8.51 9.25 (27.00, 73.00) -0.54 1.38 50 0
Sussex University lectures 63.27 (57.45, 67.98) 18.97 30.00 (12.50, 100.00) -0.36 -0.22 50 0
Sussex University numeracy 5.58 (4.82, 6.49) 3.07 5.00 (1.00, 14.00) 0.79 0.26 50 0

Task 6.7

Produce and interpret histograms for the four measures split by the two universities.

ggplot(rexam_tidy_tib, aes(score)) +
  geom_histogram(bins = 20, fill = "#C42503FF", colour = "#C42503FF", alpha = 0.2) +
  labs(y = "Frequency", x = "Score") +
  facet_grid(uni~measure, scale = "free") +
  theme_minimal()
Figure 18

The histograms of these variables split according to the university attended show numerous things. The first interesting thing to note is that for exam marks, the distributions are both fairly normal. This seems odd because the overall distribution was bimodal. However, it starts to make sense when you consider that for Duncetown the distribution is centred around a mark of about 40%, but for Sussex the distribution is centred around a mark of about 76%. This illustrates how important it is to look at distributions within groups. If we were interested in comparing Duncetown to Sussex it wouldn’t matter that overall the distribution of scores was bimodal; all that’s important is that each group comes from a normal population, and in this case it appears to be true. When the two samples are combined, these two normal distributions create a bimodal one (one of the modes being around the centre of the Duncetown distribution, and the other being around the centre of the Sussex data!).

For numeracy scores, the distribution is slightly positively skewed (there is a larger concentration at the lower end of scores) in both the Duncetown and Sussex groups. Therefore, the overall positive skew observed before is due to the mixture of universities.

The remaining distributions look vaguely normal. The only exception is the computer literacy scores for the Sussex students. This is a fairly flat distribution apart from a large peak between 50 and 60%, and it looks potentially slightly heavy-tailed. This suggests positive kurtosis and, in fact, raw kurtosis is above 4 in this group (see previous task).

Chapter 7

Task 7.1

A student was interested in whether there was a positive relationship between the time spent doing an essay and the mark received. He got 45 of his friends and timed how long they spent writing an essay (hours) and the percentage they got in the essay (essay). He also translated these grades into their degree classifications (grade): in the UK, a student can get a first-class mark (the best), an upper-second-class mark, a lower second, a third, a pass or a fail (the worst). Using the data in the file essay_marks.csv find out what the relationship was between the time spent doing an essay and the eventual mark in terms of percentage and degree class (draw a scatterplot too).

Load the data using (see Tip 2):

essay_tib <- discovr::essay_marks

We’re interested in looking at the relationship between hours spent on an essay and the grade obtained. We could create a scatterplot of hours spent on the essay (x-axis) and essay mark (y-axis). In Figure 19 I’ve chosen to highlight the degree classification grades using different colours.

ggplot(essay_tib, aes(hours, essay, colour = grade)) +
  geom_point() +
  geom_smooth(method = "lm", alpha = 0.1) +
  coord_cartesian(xlim = c(0, 15), ylim = c(20, 80)) +
  scale_x_continuous(breaks = seq(0, 15, 1)) +
  scale_y_continuous(breaks = seq(20, 80, 5)) +
  labs(x = "Time spent on essay (hours)", y = "Essay mark (%)", colour = "Grade", fill = "Grade") +
  theme_minimal()
Figure 19

To get the correlation:

essay_r <- correlation(data = essay_tib, select = c("hours", "essay"))

Table 13 indicates that the relationship between time spent writing an essay and the mark awarded was not significant, \(r\) = 0.27 (-0.03, 0.52), t(43) = 1.81, p = 0.077.

display(essay_r)
Table 13: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(43) p
hours essay 0.27 (-0.03, 0.52) 1.81 0.077

p-value adjustment method: Holm (1979) Observations: 45

The second part of the question asks us to do the same analysis but when the percentages are recoded into degree classifications. The degree classifications are ordinal data (not interval): they are ordered categories. So we shouldn’t use Pearson’s test statistic, but Spearman’s and Kendall’s ones instead. grade is a factor:

essay_tib$grade
 [1] Upper second class First class        Third class        First class       
 [5] Lower second class Upper second class Upper second class Upper second class
 [9] Upper second class Third class        Upper second class First class       
[13] First class        Lower second class Lower second class Upper second class
[17] Upper second class Upper second class Upper second class Lower second class
[21] Lower second class First class        Upper second class Upper second class
[25] First class        Upper second class Lower second class Lower second class
[29] Lower second class First class        Upper second class Upper second class
[33] Upper second class First class        First class        Upper second class
[37] Upper second class Upper second class Upper second class Upper second class
[41] Upper second class Upper second class Lower second class Lower second class
[45] First class       
Levels: First class Upper second class Lower second class Third class

We need to convert it into numbers using as.numeric:

essay_tib <- essay_tib |> 
  mutate(
    grade_num = as.numeric(grade)
    )

Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.):

essay_tib |>
  head(15) |> 
  display(digits = 2) 
id essay hours grade grade_num
qxbcjg 61.68 10.63 Upper second class 2
csykgu 69.55 7.29 First class 1
npxgpl 48.23 5.05 Third class 4
gomiwl 70.68 2.89 First class 1
qdywaa 59.90 9.55 Lower second class 3
cmaprb 61.16 11.31 Upper second class 2
nrvhix 67.62 7.47 Upper second class 2
ybytlf 64.78 8.47 Upper second class 2
talwvg 63.21 8.72 Upper second class 2
ceylea 49.69 6.20 Third class 4
berjky 63.72 4.77 Upper second class 2
cjxgpl 72.24 10.98 First class 1
gufenh 70.59 5.22 First class 1
uaxxdm 58.64 6.86 Lower second class 3
unnhpl 58.72 9.80 Lower second class 3

Now the correlation:

grade_r <- correlation(data = essay_tib, select = c("hours", "grade_num"), method = "spearman")

Table 15 shows no significant relationship between degree grade classification for an essay and the time spent doing it, \(r_\text{s}\) = -0.19 (-0.47, 0.12), p = 0.204. Note that the direction of the relationship has reversed. This has happened because the essay marks were recoded as 1 (first), 2 (upper second), 3 (lower second), and 4 (third), so high grades were represented by low numbers.

display(grade_r)
Table 15: Correlation Matrix (spearman-method)
Parameter1 Parameter2 rho 95% CI S p
hours grade_num -0.19 (-0.47, 0.12) 18110.93 0.204

p-value adjustment method: Holm (1979) Observations: 45

Task 7.2

Using the notebook.csv data from Chapter 3, quantify the size of the relationship between the participant’s gender identity and arousal.

Load the data using (see Tip 2):

notebook_tib <- discovr::notebook

gender_identity is a categorical variable with two categories (in this dataset); therefore, we need to quantify this relationship using a point-biserial correlation. First, I’ll recode gender_identity into a new variable called gi_bin that represents ‘Man’ with the number 0 and ‘Woman’ with the number 1.

notebook_tib <- notebook_tib |> 
  mutate(
    gi_bin = ifelse(gender_identity == "Man", 0, 1),
  )

Now the correlation:

notebook_r <- correlation(data = notebook_tib, select = c("arousal", "gi_bin"))

Table 17 shows that there was no significant relationship between gender identity and arousal, \(r_\text{pb}\) = -0.20 (-0.48, 0.12), t(38) = -1.23, p = 0.226.

display(notebook_r)
Table 17: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(38) p
arousal gi_bin -0.20 (-0.48, 0.12) -1.23 0.226

p-value adjustment method: Holm (1979) Observations: 40

Task 7.3

Using the notebook data again, quantify the relationship between the film watched and arousal.

film is a categorical variable with two categories (in this dataset), so as with the previous task I’ll recode it into a new variable, representing ‘The Notebook’ with the number 0 and the documentary about notebooks with the number 1.

notebook_tib <- notebook_tib |> 
  mutate(
    film_bin = ifelse(film == "The Notebook", 0, 1),
  )

Now the correlation:

film_r <- correlation(data = notebook_tib, select = c("arousal", "film_bin"))

Table 18 shows that there was a significant relationship between the film watched and arousal, \(r_\text{pb}\) = -0.86 (-0.93, -0.76), t(38) = -10.61, p < 0.001. Looking at how the groups were coded, you should see that The Notebook had a code of 0, and the documentary about notebooks had a code of 1, therefore the negative coefficient reflects the fact that as film goes up (changes from 0 to 1) arousal goes down. Put another way, as the film changes from The Notebook to a documentary about notebooks, arousal decreases. So The Notebook gave rise to the greater arousal levels.

display(film_r)
Table 18: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(38) p
arousal film_bin -0.86 (-0.93, -0.76) -10.61 < .001***

p-value adjustment method: Holm (1979) Observations: 40
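Because the point-biserial correlation reflects the difference between the group means, we can sanity-check the direction of the effect by comparing mean arousal for the two films. A quick sketch (assumes notebook_tib and the tidyverse are loaded):

```r
# The group with the higher mean arousal should correspond to the
# sign of the point-biserial correlation, given the 0/1 coding of film
notebook_tib |>
  dplyr::group_by(film) |>
  dplyr::summarise(mean_arousal = mean(arousal, na.rm = TRUE))
```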

Task 7.4

As a statistics lecturer I am interested in the factors that determine whether a student will do well on a statistics course. Imagine I took 25 students and looked at their grades for my statistics course at the end of their first year at university: first, upper second, lower second and third class (see Task 1). I also asked these students what grade they got in their high school maths exams. In the UK GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F (an A grade is the best). The data for this study are in the file grades.csv. To what degree does GCSE maths grade correlate with first-year statistics grade?

Load the data using Tip 2. The categories in the data file have a meaningful order, so after importing the data we transform stats and gcse to factors but also order the factor levels in the correct order (this is important!).

grade_tib <- read_csv("../data/grades.csv") |> 
  mutate(
    stats = as_factor(stats) |> fct_relevel("First class", "Upper second class", "Lower second class", "Third class", "Pass", "Fail"),
    gcse = as_factor(gcse) |> fct_relevel("A", "B", "C", "D", "E", "F")
  )

Alternatively, load the data directly from the discovr package:

grade_tib <- discovr::grades

Let’s look at these variables. In the UK, GCSEs are school exams taken at age 16 that are graded A, B, C, D, E or F. These grades are categories that have an order of importance (an A grade is better than all of the lower grades). In the UK, a university student can get a first-class mark, an upper second, a lower second, a third, a pass or a fail. These grades are categories, but they have an order to them (an upper second is better than a lower second). When you have categories like these that can be ordered in a meaningful way, the data are said to be ordinal. The data are not interval, because a first-class degree encompasses a 30% range (70–100%), whereas an upper second only covers a 10% range (60–70%). When data have been measured at only the ordinal level they are said to be non-parametric and Pearson’s correlation is not appropriate. Therefore, the Spearman correlation coefficient is used.

In the file, the scores are in two columns: one labelled stats and one labelled gcse (see Table 19).

DT::datatable(grade_tib)
Table 19: The data in grade_tib

The first thing we need to do is to convert stats and gcse to numbers using as.numeric, which will return the numeric value attached to each factor level:

grade_tib <- grade_tib |> 
  mutate(
    stat_num = as.numeric(stats),
    gcse_num = as.numeric(gcse)
    )
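This works because as.numeric() applied to a factor returns the position of each value in the factor’s level order, which is why ordering the levels with fct_relevel() when loading the data was important. A toy sketch:

```r
# as.numeric() encodes factor values by their position in the level order
toy <- factor(c("B", "A", "C"), levels = c("A", "B", "C"))
as.numeric(toy)  # 2 1 3
```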

Let’s check the numeric grade matches the description (i.e. first = 1, upper second = 2 etc.) by looking at Table 20.

grade_tib |> 
  head(10) |> 
  display()
Table 20: First 10 rows of grade_tib
id stats gcse stat_num gcse_num
wbmew First class A 1 1
ypusy First class A 1 1
sbumv First class C 1 3
wqicp Upper second class C 2 3
xnobi Upper second class C 2 3
glxdw Upper second class D 2 4
eomcn Upper second class E 2 5
bjhcl Lower second class A 3 1
wcype Lower second class B 3 2
grkvu Lower second class C 3 3

Now the correlation

statgrade_r <- correlation(data = grade_tib, select = c("stat_num", "gcse_num"), method = "spearman")

Table 21 shows the result. In the question I predicted that better grades in GCSE maths would correlate with better degree grades for my statistics course. This hypothesis is directional and so a one-tailed test could be selected; however, in the chapter I advised against one-tailed tests so I have done two-tailed. We could report these results as follows:

  • There was a significant positive relationship between a person’s statistics grade and their GCSE maths grade, \(r_\text{s}\) = 0.45 (0.06, 0.73), p = 0.022.
display(statgrade_r)
Table 21: Correlation Matrix (spearman-method)
Parameter1 Parameter2 rho 95% CI S p
stat_num gcse_num 0.45 (0.06, 0.73) 1418.03 0.022*

p-value adjustment method: Holm (1979) Observations: 25

Task 7.5

In Figure 3.4 we saw some data relating to people’s ratings of dishonest acts and the likeableness of the perpetrator (for a full description see Jane Superbrain Box 3.1). Compute the Spearman correlation between ratings of dishonesty and likeableness of the perpetrator. The data are in honesty_lab.csv.

Load the data using (see Tip 2):

honesty_tib <- discovr::honesty_lab

Estimate the correlation:

honest_r <- correlation(data = honesty_tib, select = c("deed", "likeableness"), method = "spearman")

Table 22 shows the result. The relationship between ratings of dishonesty and likeableness of the perpetrator was significant because the p-value is less than 0.05 (p < 0.001). The value of Spearman’s correlation coefficient is quite large and positive, indicating a large positive effect: the more likeable the perpetrator was, the more positively their dishonest acts were viewed.

We could report the results as:

  • There was a large and significant positive relationship between the likeableness of a perpetrator and how positively their dishonest acts were viewed, \(r_\text{s}\) = 0.84 (0.77, 0.89), p < 0.001.
display(honest_r)
Table 22: Correlation Matrix (spearman-method)
Parameter1 Parameter2 rho 95% CI S p
deed likeableness 0.84 (0.77, 0.89) 26054.24 < .001***

p-value adjustment method: Holm (1979) Observations: 100

Task 7.6

In Chapter 4 (Task 6) we looked at data from people who had fish or cats as pets and measured their life satisfaction and, also, how much they like animals (pets.csv). Is there a significant correlation between life satisfaction and the type of animal the person had as a pet?

Load the data using (see Tip 2):

pet_tib <- discovr::pets

pet is a categorical variable with two categories (cat or fish). Therefore, we need to look at this relationship using a point-biserial correlation. The code below creates the variable pet_bin within pet_tib by recoding ‘Fish’ to 0 and ‘Cat’ to 1. The correlation coefficient is then estimated using these numeric values.

pet_tib <- pet_tib |> 
  mutate(
    pet_bin = ifelse(pet == "Fish", 0, 1),
  )

pet_pb <- correlation(data = pet_tib, select = c("life_satisfaction", "pet_bin"))
display(pet_pb)
Table 23: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(18) p
life_satisfaction pet_bin 0.63 (0.26, 0.84) 3.45 0.003**

p-value adjustment method: Holm (1979) Observations: 20

I used a two-tailed test because one-tailed tests should never really be used (see book chapter for more explanation). As you can see (Table 23) there was a significant relationship between type of pet and life satisfaction because our p-value is less than 0.05, \(r_\text{pb}\) = 0.63 (0.26, 0.84), t(18) = 3.45, p = 0.003. Looking at how the groups were coded, you should see that fish had a code of 0 and cat had a code of 1, therefore this result reflects the fact that as pet ‘goes up’ one unit (changes from 0 to 1) life satisfaction goes up. Put another way, as the type of pet changes from fish to cat, life satisfaction increases. So, having a cat as a pet was associated with greater life satisfaction than having a fish.

Task 7.7

Repeat the analysis above, taking account of animal liking when computing the correlation between life satisfaction and the type of pet.

We can conduct a partial correlation between life satisfaction and the type of pet while ‘adjusting’ for the effect of liking animals. Remember that we added the variable pet_bin to pet_tib in the previous solution. The results are in Table 24. Notice that the partial correlation between pet and life satisfaction is 0.70, which is greater than the correlation when the effect of animal liking is not adjusted for (r = 0.630 from the previous task). The correlation has become more statistically significant (its p-value has decreased from 0.003 to 0.002).

Running this analysis has shown us that type of pet alone explains a large portion of the variation in life satisfaction. In other words, the relationship between pet and life satisfaction is not due to animal liking.

correlation(data = pet_tib,
            select = c("life_satisfaction", "pet_bin", "animal"),
            partial = TRUE) |> 
  display(digits = 3)
Table 24: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(18) p
life_satisfaction pet_bin 0.701 (0.375, 0.873) 4.173 0.002**
life_satisfaction animal 0.615 (0.236, 0.831) 3.306 0.008**
pet_bin animal -0.399 (-0.715, 0.053) -1.846 0.081

p-value adjustment method: Holm (1979) Observations: 20

Task 7.8

In Chapter 4 (Task 7) we looked at data based on findings that the number of cups of tea drunk was related to cognitive functioning (Feng et al., 2010). The data are in the file tea_15.csv. What is the correlation between tea drinking and cognitive functioning? Is there a significant effect?

Load the data using (see Tip 2):

tea15_tib <- discovr::tea_15

Because the numbers of cups of tea and cognitive function are both interval variables, we can conduct a Pearson’s correlation coefficient.

tea15_r <- correlation(data = tea15_tib, select = c("tea", "cog_fun"))

Table 25 shows the result. The relationship between number of cups of tea drunk per day and cognitive function was not significant, \(r\) = 0.08 (-0.45, 0.57), t(13) = 0.28, p = 0.783. We can tell this because our p-value is greater than 0.05. Note, the correlation itself is small.

display(tea15_r)
Table 25: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(13) p
tea cog_fun 0.08 (-0.45, 0.57) 0.28 0.783

p-value adjustment method: Holm (1979) Observations: 15

Task 7.9

The research in the previous task was replicated but in a larger sample (N = 716), which is the same as the sample size in Feng et al.’s research (tea_716.csv). Conduct a correlation between tea drinking and cognitive functioning. Compare the correlation coefficient and significance in this large sample, with the previous task. What statistical point do the results illustrate?

Load the data using (see Tip 2):

tea716_tib <- discovr::tea_716

Because the numbers of cups of tea and cognitive function are both interval variables, we can conduct a Pearson’s correlation coefficient.

tea716_r <- correlation(data = tea716_tib, select = c("tea", "cog_fun"))

Table 26 shows the result. We can see that although the value of Pearson’s r has not changed, the relationship between the number of cups of tea drunk per day and cognitive function is now just significant, \(r\) = 0.08 (0.00, 0.15), t(714) = 2.08, p = 0.038 (p is now less than 0.05). This example illustrates one of the downfalls of significance testing: with a large enough sample you can get a significant result even when the effect is very small. Basically, whether you get a significant result depends as much on the sample size as on the size of the effect.

display(tea716_r)
Table 26: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(714) p
tea cog_fun 0.08 (4.41e-03, 0.15) 2.08 0.038*

p-value adjustment method: Holm (1979) Observations: 716
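The point about sample size follows directly from the test statistic for a correlation, t = r√(n − 2)/√(1 − r²): for a fixed r, t grows with √n. A quick sketch (r is rounded to 0.08 here, so the results only approximate the t-values in the tables):

```r
# t statistic for testing a correlation coefficient against zero
r_to_t <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)
r_to_t(0.08, n = 15)   # small sample (Task 7.8): nowhere near significance
r_to_t(0.08, n = 716)  # same r, large sample: t exceeds the critical value of about 1.96
```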

Task 7.10

In Chapter 6 we looked at hygiene scores over three days of a rock music festival (download_festival.csv). Using Spearman’s correlation, were hygiene scores on day 1 of the festival significantly correlated with those on day 3?

Load the data using (see Tip 2):

download_tib <- discovr::download

Estimate the correlation:

download_r <- correlation(data = download_tib, select = c("day_1", "day_3"), method = "spearman")

Table 27 shows the result. The hygiene scores on day 1 of the festival correlated significantly with hygiene scores on day 3, \(r_\text{s}\) = 0.34 (0.17, 0.50), p < 0.001. The value of Spearman’s correlation coefficient is a positive value suggesting that the smellier you are on day 1, the smellier you will be on day 3.

display(download_r)
Table 27: Correlation Matrix (spearman-method)
Parameter1 Parameter2 rho 95% CI p
day_1 day_3 0.34 (0.17, 0.50) < .001***

p-value adjustment method: Holm (1979)

Task 7.11

Using the data in shopping_exercise.csv (Chapter 4, Task 5), find out if there is a significant relationship between the time spent shopping and the distance covered.

Load the data using (see Tip 2):

shopping_tib <- discovr::shopping

The variables time and distance are both interval. Therefore, we can conduct a Pearson’s correlation. I chose a two-tailed test because it is never really appropriate to conduct a one-tailed test (see the book chapter).

shop_r <- correlation(data = shopping_tib, select = c("distance", "time"))

Table 28 shows that there was a significant positive relationship between time spent shopping and distance covered, \(r\) = 0.83 (0.42, 0.96), t(8) = 4.21, p = 0.003. We can tell that the relationship was significant because the p-value is smaller than 0.05. Also, our value for Pearson’s r is very large.

display(shop_r)
Table 28: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(8) p
distance time 0.83 (0.42, 0.96) 4.21 0.003**

p-value adjustment method: Holm (1979) Observations: 10

Task 7.12

What effect does accounting for the participant’s biological sex have on the relationship between the time spent shopping and the distance covered?

To answer this question, we need to conduct a partial correlation between the time spent shopping (interval variable) and the distance covered (interval variable) while ‘adjusting’ for the effect of sex (dichotomous variable). First, let’s turn sex into a numeric variable:

shopping_tib <- shopping_tib |> 
  mutate(
    sex_bin = ifelse(sex == "Male", 0, 1),
  )

The data look like this:

shopping_tib |> 
  head(10) |> 
  display()
Table 29: First 10 rows of shopping_tib
sex distance time sex_bin
Male 0.16 15 0
Male 0.40 30 0
Male 1.36 37 0
Male 1.99 65 0
Male 3.61 103 0
Female 1.40 22 1
Female 1.81 140 1
Female 1.96 160 1
Female 3.02 183 1
Female 4.82 245 1

Now let’s estimate the partial correlation:

shop_rp <- correlation(data = shopping_tib, select = c("distance", "time", "sex_bin"), partial = TRUE)

Table 30 shows that there was still a significant positive relationship between time spent shopping and distance covered, \(r\) = 0.82 (0.39, 0.96), t(8) = 4.06, p = 0.011. The partial correlation between time and distance is 0.82, which is slightly smaller than the correlation when the effect of sex is not adjusted for (r = 0.83). The correlation has become slightly less statistically significant (its p-value has increased from p = 0.003 to p = 0.011). However, not a lot has changed, so even after adjusting for biological sex, the relationship between distance covered and time spent shopping is very strong.

display(shop_rp)
Table 30: Correlation Matrix (pearson-method)
Parameter1 Parameter2 r 95% CI t(8) p
distance time 0.82 (0.39, 0.96) 4.06 0.011*
distance sex_bin -0.35 (-0.80, 0.36) -1.06 0.320
time sex_bin 0.64 (0.02, 0.91) 2.38 0.089

p-value adjustment method: Holm (1979) Observations: 10
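
Under the hood, a partial correlation is just the correlation between the residuals of each variable after the third variable has been regressed out. Here is a minimal base-R sketch of that idea, checked against the standard first-order partial correlation formula. All the variables below are simulated stand-ins, not the shopping data:

```r
# Partial correlation as a correlation between residuals
# (simulated stand-in variables, not the shopping data)
set.seed(5)
sex_bin  <- rep(0:1, each = 20)
time     <- 50 + 60 * sex_bin + rnorm(40, sd = 30)
distance <- 0.02 * time + rnorm(40, sd = 0.8)

# Regress sex out of each variable, then correlate what's left over
r_partial <- cor(residuals(lm(distance ~ sex_bin)),
                 residuals(lm(time ~ sex_bin)))

# The same value from the first-order partial correlation formula
r_xy <- cor(distance, time)
r_xz <- cor(distance, sex_bin)
r_yz <- cor(time, sex_bin)
r_formula <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
```

The `correlation()` function with `partial = TRUE` does the equivalent computation for every pair of variables.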

Chapter 8

Task 8.1

In Chapter 7 (Task 9) we looked at data based on findings that the number of cups of tea drunk was related to cognitive functioning (Feng et al., 2010). Using a linear model that predicts cognitive functioning from tea drinking, what would cognitive functioning be if someone drank 10 cups of tea? Is there a significant effect? (tea_716.csv)

Load the data using (see Tip 2):

tea716_tib <- discovr::tea_716

Fit the model and display the parameter estimates

tea_lm <- lm(cog_fun ~ tea, data = tea716_tib, na.action = na.exclude)

model_parameters(tea_lm) |> 
  display()
Table 31
Parameter Coefficient SE 95% CI t(714) p
(Intercept) 49.22 0.76 (47.72, 50.72) 64.38 < .001
tea 0.46 0.22 (0.03, 0.89) 2.08 0.038

Looking at Table 31, we can see that we have a model that significantly improves our ability to predict cognitive functioning. The positive beta value (0.46) indicates a positive relationship between number of cups of tea drunk per day and level of cognitive functioning, in that the more tea drunk, the higher your level of cognitive functioning. We can then use the model to predict level of cognitive functioning after drinking 10 cups of tea per day. The first stage is to define the model by replacing the b-values in the equation below with the values from Table 31. In addition, we can replace the X and Y with the variable names so that the model becomes:

\[ \begin{aligned} \text{Cognitive functioning}_i &= b_0 + b_1 \text{Tea drinking}_i \\\\ &= 49.22 +(0.46 \times \text{Tea drinking}_i) \end{aligned} \]

We can predict cognitive functioning, by replacing Tea drinking in the equation with the value 10:

\[ \begin{aligned} \text{Cognitive functioning}_i &= 49.22 +(0.460 \times \text{Tea drinking}_i) \\ &= 49.22 +(0.460 \times 10) \\ &= 53.82 \end{aligned} \]

Therefore, if you drank 10 cups of tea per day, your level of cognitive functioning would be 53.82.
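
As a quick sanity check of the hand calculation, we can plug the estimates from Table 31 into base R:

```r
# Reproduce the hand calculation using the estimates from Table 31
b0 <- 49.22   # intercept
b1 <- 0.46    # slope for tea
cups <- 10
cog_fun_hat <- b0 + b1 * cups
cog_fun_hat   # 53.82
```

With the fitted model object you could equivalently run `predict(tea_lm, newdata = tibble::tibble(tea = 10))`, which uses the unrounded coefficients and so may differ very slightly from 53.82.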

Task 8.2

Estimate a linear model for the pubs.csv data in Jane Superbrain Box 8.1 predicting mortality from the number of pubs. Try repeating the analysis but bootstrapping the confidence intervals.

Load the data using (see Tip 2):

pubs_tib <- discovr::pubs

Fit the model

pub_lm <- lm(mortality ~ pubs, data = pubs_tib, na.action = na.exclude)

Table 33 shows the parameter estimates. Looking at the output, we can see that the number of pubs significantly predicts mortality, \(\hat{b}\) = 14.34 (3.82, 24.86), t(6) = 3.33, p = 0.016. The positive beta value indicates a positive relationship between number of pubs and death rate in that, the more pubs in an area, the higher the rate of mortality (as we would expect). The value of \(R^2\) (Table 32) tells us that number of pubs accounts for 64.9% of the variance in mortality rate – that’s over half!

model_performance(pub_lm) |> 
  display()
Table 32
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
146.9 152.9 147.1 0.65 0.59 1614.64 1864.43
model_parameters(pub_lm) |> 
  display()
Table 33
Parameter Coefficient SE 95% CI t(6) p
(Intercept) 3351.96 781.24 (1440.34, 5263.57) 4.29 0.005
pubs 14.34 4.30 (3.82, 24.86) 3.33 0.016

The bootstrapped confidence interval for the number of pubs (Table 34) contains only positive values – it does not cross zero – so, assuming this interval is one of the 95% that contain the population value, we can be confident that there is a positive, non-zero relationship between the number of pubs in an area and its mortality rate.

model_parameters(pub_lm, bootstrap = TRUE) |> 
  display()
Table 34
Parameter Coefficient 95% CI p
(Intercept) 2861.83 (-1.93e-12, 4794.23) 0.504
pubs 14.99 (10.72, 100.00) < .001
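
To demystify what `bootstrap = TRUE` is doing, here is a minimal base-R sketch of a percentile bootstrap for a regression slope. The data are simulated stand-ins (with one extreme x-value, mimicking the kind of outlier the pubs example contains), not the pubs data themselves:

```r
# Percentile bootstrap of a slope, by hand (simulated stand-in data)
set.seed(42)
n <- 8
pubs_x      <- c(10, 20, 30, 40, 50, 60, 70, 500)   # one extreme value
mortality_y <- 3000 + 15 * pubs_x + rnorm(n, sd = 800)
dat <- data.frame(pubs_x, mortality_y)

boot_slopes <- replicate(2000, {
  idx <- sample(n, replace = TRUE)                   # resample rows with replacement
  coef(lm(mortality_y ~ pubs_x, data = dat[idx, ]))["pubs_x"]
})
quantile(boot_slopes, c(0.025, 0.975))               # percentile 95% CI
```

Because the resampled intervals come from the data rather than a normality assumption, they can be asymmetric, which is exactly what we see in Table 34.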

Task 8.3

In Jane Superbrain Box 2.1 we encountered data (honesty_lab.csv) relating to people’s ratings of dishonest acts and the likeableness of the perpetrator. Run a linear model with bootstrapping to predict ratings of dishonesty from the likeableness of the perpetrator.

Load the data using (see Tip 2):

honesty_tib <- discovr::honesty_lab

Fit the model

honest_lm <- lm(deed ~ likeableness, data = honesty_tib, na.action = na.exclude)

Looking at Table 36 we can see that the likeableness of the perpetrator significantly predicts ratings of dishonest acts, \(\hat{b}\) = 0.94 (0.81, 1.07), t(98) = 14.80, p < 0.001. The positive beta value indicates a positive relationship between likeableness of the perpetrator and ratings of dishonesty, in that, the more likeable the perpetrator, the more positively their dishonest acts were viewed (remember that dishonest acts were measured on a scale from 0 = appalling behaviour to 10 = it’s OK really). The value of \(R^2\) (Table 35) tells us that likeableness of the perpetrator accounts for 69.1% of the variance in the rating of dishonesty, which is over half.

model_performance(honest_lm) |> 
  display()
Table 35
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
324.8 325.1 332.6 0.69 0.69 1.19 1.20
model_parameters(honest_lm) |> 
  display()
Table 36
Parameter Coefficient SE 95% CI t(98) p
(Intercept) -1.86 0.33 (-2.51, -1.21) -5.67 < .001
likeableness 0.94 0.06 (0.81, 1.07) 14.80 < .001

The bootstrapped confidence intervals (Table 37) do not cross zero, so, assuming these intervals are among the 95% that contain the population value, we can be confident that there is a non-zero relationship between the likeableness of the perpetrator and ratings of dishonest acts.

model_parameters(honest_lm, bootstrap = TRUE) |> 
  display()
Table 37
Parameter Coefficient 95% CI p
(Intercept) -1.86 (-2.45, -1.31) < .001
likeableness 0.94 (0.81, 1.06) < .001

Task 8.4

A fashion student was interested in factors that predicted the salaries of catwalk models. She collected data from 231 models (supermodel.csv). For each model she asked them their salary per day (salary), their age (age), their length of experience as models (years) and their industry status as a model as reflected in their percentile position rated by a panel of experts (status). Use a linear model to see which variables predict a model’s salary. How valid is the model?

Load the data using (see Tip 2):

model_tib <- discovr::supermodel

Fit the model:

model_lm <- lm(salary ~ age + years + status, data = model_tib, na.action = na.exclude)

To begin with, a sample size of 231 with three predictors seems reasonable because this would easily detect medium to large effects (see the diagram in the chapter). Overall, the model accounts for 18.4% of the variance in salaries and is a significant fit to the data, F(3, 227) = 17.07, p < 0.001. The adjusted \(R^2\) (0.17) shows some shrinkage from the unadjusted value (0.18), indicating that the model may not generalize well (Table 38, Table 39).

Table 40 shows the parameter estimates. It seems as though salaries are significantly predicted by the age of the model. This is a positive relationship (look at the sign of the \(\hat{b}\)), indicating that as age increases, salaries increase too. The number of years spent as a model also seems to significantly predict salaries, but this is a negative relationship indicating that the more years you’ve spent as a model, the lower your salary. This finding seems very counter-intuitive, but we’ll come back to it later. Finally, the status of the model doesn’t seem to predict salaries significantly.

model_performance(model_lm) |> 
  display()
Table 38
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
1899.3 1899.5 1916.5 0.18 0.17 14.45 14.57
test_wald(model_lm) |> 
  display()
Table 39
Name Model df df_diff F p
Null model lm 230
Full model lm 227 3 17.07 < .001

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

model_parameters(model_lm) |> 
  display()
Table 40
Parameter Coefficient SE 95% CI t(227) p
(Intercept) -60.89 16.50 (-93.40, -28.38) -3.69 < .001
age 6.23 1.41 (3.45, 9.02) 4.42 < .001
years -5.56 2.12 (-9.74, -1.38) -2.62 0.009
status -0.20 0.15 (-0.50, 0.10) -1.29 0.199

The next part of the question asks whether this model is valid. Figure 20 shows the model residuals.

  • The scatterplot of fitted values vs residuals does not show a random pattern. There is a distinct funnelling, and the trend line is not flat, indicating heteroscedasticity.
  • The normal Q-Q plot shows a highly non-normal distribution of residuals because the dots deviate considerably from the straight line (which indicates what you’d get from normally distributed residuals).
  • For the age and experience variables in the model, VIF values are above 10, indicating multicollinearity in the data. In fact, the correlation between these two variables is around .9! So, these two variables are measuring very similar things. Of course, this makes perfect sense because the older a model is, the more years she would’ve spent modelling! So, it was fairly daft to measure both of these things! This also explains the weird result that the number of years spent modelling negatively predicted salary (i.e. more experience = less salary!): in fact if you fit a model with experience as the only predictor of salary you’ll find it has the expected positive relationship. This hopefully demonstrates why multicollinearity can bias the regression model.

All in all, several assumptions have not been met and so this model is probably fairly unreliable.

check_model(model_lm)
Figure 20
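
The VIF values reported by check_model() can also be obtained directly with performance::check_collinearity(model_lm). To show where a VIF comes from, here is a base-R sketch on simulated data (hypothetical age/years variables, not the supermodel data): the VIF for a predictor is \(1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing that predictor on the other predictors.

```r
# Why collinear predictors inflate VIFs (simulated stand-in data)
set.seed(1)
age    <- rnorm(100, mean = 20, sd = 3)
years  <- age - 16 + rnorm(100, sd = 0.5)  # almost a copy of age
salary <- 10 + 2 * age + rnorm(100, sd = 5)

m <- lm(salary ~ age + years)

# VIF for age = 1 / (1 - R^2 from regressing age on the other predictor(s))
r2_age  <- summary(lm(age ~ years))$r.squared
vif_age <- 1 / (1 - r2_age)
vif_age   # well above the common cut-off of 10
```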

Task 8.5

A study was carried out to explore the relationship between aggression and several potential predicting factors in 666 children who had an older sibling. Variables measured were parenting_style (high score = bad parenting practices), computer_games (high score = more time spent playing computer games), television (high score = more time spent watching television), diet (high score = the child has a good diet low in harmful additives) and sibling_aggression (high score = more aggression seen in their older sibling). Past research indicated that parenting style and sibling aggression were good predictors of the level of aggression in the younger child. All other variables were treated in an exploratory fashion. Analyse them with a linear model (child_aggression.csv).

Load the data using (see Tip 2):

aggress_tib <- discovr::child_aggression

Fit the models. We need to conduct this analysis hierarchically, entering parenting_style and sibling_aggression in the first step and the remaining variables in a second step.

aggress_lm <- lm(aggression ~ parenting_style + sibling_aggression, data = aggress_tib, na.action = na.exclude)
aggress_full_lm <- update(aggress_lm, .~. + television + computer_games + diet)

The second model is a significantly better fit than the first, F(3, 660) = 7.03, p < 0.001, so we’ll retain that model (Table 41). \(R^2\) is the squared correlation between the observed values of aggression and the values of aggression predicted by the model (Table 42). With all five of these predictors in the model only 8.3% of the variance in aggression can be explained.

test_wald(aggress_lm, aggress_full_lm) |> 
  display()
Table 41
Name Model df df_diff F p
aggress_lm lm 663
aggress_full_lm lm 660 3 7.03 < .001

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

model_performance(aggress_full_lm) |> 
  display()
Table 42
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
325.4 325.6 356.9 0.08 0.08 0.31 0.31

Table 43 shows the parameter estimates. Based on the final model (which is actually all we’re interested in) the following variables predict aggression:

  • Parenting style significantly predicted aggression, \(\hat{b}\) = 0.06 (0.03, 0.09), t(660) = 3.89, p < 0.001. The beta value indicates that as bad parenting practices increase, aggression increases also.
  • Sibling aggression significantly predicted aggression, \(\hat{b}\) = 0.08 (0.01, 0.16), t(660) = 2.11, p = 0.036. The beta value indicates that as sibling aggression increases, aggression increases also.
  • Television did not significantly predict aggression, \(\hat{b}\) = 0.03 (-0.06, 0.12), t(660) = 0.71, p = 0.475.
  • Computer games significantly predicted aggression, \(\hat{b}\) = 0.14 (0.07, 0.21), t(660) = 3.85, p < 0.001. The beta value indicates that as the time spent playing computer games increases, aggression increases also.
  • Good diet significantly predicted aggression, \(\hat{b}\) = -0.11 (-0.18, -0.03), t(660) = -2.86, p = 0.004. The beta value indicates that as diet improves, aggression decreases.
model_parameters(aggress_full_lm) |>
  display()
Table 43
Parameter Coefficient SE 95% CI t(660) p
(Intercept) -4.99e-03 0.01 (-0.03, 0.02) -0.42 0.677
parenting style 0.06 0.01 (0.03, 0.09) 3.89 < .001
sibling aggression 0.08 0.04 (5.54e-03, 0.16) 2.11 0.036
television 0.03 0.05 (-0.06, 0.12) 0.71 0.475
computer games 0.14 0.04 (0.07, 0.21) 3.85 < .001
diet -0.11 0.04 (-0.18, -0.03) -2.86 0.004

Figure 21 shows the residual plots.

  • The scatterplot of predicted values vs residuals helps us to assess both homoscedasticity and independence of errors. The plot does show a random scatter, but the trend line is not completely flat. It would be useful to run a robust model. Using robust standard errors, sibling aggression is no longer significant (Table 44).
  • The Q-Q plot suggests that errors may well deviate from normally distributed.
  • The influence plot looks fine. There are no extreme Cook’s distances and the trend line is fairly flat.
check_model(aggress_full_lm)
Figure 21
model_parameters(aggress_full_lm, vcov = "HC4") |>
  display()
Table 44
Parameter Coefficient SE 95% CI t(660) p
(Intercept) -4.99e-03 0.01 (-0.03, 0.02) -0.42 0.675
parenting style 0.06 0.02 (0.02, 0.09) 2.93 0.003
sibling aggression 0.08 0.05 (-7.26e-03, 0.17) 1.80 0.072
television 0.03 0.06 (-0.08, 0.15) 0.55 0.581
computer games 0.14 0.05 (0.05, 0.24) 2.96 0.003
diet -0.11 0.05 (-0.21, -0.01) -2.19 0.029
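
The vcov = "HC4" argument requests heteroscedasticity-consistent (‘sandwich’) standard errors. As a rough illustration of the idea, here is a base-R sketch of HC0, the simplest member of that family, on simulated data whose error variance grows with the predictor (none of this uses the child aggression data):

```r
# Hand-rolled HC0 sandwich standard errors (simulated stand-in data)
set.seed(7)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5 + x)  # heteroscedastic errors
m <- lm(y ~ x)

X <- model.matrix(m)
u <- residuals(m)
bread <- solve(crossprod(X))               # (X'X)^-1
meat  <- crossprod(X * u)                  # X' diag(u^2) X
vcov_hc0  <- bread %*% meat %*% bread      # the sandwich estimator
robust_se <- sqrt(diag(vcov_hc0))          # robust SEs for (Intercept), x
```

HC4 refines this by rescaling the squared residuals according to each case’s leverage; model_parameters() delegates that computation to the {sandwich} package.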

Task 8.6

Repeat the analysis in Labcoat Leni’s Real Research 8.1 using bootstrapping for the confidence intervals. What are the confidence intervals for the regression parameters?

To recap, the models are fit as follows (see also the Labcoat Leni answers). Load the data using (see Tip 2):

ong_tib <- discovr::ong_2011

Facebook status update frequency

Fit the models that look at whether narcissism predicts, above and beyond the other variables, the frequency of status updates.

ong_base_lm <- lm(status ~ sex + age + grade, data = ong_tib, na.action = na.exclude)
ong_ext_lm <- update(ong_base_lm, .~. + extraversion)
ong_ncs_lm <- update(ong_ext_lm, .~. + narcissism)

Table 45 compares the models and shows that both adding extraversion and narcissism significantly improve the fit of the model.

test_wald(ong_base_lm, ong_ext_lm, ong_ncs_lm) |>
  display()
Table 45
Name Model df df_diff F p
ong_base_lm lm 246
ong_ext_lm lm 245 1 4.02 0.046
ong_ncs_lm lm 244 1 8.98 0.003

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

Table 47 shows the parameter estimates and bootstrap confidence intervals and significance tests. The main benefit of the bootstrap confidence intervals and significance values is that they do not rely on assumptions of normality or homoscedasticity, so they give us an accurate estimate of the true population value of b for each predictor. The bootstrapped confidence intervals in the output do not affect the conclusions reported in Ong et al. (2011). Ong et al.’s prediction was still supported in that, after controlling for age, grade and sex, narcissism significantly predicted the frequency of Facebook status updates over and above extroversion, \(\hat{b}\) = 0.07 (0.02, 0.11), p < 0.001.

model_parameters(model = ong_ncs_lm, bootstrap = TRUE) |> 
  display()
Table 47
Parameter Coefficient 95% CI p
(Intercept) 0.03 (-5.77, 5.59) 0.994
sex (Male) -0.96 (-1.60, -0.36) < .001
age -0.01 (-0.34, 0.39) 0.924
grade (Sec 2) -0.42 (-1.30, 0.55) 0.384
grade (Sec 3) -1.04 (-2.16, -2.20e-03) 0.050
extraversion 0.01 (-0.05, 0.07) 0.650
narcissism 0.07 (0.02, 0.11) < .001

Facebook profile picture rating

Fit the models that look at whether narcissism predicts, above and beyond the other variables, the Facebook profile picture ratings.

prof_base_lm <- lm(profile ~ sex + age + grade, data = ong_tib, na.action = na.exclude)
prof_ext_lm <- update(prof_base_lm, .~. + extraversion)
prof_ncs_lm <- update(prof_ext_lm, .~. + narcissism)

Table 48 compares the models and shows that both adding extraversion and narcissism significantly improve the fit of the model.

test_wald(prof_base_lm, prof_ext_lm, prof_ncs_lm) |>
  display()
Table 48
Name Model df df_diff F p
prof_base_lm lm 188
prof_ext_lm lm 187 1 29.61 < .001
prof_ncs_lm lm 186 1 21.60 < .001

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

Table 50 shows the parameter estimates and bootstrap confidence intervals and significance tests. The bootstrapped confidence intervals for the final model are consistent with the conclusions reported in Ong et al. (2011). That is, after adjusting for age, grade and sex, narcissism significantly predicted the Facebook profile picture ratings over and above extroversion, \(\hat{b}\) = 0.17 (0.10, 0.24), p < 0.001.

model_parameters(model = prof_ncs_lm, bootstrap = TRUE) |> 
  display()
Table 50
Parameter Coefficient 95% CI p
(Intercept) -3.79 (-19.80, 8.67) 0.612
sex (Male) 0.64 (-0.53, 1.90) 0.284
age 0.35 (-0.52, 1.45) 0.480
grade (Sec 2) -0.56 (-2.21, 1.00) 0.472
grade (Sec 3) -0.59 (-2.94, 1.50) 0.572
extraversion 0.11 (0.02, 0.21) 0.014
narcissism 0.17 (0.10, 0.24) < .001

Task 8.7

Coldwell et al. (2006) investigated whether household chaos predicted children’s problem behaviour over and above parenting. From 118 families they recorded the age and gender of the youngest child (child_age and child_gender). They measured dimensions of the child’s perceived relationship with their mum: (1) warmth/enjoyment (child_warmth) and (2) anger/hostility (child_anger). Higher scores indicate more warmth/enjoyment and anger/hostility, respectively. They measured the mum’s perceived relationship with her child, resulting in dimensions of positivity (mum_pos) and negativity (mum_neg). Household chaos (chaos) was assessed. The outcome variable was the child’s adjustment (sdq): the higher the score, the more problem behaviour the child was reported to be displaying. Conduct a hierarchical linear model in three steps: (1) enter child age and gender; (2) add the variables measuring parent–child positivity, parent–child negativity, parent–child warmth, parent–child anger; and (3) add chaos. Is household chaos predictive of children’s problem behaviour over and above parenting? (coldwell_2006.csv).

Load the data using (see Tip 2):

chaos_tib <- discovr::coldwell_2006

Fit the models

# Model 1: sdq predicted from child_age and child_gender:
chaos_base_lm <- lm(sdq ~ child_age + child_gender, data = chaos_tib, na.action = na.exclude)

# Model 2: In a new block, add child_anger, child_warmth, mum_pos and mum_neg into the model:
chaos_emo_lm <- update(chaos_base_lm, .~. + child_anger + child_warmth + mum_pos + mum_neg)

# Model 3: In a final block, add chaos to the model:
chaos_chaos_lm <- update(chaos_emo_lm, .~. + chaos)

Table 51 shows the fit statistics. Note that \(R^2\) increases as new predictors are added (although the variance explained is small).

WarningModel comparison

We can’t compare the models using test_wald() because they’re based on different amounts of data because of missing values.

compare_performance(chaos_base_lm, chaos_emo_lm, chaos_chaos_lm) |> 
  display()
Table 51
Name Model R2 R2 (adj.) RMSE Sigma
chaos_base_lm lm 9.04e-04 -8.61e-03 0.17 0.17
chaos_emo_lm lm 0.06 8.31e-03 0.17 0.17
chaos_chaos_lm lm 0.10 0.04 0.16 0.17

Table 52 shows the parameter estimates. (Note that in the paper they report standardized betas so I have used standardize = "refit".) Household chaos significantly predicted younger sibling’s problem behaviour over and above maternal parenting, child age and gender, \(\hat{\beta}\) = 0.22 (0.01, 0.42), t(89) = 2.08, p = 0.040. The positive standardized beta value indicates that there is a positive relationship between household chaos and the child’s problem behaviour. In other words, the higher the level of household chaos, the more problem behaviours the child displayed. The change in \(R^2\) between the models with and without chaos as a predictor (0.104 − 0.061 ≈ 0.044) tells us that household chaos accounts for about 4.4% of the variance in child problem behaviour.

model_parameters(chaos_chaos_lm, standardize = "refit") |> 
  display()
Table 52
Parameter Coefficient SE 95% CI t(89) p
(Intercept) 2.42e-16 0.10 (-0.20, 0.20) 2.42e-15 > .999
child age 0.07 0.10 (-0.14, 0.27) 0.66 0.514
child anger 0.02 0.10 (-0.19, 0.23) 0.19 0.850
child warmth -0.01 0.11 (-0.23, 0.21) -0.09 0.926
mum pos 0.05 0.10 (-0.15, 0.26) 0.51 0.613
mum neg 0.20 0.11 (-0.01, 0.41) 1.86 0.066
chaos 0.22 0.10 (9.90e-03, 0.42) 2.08 0.040
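
For context, standardize = "refit" essentially z-scores every variable and refits the model, so each coefficient is in standard-deviation units. A base-R sketch with simulated stand-in variables (not the coldwell_2006 data) shows the equivalence to rescaling the raw slope:

```r
# Standardized beta = raw slope * sd(x) / sd(y) (simulated stand-in data)
set.seed(11)
chaos_x <- rnorm(100)
sdq_y   <- 0.2 * chaos_x + rnorm(100)

# Refit on z-scored variables ...
b_std   <- unname(coef(lm(scale(sdq_y) ~ scale(chaos_x)))[2])
# ... which matches rescaling the raw slope
b_check <- unname(coef(lm(sdq_y ~ chaos_x))[2]) * sd(chaos_x) / sd(sdq_y)

all.equal(b_std, b_check)   # TRUE
```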

Chapter 9

Warning

I had a massive brain fart with the Smart Alex questions in this chapter. Although I adapted the tasks from the IBM SPSS version of this book, for some tasks the adapted text didn’t find its way into the final manuscript. What can I say, it was an intense time. The tasks below reflect what should have been in the book (see also the errata page).

Task 9.1

Is arachnophobia (fear of spiders) specific to real spiders or will pictures of spiders evoke similar levels of anxiety? Twelve arachnophobes were asked to play with a big hairy tarantula with big fangs and an evil look in its eight eyes, and at a different point in time were shown only pictures of the same spider. The participants’ anxiety was measured in each case. Do a t-test to see whether anxiety is higher for real spiders than pictures (big_hairy_spider.csv).

Load the data using (see Tip 2):

racnoss_tib <- discovr::big_hairy_spider

Let’s get the summary statistics in Table 53.

racnoss_tib |>
  group_by(spider_type) |>
  describe_distribution(ci = 0.95, iterations = 1000) |>
  display()
Table 53: Summary statistics
spider_type Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Picture anxiety 40 (35.00, 45.00) 9.29 17.50 (25.00, 55.00) 0.00 -1.00 12 0
Real anxiety 47 (41.00, 52.84) 11.03 19.00 (30.00, 65.00) -7.32e-03 -1.12 12 0

We need to run a paired samples t-test on these data, and to do so we need the data in messy format as in Table 54:

# Convert to messy
messy_racnoss_tib <- racnoss_tib |>
  pivot_wider(
    id_cols = id,
    values_from = anxiety,
    names_from = spider_type
    ) |>
  rename_with(tolower)

# show the data
DT::datatable(messy_racnoss_tib)
Table 54: Data in messy format

Fit the model

racnoss_mod <- t.test(Pair(real, picture) ~ 1, data = messy_racnoss_tib)

Table 55 shows the parameter estimates.

model_parameters(racnoss_mod) |> 
  display()
Table 55: Paired t-test
Parameter Difference t(11) p 95% CI
Pair(real, picture) 7 2.47 0.031 (0.77, 13.23)

Alternative hypothesis: true mean difference is not equal to 0
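
A paired t-test is just a one-sample t-test on the within-participant differences, which is easy to verify in base R with made-up anxiety scores (not the big_hairy_spider data):

```r
# Paired t-test == one-sample t-test on the differences (made-up scores)
real_anx    <- c(40, 35, 50, 55, 65, 30, 45, 60)
picture_anx <- c(30, 35, 45, 40, 50, 25, 40, 55)

paired_res <- t.test(real_anx, picture_anx, paired = TRUE)
diff_res   <- t.test(real_anx - picture_anx, mu = 0)

# Both give identical t, df and p
c(paired_res$statistic, diff_res$statistic)
```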

Table 56 shows the effect size estimate.

repeated_measures_d(Pair(real, picture) ~ 1,
                    data = messy_racnoss_tib,
                    method = "rm") |>
  display()
Table 56
d (rm) 95% CI
0.63 [0.08, 1.19]

Adjusted for small sample bias.

Important Write it up!

On average, participants experienced significantly greater anxiety with real spiders (M = 47.00, SD = 11.03) than with pictures of spiders (M = 40.00, SD = 9.29), \(M_\text{diff}\) = 7.00 (0.77, 13.23), t(11) = 2.47, p = 0.031, \(\hat{d}\) = 0.63 [0.08, 1.19].

Task 9.2

Plot an error bar graph of the data in Task 1

Figure 22 shows an error bar plot.

ggplot(racnoss_tib, aes(spider_type, anxiety)) +
  geom_point(colour = "#46F884FF", position = position_jitter(width = 0.1)) +
  stat_summary(fun.data = "mean_cl_normal", colour = "#3E9BFEFF") +
  coord_cartesian(ylim = c(0, 60)) +
  scale_y_continuous(breaks = seq(0, 60, 5)) +
  labs(x = "Type of spider", y = "Anxiety") +
  theme_minimal()
Figure 22

Task 9.3

‘Pop psychology’ books sometimes spout nonsense that is unsubstantiated by science. As part of my plan to rid the world of pop psychology, I took 20 people in relationships and randomly assigned them to one of two groups. One group read the famous popular psychology book Women are from Bras and Men are from Penis, and the other read Marie Claire. The outcome variable was their relationship happiness after their assigned reading. Were people happier with their relationship after reading the pop psychology book? (self_help.csv).

Load the data using (see Tip 2):

selfhelp_tib <- discovr::self_help

Let’s get the summary statistics in Table 57.

selfhelp_tib |>
  group_by(book) |>
  describe_distribution(ci = 0.95, iterations = 1000) |>
  display()
Table 57: Summary statistics
book Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Women are from Bras, Men are from Penis happy 20.00 (17.60, 22.60) 4.11 4.75 (13.00, 28.00) 0.28 0.93 10 0
Marie Claire happy 24.20 (21.20, 26.90) 4.71 8.00 (15.00, 30.00) -0.77 -0.14 10 0

Fit the model

book_mod <- t.test(happy ~ book, data = selfhelp_tib)

Table 58 shows the parameter estimates.

model_parameters(book_mod) |> 
  display()
Table 58: Welch Two Sample t-test
Parameter Group book = Women are from Bras, Men are from Penis book = Marie Claire Difference 95% CI t(17.68) p
happy book 20 24.20 -4.20 (-8.36, -0.04) -2.12 0.048

Alternative hypothesis: true difference in means between group Women are from Bras, Men are from Penis and group Marie Claire is not equal to 0

Table 59 shows the effect size estimate.

cohens_d(happy ~ book, data = selfhelp_tib) |>
  display()
Table 59
Cohen’s d 95% CI
-0.95 [-1.87, -0.01]

Estimated using pooled SD.
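
For reference, a Cohen’s d estimated using a pooled SD is computed as follows (sketched in base R on made-up happiness scores, not the self_help data):

```r
# Cohen's d with a pooled SD, by hand (made-up scores)
pop_psych <- c(18, 22, 15, 24, 20, 19, 21, 17, 23, 20)
magazine  <- c(25, 28, 22, 27, 21, 26, 24, 23, 29, 22)

n1 <- length(pop_psych); n2 <- length(magazine)
s_pooled <- sqrt(((n1 - 1) * var(pop_psych) + (n2 - 1) * var(magazine)) /
                 (n1 + n2 - 2))
d <- (mean(pop_psych) - mean(magazine)) / s_pooled
d   # negative: the first group reports lower happiness
```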

Important Write it up!

On average, the reported relationship happiness after reading Marie Claire (M = 24.20, SD = 4.71), was significantly higher than after reading Women are from bras and men are from penis (M = 20.00, SD = 4.11), \(M_\text{diff}\) = -4.20 (-8.36, -0.04), t(17.68) = -2.12, p = 0.048, \(\hat{d}\) = -0.95 [-1.87, -0.01].

Task 9.4

Twaddle and Sons, the publishers of Women are from Bras and Men are from Penis, were upset about my claims that their book was as useful as a paper umbrella. They ran their own experiment (N = 500) in which relationship happiness was measured after participants had read their book and after reading this one. (Participants read the books in counterbalanced order with a six-month delay.) Was relationship happiness greater after reading their wonderful contribution to pop psychology than after reading my tedious tome about experiments? (self_help_dsur.csv).

Load the data using (see Tip 2):

self_help_dsur_tib <- discovr::self_help_dsur

Let’s get the summary statistics in Table 60.

self_help_dsur_tib |>
  group_by(book) |>
  describe_distribution(ci = 0.95, iterations = 1000) |>
  display()
Table 60: Summary statistics
book Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Self-help book happiness 20.02 (19.19, 20.96) 9.98 12 (-18.00, 45.00) -0.07 0.48 500 0
Discovering statistics using R happiness 18.49 (17.73, 19.21) 8.99 12 (-8.00, 42.00) 0.06 -0.12 500 0

We need to run a paired samples t-test on these data, and to do so we need the data in messy format as in Table 61:

# Convert to messy
messy_book_tib <- self_help_dsur_tib |>
  pivot_wider(
    id_cols = id,
    values_from = happiness,
    names_from = book
    ) |>
  rename(
    self_help = `Self-help book`,
    dsur = `Discovering statistics using R`
  )

# show the data
DT::datatable(messy_book_tib)
Table 61: Data in messy format

Fit the model

bookr_mod <- t.test(Pair(self_help, dsur) ~ 1, data = messy_book_tib)

Table 62 shows the parameter estimates.

model_parameters(bookr_mod) |> 
  display()
Table 62: Paired t-test
Parameter Difference t(499) p 95% CI
Pair(self_help, dsur) 1.53 2.71 0.007 (0.42, 2.64)

Alternative hypothesis: true mean difference is not equal to 0

Table 63 shows the effect size estimate.

repeated_measures_d(Pair(self_help, dsur) ~ 1,
                    data = messy_book_tib,
                    method = "rm") |>
  display()
Table 63
d (rm) 95% CI
0.16 [0.04, 0.28]

Adjusted for small sample bias.

Although this effect is highly statistically significant, the size of the effect is small. In this example, it would be tempting for Twaddle and Sons to conclude that their book produced significantly greater relationship happiness than my book. In fact, many researchers would write conclusions like this:

Important Write it up!

On average, the reported relationship happiness after reading Field (2026) (M = 18.49, SD = 8.99), was significantly lower than after reading Women are from bras and men are from penis (M = 20.02, SD = 9.98), \(M_\text{diff}\) = 1.53 (0.42, 2.64), t(499) = 2.71, p = 0.007. In other words, reading Women are from bras and men are from penis produces significantly greater relationship happiness than that book by smelly old Field.

However, to reach such a conclusion is to confuse statistical significance with the importance of the effect. By calculating the effect size we’ve discovered that although the difference in happiness after reading the two books is statistically different, the size of effect that this represents is small. A more balanced interpretation might, therefore, be:

Important Write it up!

On average, the reported relationship happiness after reading Field (2026) (M = 18.49, SD = 8.99), was significantly lower than after reading Women are from bras and men are from penis (M = 20.02, SD = 9.98), \(M_\text{diff}\) = 1.53 (0.42, 2.64), t(499) = 2.71, p = 0.007, \(\hat{d}\) = 0.16 [0.04, 0.28]. However, the effect size was small, revealing that this finding was not substantial in real terms.

Of course, this latter interpretation would be unpopular with Twaddle and Sons who would like to believe that their book had a huge effect on relationship happiness.

Task 9.5

In Chapter 4 (Task 6) we looked at data from people who had fish or cats as pets and measured their life satisfaction as well as how much they like animals (pets.csv). Conduct a t-test to see whether life satisfaction depends upon the type of pet a person has.

Load the data using (see Tip 2):

pet_tib <- discovr::pets

Let’s get the summary statistics in Table 64.

pet_tib |>
  group_by(pet) |>
  describe_distribution(select = "life_satisfaction", ci = 0.95, iterations = 1000) |>
  display()
Table 64: Summary statistics
pet Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Fish life_satisfaction 38.17 (28.67, 45.84) 15.51 19.25 (6.00, 56.00) -1.17 0.44 12 0
Cat life_satisfaction 60.12 (52.62, 66.13) 11.10 13.50 (37.00, 72.00) -1.45 2.23 8 0

Fit the model

pet_mod <- t.test(life_satisfaction ~ pet, data = pet_tib)

Table 65 shows the parameter estimates.

model_parameters(pet_mod) |> 
  display()
Table 65: Welch Two Sample t-test
Parameter Group pet = Fish pet = Cat Difference 95% CI t(17.84) p
life_satisfaction pet 38.17 60.12 -21.96 (-34.48, -9.44) -3.69 0.002

Alternative hypothesis: true difference in means between group Fish and group Cat is not equal to 0

Table 66 shows the effect size estimate.

cohens_d(life_satisfaction ~ pet, data = pet_tib) |>
  display()
Table 66
Cohen’s d 95% CI
-1.57 [-2.59, -0.53]

Estimated using pooled SD.

As well as being statistically significant, this effect is very large and so represents a substantive finding. We could report:

Important Write it up!

On average, the life satisfaction of people with cats (M = 60.12, SD = 11.10), was significantly higher than for those with fish (M = 38.17, SD = 15.51), \(M_\text{diff}\) = -21.96 (-34.48, -9.44), t(17.84) = -3.69, p = 0.002, \(\hat{d}\) = -1.57 [-2.59, -0.53].

Task 9.6

Repeat the t-test from Task 5 but turn off Welch’s correction. Then fit a linear model to the data in Task 5 to see whether life satisfaction is significantly predicted from the type of animal. What do you notice about the t-value and significance in this model compared to Task 5?

First let’s fit the model without applying the default Welch correction by including var.equal = TRUE. The results are in Table 67.

pet_mod <- t.test(life_satisfaction ~ pet,
                  data = pet_tib,
                  var.equal = TRUE)

model_parameters(pet_mod) |> 
  display()
Table 67: Two Sample t-test
Parameter Group pet = Fish pet = Cat Difference 95% CI t(18) p
life_satisfaction pet 38.17 60.12 -21.96 (-35.35, -8.57) -3.45 0.003

Alternative hypothesis: true difference in means between group Fish and group Cat is not equal to 0

Now let’s fit a linear model. The results are in Table 68.

pet_lm <- lm(life_satisfaction ~ pet,
                  data = pet_tib)

model_parameters(pet_lm) |> 
  display()
Table 68
Parameter Coefficient SE 95% CI t(18) p
(Intercept) 38.17 4.03 (29.70, 46.63) 9.47 < .001
pet (Cat) 21.96 6.37 (8.57, 35.35) 3.45 0.003

The values of t and p are the same. (Technically, t is different because for the linear model it is a positive value and for the t-test it is negative. However, the sign of t merely reflects which way around you coded the cat and fish groups. The linear model, by default, has coded the groups the opposite way around to the t-test.) The main point I wanted to make here is that whether you run these data through the lm() or t.test() functions, the results are identical.
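If the sign mismatch bothers you, you can make the linear model reproduce the sign of the t-test by changing the reference category before fitting. A sketch using forcats (this assumes, as in the tables above, that the factor pet has levels Fish and Cat):

```r
pet_tib <- discovr::pets

# Make "Cat" the reference (first) level so the slope is Fish minus Cat
pet_tib <- pet_tib |>
  dplyr::mutate(pet = forcats::fct_relevel(pet, "Cat"))

lm(life_satisfaction ~ pet, data = pet_tib)
# The pet (Fish) coefficient is now -21.96, matching the sign of the t-test
```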

Task 9.7

In Chapter 6 we looked at hygiene scores over three days of a rock music festival (download.csv). Do a paired-samples t-test to see whether hygiene scores on day 1 differed from those on day 3.

Load the data using (see Tip 2):

download_tib <- discovr::download

Let’s get the summary statistics in Table 69.

describe_distribution(x = download_tib,
                      select = c("day_1", "day_3"),
                      ci = 0.95, iterations = 1000) |>
  display()
Table 69: Summary statistics
Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
day_1 1.79 (1.73, 1.86) 0.94 0.93 (0.02, 20.02) 8.87 170.45 810 0
day_3 0.98 (0.86, 1.10) 0.71 1.11 (0.02, 3.41) 1.03 0.73 123 687

We need to run a paired samples t-test on these data, and so the data need to be in messy format, which they are.

Fit the model:

download_mod <- t.test(Pair(day_1, day_3) ~ 1, data = download_tib)
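The Pair() formula is equivalent to a one-sample t-test on the difference scores, which is a useful way to think about what a paired test does. Rows where day_3 is missing are dropped in both versions, leaving the 123 complete pairs:

```r
download_tib <- discovr::download

# A paired t-test is a one-sample t-test on the differences
t.test(download_tib$day_1 - download_tib$day_3)
# Should reproduce t(122) = 10.59, p < .001 from Table 70
```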

Table 70 shows the parameter estimates.

model_parameters(download_mod) |> 
  display()
Table 70: Paired t-test
Parameter Difference t(122) p 95% CI
Pair(day_1, day_3) 0.67 10.59 < .001 (0.55, 0.80)

Alternative hypothesis: true mean difference is not equal to 0

Table 71 shows the effect size estimate.

repeated_measures_d(Pair(day_1, day_3) ~ 1,
                    data = download_tib,
                    method = "rm") |>
  display()
Table 71
d (rm) 95% CI
0.99 [0.76, 1.21]

Adjusted for small sample bias.

Important Write it up!

On average, hygiene scores significantly decreased from day 1 (M = 1.79, SD = 0.94) to day 3 (M = 0.98, SD = 0.71) of the Download music festival, \(M_\text{diff}\) = 0.67 (0.55, 0.80), t(122) = 10.59, p < 0.001, \(\hat{d}\) = 0.99 [0.76, 1.21].

Task 9.8

A psychologist was interested in the cross-species differences between men and dogs. She observed a group of dogs and a group of men in a naturalistic setting (20 of each). She classified several behaviours as being dog-like (urinating against trees and lampposts, attempts to copulate with anything that moved, and attempts to lick their own genitals). For each man and dog she counted the number of dog-like behaviours displayed in a 24-hour period. It was hypothesized that dogs would display more dog-like behaviours than men. Test this hypothesis using a robust test. (men_dogs.csv)

Load the data using (see Tip 2):

mandog_tib <- discovr::men_dogs

Let’s get the summary statistics in Table 72.

mandog_tib |>
  group_by(species) |>
  describe_distribution(select = "behaviour", ci = 0.95, iterations = 1000) |>
  display()
Table 72: Summary statistics
species Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Dog behaviour 28.05 (23.55, 33.00) 10.98 17.25 (15.00, 51.00) 0.74 -0.57 20 0
Man behaviour 26.85 (22.95, 31.65) 9.90 11.25 (12.00, 54.00) 0.94 1.66 20 0

Fit the model

mandog_mod <- t.test(behaviour ~ species, data = mandog_tib)

Table 73 shows the parameter estimates.

model_parameters(mandog_mod) |> 
  display()
Table 73: Welch Two Sample t-test
Parameter Group species = Dog species = Man Difference 95% CI t(37.60) p
behaviour species 28.05 26.85 1.20 (-5.50, 7.90) 0.36 0.719

Alternative hypothesis: true difference in means between group Dog and group Man is not equal to 0

Table 74 shows the effect size estimate.

cohens_d(behaviour ~ species, data = mandog_tib) |>
  display()
Table 74
Cohen’s d 95% CI
0.11 [-0.51, 0.73]

Estimated using pooled SD.

The effect is not statistically significant and the effect size is small. We could report:

Important Write it up!

On average, men (M = 26.85, SD = 9.90) engaged in less dog-like behaviour than dogs (M = 28.05, SD = 10.98). However, this difference was not significant, \(M_\text{diff}\) = 1.20 (-5.50, 7.90), t(37.6) = 0.36, p = 0.719, and was a small effect, \(\hat{d}\) = 0.11 [-0.51, 0.73].

Task 9.9

Both Ozzy Osbourne and Judas Priest have been accused of putting backward masked messages on their albums that subliminally influence poor unsuspecting teenagers into doing things like blowing their heads off with shotguns. A psychologist was interested in whether backward masked messages could have an effect. He created a version of Taylor Swift’s ‘Shake It Off’ that contained the masked message ‘deliver your soul to the dark lord’ repeated in the chorus. He took this version, and the original, and played one version (randomly) to a group of 32 people. Six months later he played them whatever version they hadn’t heard the time before. So, each person heard both the original and the version with the masked message, but at different points in time. The psychologist measured the number of satanic intrusions the person had in the week after listening to each version. Test the hypothesis that the backward message would lead to more intrusions (dark_lord.csv) using a robust test.

Load the data using (see Tip 2):

darklord_tib <- discovr::dark_lord

Let’s get the summary statistics in Table 75.

darklord_tib |>
  group_by(message) |>
  describe_distribution(select = "intrusions", ci = 0.95, iterations = 1000) |>
  display()
Table 75: Summary statistics
message Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Message intrusions 9.16 (7.91, 10.38) 3.55 4 (2.00, 17.00) 0.08 0.24 32 0
No message intrusions 11.50 (10.09, 13.03) 4.38 3 (4.00, 21.00) 0.79 0.23 32 0

We need to run a robust paired samples t-test on these data. Therefore, we need to make the data messy (as in Table 76):

# Convert to messy
dark_mess_tib <- darklord_tib |>
  pivot_wider(
    id_cols = id,
    values_from = intrusions,
    names_from = message
    ) |>
  rename_with(~ tolower(str_replace(.x, pattern = " ", replacement = "_")))

# show the data
DT::datatable(dark_mess_tib)
Table 76: Data in messy format

Fit the robust model

darklord_rob <- WRS2::yuend(x = dark_mess_tib$message, y = dark_mess_tib$no_message)
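yuend() compares 20% trimmed means by default (tr = 0.2): the top and bottom 20% of scores in each condition are discarded before averaging, which limits the influence of outliers. Base R's mean() illustrates the idea with some toy numbers (not the dark lord data):

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)  # one extreme score

mean(x)              # 14.5: the ordinary mean is dragged up by the outlier
mean(x, trim = 0.2)  # 5.5: the 20% trimmed mean ignores the extremes
```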

Table 77 shows the parameter estimates.

model_parameters(darklord_rob) |> 
  display()
Table 77: Yuen’s test on trimmed means for dependent samples
t(19) p Difference Difference 95% CI Estimate Effectsize
-1.96 0.064 -1.55 (-3.20, 0.10) 0.36 Explanatory measure of effect size
Important Write it up!

Participants had fewer satanic intrusions after hearing the backward message (M = 9.16, SD = 3.55) than after hearing the normal version of the Taylor Swift song (M = 11.50, SD = 4.38). This difference was not significant, \(M_\text{diff}\) = -1.55 (-3.20, 0.10), Y_t(19) = -1.96, p = 0.064. However, it represented a substantial effect, \(\hat{\xi}\) = 0.36 (the explanatory measure of effect size).

Task 9.10

Thinking back to Labcoat Leni’s Real Research 4.1, test whether the number of offers was significantly different in people listening to Bon Scott than in those listening to Brian Johnson (acdc.csv) using a robust independent t-test and bootstrapping. Do your results differ from Oxoby (2008)?

Load the data using (see Tip 2):

acdc_tib <- discovr::acdc

Let’s get the summary statistics in Table 78.

acdc_tib |>
  group_by(singer) |>
  describe_distribution(select = "offer", ci = 0.95, iterations = 1000) |>
  display()
Table 78: Summary statistics
singer Variable Mean 95% CI (Mean) SD IQR Range Skewness Kurtosis n n_Missing
Bon Scott offer 3.28 (2.72, 3.78) 1.18 2 (1.00, 5.00) -0.12 -0.77 18 0
Brian Johnson offer 4.00 (3.56, 4.50) 0.97 2 (2.00, 5.00) -0.43 -0.96 18 0

Fit the model

acdc_rob <- WRS2::yuenbt(offer ~ singer, data = acdc_tib, nboot = 1000, side = TRUE)

Table 80 shows the parameter estimates.

model_parameters(acdc_rob) |> 
  display()
Table 80: Yuen’s test on trimmed means for independent samples
t p Difference Difference 95% CI
-1.73 0.086 -0.83 (-1.79, 0.12)

Table 81 shows the effect size estimate.

cohens_d(offer ~ singer, data = acdc_tib) |>
  display()
Table 81
Cohen’s d 95% CI
-0.67 [-1.34, 0.01]

Estimated using pooled SD.

The bootstrap confidence interval ranged from -1.79 to 0.12, which just about crosses zero, suggesting (if we assume that it is one of the 95% of confidence intervals that contain the true value) that the effect in the population could be zero.

Important Write it up!

On average, more offers were made when listening to Brian Johnson (M = 4.00, SD = 0.97) than Bon Scott (M = 3.28, SD = 1.18). This difference was not significant, \(M_\text{diff}\) = -0.83 (-1.79, 0.12), Y_t = -1.73, p = 0.086, \(\hat{d}\) = -0.67 [-1.34, 0.01].

Chapter 10

Task 10.1

McNulty et al. (2008) found a relationship between a person’s attractiveness and how much support they give their partner among newlywed heterosexual couples. The data are in mcnulty_2008.csv. Is this relationship moderated by spouse (i.e., whether the data were from the husband or wife)?

Load the data using (see Tip 2):

mcnulty_tib <- discovr::mcnulty_2008

Centre the predictors

mcnulty_ctib <- center(x = mcnulty_tib, select = c(attractiveness , support, satisfaction))
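center() (from datawizard, loaded with easystats) simply subtracts each variable's mean. If you prefer to see the computation explicitly, this tidyverse sketch does the same thing (assuming the default of grand-mean centring):

```r
mcnulty_tib <- discovr::mcnulty_2008

# Grand-mean centre the predictors by hand (equivalent to center())
mcnulty_ctib <- mcnulty_tib |>
  dplyr::mutate(
    dplyr::across(
      c(attractiveness, support, satisfaction),
      \(x) x - mean(x, na.rm = TRUE)
    )
  )
```

Centring the predictors makes the lower-order terms in the interaction model interpretable as effects at the mean of the other predictor.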

Fit the model

support_lm <- lm(support ~ attractiveness*spouse, data = mcnulty_ctib)

Table 82 shows the fit statistics

model_performance(support_lm) |> 
  display()
Table 82
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
-52.2 -51.9 -36.7 0.09 0.07 0.20 0.20

Evaluate the model assumptions in Figure 23. There’s potential heteroscedasticity.

check_model(support_lm)
Figure 23

Table 83 shows the robust parameter estimates

model_parameters(support_lm, vcov = "HC4") |>
  display()
Table 83
Parameter Coefficient SE 95% CI t(160) p
(Intercept) -0.01 0.02 (-0.06, 0.03) -0.50 0.619
attractiveness -0.06 0.02 (-0.10, -0.02) -2.93 0.004
spouse (Wife) 0.02 0.03 (-0.04, 0.09) 0.75 0.455
attractiveness × spouse (Wife) 0.11 0.03 (0.05, 0.16) 3.54 < .001
Important Write it up!

The relationship between support and attractiveness was significantly moderated by whether the data were from the husband or wife, \(\hat{b}\) = 0.11 (0.05, 0.16), t(160) = 3.54, p < 0.001.

Task 10.2

Produce the simple slopes analysis for Task 1.

Table 84 tells us that the relationship between attractiveness of a person and amount of support given to their spouse is different for husbands and wives. Specifically, for wives, as attractiveness increases the level of support that they give to their husbands increases, whereas for husbands, as attractiveness increases the amount of support they give to their wives decreases.

support_ss <- estimate_slopes(model = support_lm,
                              trend = "attractiveness",
                              by = "spouse")

display(support_ss)
Table 84: Estimated Marginal Effects
spouse Slope SE 95% CI t(160) p
Husband -0.06 0.02 (-0.10, -0.02) -2.97 0.003
Wife 0.05 0.02 ( 0.01, 0.08) 2.35 0.020

Marginal effects estimated for attractiveness; Type of slope was dY/dX

Important Write it up!

For husbands, there was a significant negative relationship between attractiveness and support, \(\hat{b}\) = -0.06 (-0.10, -0.02), t(160) = -2.97, p = 0.003. For wives, there was a significant positive relationship between attractiveness and support, \(\hat{b}\) = 0.05 (0.01, 0.08), t(160) = 2.35, p = 0.020.

Task 10.3

McNulty et al. (2008) also found a relationship between a person’s attractiveness and their relationship satisfaction among newlyweds. Using the same data as in Tasks 1 and 2, find out if this relationship is moderated by spouse.

The data were loaded and variables centred in Task 1. To fit this model use:

satisfaction_lm <- lm(satisfaction ~ attractiveness*spouse, data = mcnulty_ctib)

Table 85 shows the fit statistics

model_performance(satisfaction_lm) |> 
  display()
Table 85
AIC AICc BIC R2 R2 (adj.) RMSE Sigma
960.6 961.0 976.1 0.03 9.97e-03 4.39 4.44

Evaluate the model assumptions in Figure 24. There’s potential heteroscedasticity.

check_model(satisfaction_lm)
Figure 24

Table 86 shows the robust parameter estimates

model_parameters(satisfaction_lm, vcov = "HC4") |>
  display()
Table 86
Parameter Coefficient SE 95% CI t(160) p
(Intercept) 0.01 0.41 (-0.79, 0.82) 0.04 0.972
attractiveness -0.88 0.38 (-1.64, -0.13) -2.32 0.021
spouse (Wife) -0.02 0.70 (-1.40, 1.35) -0.03 0.973
attractiveness × spouse (Wife) 0.55 0.58 (-0.59, 1.68) 0.95 0.343
Important Write it up!

The relationship between marital satisfaction and attractiveness was not significantly moderated by whether the data were from the husband or wife, \(\hat{b}\) = 0.55 (-0.59, 1.68), t(160) = 0.95, p = 0.343.

Task 10.4

In this chapter we tested a mediation model of infidelity for Lambert et al.’s data (Lambert et al., 2012). Repeat this analysis but using hook_ups as the measure of infidelity.

Load the data using (see Tip 2):

infidelity_tib <- discovr::lambert_2012

Specify the model by changing the outcome in the code within the book chapter:

hookup_mod <- 'hook_ups ~ c*ln_porn + b*commit
                   commit ~ a*ln_porn

                   indirect_effect := a*b
                   total_effect := c + (a*b)
                   '

Fit the model

hookup_fit <- lavaan::sem(hookup_mod,
                          data = infidelity_tib,
                          missing = "FIML",
                          estimator = "MLR")

Table 87 shows the parameter estimates

model_parameters(hookup_fit) |> 
  display()
Table 87: Parameter estimates
Link Coefficient SE 95% CI z p
hook_ups ~ ln_porn (c) 1.29 0.52 (0.28, 2.30) 2.50 0.013
hook_ups ~ commit (b) -0.62 0.13 (-0.88, -0.36) -4.72 < .001
commit ~ ln_porn (a) -0.47 0.23 (-0.92, -0.02) -2.06 0.040
To Coefficient SE 95% CI z p
(indirect_effect) 0.29 0.15 (-7.47e-03, 0.59) 1.91 0.056
(total_effect) 1.58 0.55 (0.51, 2.65) 2.89 0.004
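The defined parameters are simple arithmetic on the paths: the indirect effect is the product of the a and b paths, and the total effect adds the direct path c. Reproducing them from the rounded estimates in Table 87:

```r
a <- -0.47  # commit ~ ln_porn
b <- -0.62  # hook_ups ~ commit
c <- 1.29   # hook_ups ~ ln_porn (the direct effect)

a * b      # 0.2914: the indirect effect (0.29 in the table)
c + a * b  # 1.5814: the total effect (1.58 in the table)
```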
Important Write it up!

Pornography consumption significantly predicted hook-ups, \(\hat{b}\) = 1.29 (0.28, 2.30), z = 2.50, p = 0.013: as pornography consumption increased, the number of hook-ups increased also. Pornography consumption also significantly predicted relationship commitment, \(\hat{b}\) = -0.47 (-0.92, -0.02), z = -2.06, p = 0.040: as pornography consumption increased, commitment declined. The indirect effect of pornography consumption on hook-ups through relationship commitment was not quite significant, \(\hat{b}\) = 0.29 (-0.01, 0.59), z = 1.91, p = 0.056.

Is there evidence for mediation? The indirect effect was not significant, so no: there is no significant mediation (although it comes very close).

Task 10.5

Tablets like the iPad are very popular. A company owner was interested in how to make his brand of tablets more desirable. He collected data on how cool people perceived a product’s advertising to be (advert_cool), how cool they thought the product was (product_cool) and how desirable they found the product (desirability). Test his theory that the relationship between cool advertising and product desirability is mediated by how cool people think the product is (tablets.csv). Am I showing my age by using the word ‘cool’?

Load the data using (see Tip 2):

tablet_tib <- discovr::tablets

Specify the model:

tablet_mod <- 'desirability ~ c*advert_cool + b*product_cool
                   product_cool ~ a*advert_cool

                   indirect_effect := a*b
                   total_effect := c + (a*b)
                   '

Fit the model

tablet_fit <- lavaan::sem(tablet_mod,
                          data = tablet_tib,
                          estimator = "MLR")

Table 88 shows the parameter estimates

model_parameters(tablet_fit) |> 
  display()
Table 88: Parameter estimates
Link Coefficient SE 95% CI z p
desirability ~ advert_cool (c) 0.20 0.06 (0.07, 0.33) 3.11 0.002
desirability ~ product_cool (b) 0.23 0.06 (0.12, 0.35) 3.96 < .001
product_cool ~ advert_cool (a) 0.15 0.06 (0.03, 0.27) 2.43 0.015
To Coefficient SE 95% CI z p
(indirect_effect) 0.04 0.02 (1.23e-03, 0.07) 2.03 0.042
(total_effect) 0.24 0.06 (0.11, 0.36) 3.68 < .001
  • Advert coolness significantly predicted desirability, \(\hat{b}\) = 0.20 (0.07, 0.33), z = 3.11, p = 0.002. As advert coolness increased, desirability increased also.
  • Advert coolness significantly predicted perceptions of product coolness, \(\hat{b}\) = 0.15 (0.03, 0.27), z = 2.43, p = 0.015. As advert coolness increased, product coolness increased also.
  • Product coolness significantly predicted desirability, \(\hat{b}\) = 0.23 (0.12, 0.35), z = 3.96, p < 0.001. Products perceived to be cooler were more desirable.
  • The indirect effect of advert coolness on desirability through product coolness was significant, \(\hat{b}\) = 0.04 (0.00, 0.07), z = 2.03, p = 0.042. As such, there was evidence of significant mediation.

You could report this as:

Important Write it up!

There was a significant indirect effect of how cool people think a product’s advertising is on the desirability of the product through how cool they think the product is, \(\hat{b}\) = 0.04 (0.00, 0.07), z = 2.03, p = 0.042. This represents a relatively small effect.

Chapter 11

Task 11.1

To test how different teaching methods affected students’ knowledge I took three statistics modules (group) where I taught the same material. For one module I wandered around with a large cane and beat anyone who asked daft questions or got questions wrong (punish). In the second I encouraged students to discuss things that they found difficult and gave anyone working hard a nice sweet (reward). In the final course I neither punished nor rewarded students’ efforts (indifferent). I measured the students’ exam marks (percentage). The data are in the file teach_method.csv. Fit a model with planned contrasts to test the hypotheses that: (1) reward results in better exam results than either punishment or indifference; and (2) indifference will lead to significantly better exam results than punishment.

Load the data using (see Tip 2):

teach_tib <- discovr::teach_method

Let’s do the plot first. We can get an error bar plot in Figure 25 as follows:

ggplot(teach_tib, aes(group, exam)) +
  stat_summary(fun.data = "mean_cl_boot", colour = "#3E9BFEFF") +
  coord_cartesian(ylim = c(0, 70)) +
  scale_y_continuous(breaks = seq(0, 70, 10)) +
  labs(x = "Method of teaching", y = "Percentage on exam") +
  theme_minimal()
Figure 25

The plot shows that marks are highest after reward and lowest after punishment.

To fit the model we need to think about the hypotheses:

NoteHypotheses
  1. Reward results in better exam results than either punishment or indifference
  2. Indifference will lead to significantly better exam results than punishment

To test our hypotheses we need to first enter the codes for the contrasts in Table 89.

Table 89: Contrast codes for the teaching data
Group Reward vs. other Indifference vs. punish
Punish 1/3 1/2
Indifferent 1/3 -1/2
Reward -2/3 0

Contrast 1 tests hypothesis 1: reward results in better exam results than either punishment or indifference. To test this we compare the reward condition against the other two. So we compare chunk 1 (punish, indifferent) to chunk 2 (reward). In the table, the numbers assigned to the groups are the number of groups in the opposite chunk divided by the number of groups that have non-zero codes, and we randomly assigned one chunk to be a negative value. (The codes -1/3, -1/3, 2/3 would work fine as well.)

Contrast 2 tests hypothesis 2: indifference will lead to significantly better exam results than punishment. First we get rid of the reward condition by assigning a code of 0; next we compare chunk 1 (indifferent) to chunk 2 (punish). The numbers assigned to the groups are the number of groups in the opposite chunk divided by the number of groups that have non-zero codes, and then we randomly assigned one chunk to be a negative value. (The codes -1/2, 1/2, 0 would work fine as well.)

We set these contrasts for the variable group as follows:

reward_vs_other <- c(1/3, 1/3, -2/3)
indifferent_vs_punish <- c(1/2, -1/2, 0)

contrasts(teach_tib$group) <- cbind(reward_vs_other, indifferent_vs_punish)
contrasts(teach_tib$group) # check the contrasts are set correctly
            reward_vs_other indifferent_vs_punish
Punish            0.3333333                   0.5
Indifferent       0.3333333                  -0.5
Reward           -0.6666667                   0.0
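Before fitting the model it's worth sanity-checking the weights: a valid contrast's weights sum to zero, and two contrasts are independent (orthogonal) when the products of their corresponding weights also sum to zero:

```r
reward_vs_other <- c(1/3, 1/3, -2/3)
indifferent_vs_punish <- c(1/2, -1/2, 0)

sum(reward_vs_other)                          # 0: a valid contrast
sum(indifferent_vs_punish)                    # 0: a valid contrast
sum(reward_vs_other * indifferent_vs_punish)  # 0: the contrasts are orthogonal
```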

Next we fit the model using this code:

teach_lm <- lm(exam ~ group, data = teach_tib)

anova(teach_lm) |>
  model_parameters(es_type = "omega", ci = 0.95) |>
  display()
Table 90
Parameter Sum_Squares df Mean_Square F p Omega2 Omega2 95% CI
group 1205.07 2 602.53 21.01 < .001 0.57 (0.34, 1.00)
Residuals 774.40 27 28.68

Anova Table (Type 1 tests)

Table 90 tells us that there was a significant effect of the teaching method on exam scores, F(2, 27) = 21.01, p < 0.001, \(\hat{\omega}^2\) = 0.57 (0.34, 1.00). View the parameters using:

model_parameters(teach_lm) |>
  display()
Table 91
Parameter Coefficient SE 95% CI t(27) p
(Intercept) 57.13 0.98 (55.13, 59.14) 58.43 < .001
group (reward_vs_other) -12.40 2.07 (-16.66, -8.14) -5.98 < .001
group (indifferent_vs_punish) -6.00 2.40 (-10.91, -1.09) -2.51 0.019

Table 91 shows that hypothesis 1 is supported (groupreward_vs_other): teaching with rewards led to significantly higher exam scores than the other methods, \(\hat{b}\) = -12.40 (-16.66, -8.14), t(27) = -5.98, p < 0.001. Hypothesis 2 is also supported (groupindifferent_vs_punish): teaching with indifference was associated with higher exam scores than teaching with punishment, \(\hat{b}\) = -6.00 (-10.91, -1.09), t(27) = -2.51, p = 0.019.

We can get some basic diagnostic plots (Figure 26). There are no large Cook’s distances, the Q-Q plot suggests normal residuals but the residual vs. fitted plot and the scale-location plot suggest heterogeneity of variance (the columns of dots are different lengths and the red line is not flat especially in the scale location plot), which brings us to Task 2!

check_model(teach_lm)
Figure 26

Task 11.2

Fit a robust model for Task 1.

Table 92 shows a robust F for the teaching data. The Welch F is highly significant, like the non-robust F.

oneway.test(exam ~ group, data = teach_tib) |> 
  model_parameters() |> 
  display()
Table 92: One-way analysis of means (not assuming equal variances)
F df df (error) p
32.24 2 17.34 < .001

Table 93 shows that the first contrast is still highly significant, and the second contrast is also still significant. As such, our conclusions are unchanged when fitting a model that is robust to heteroscedasticity.

model_parameters(teach_lm, vcov = "HC4") |> 
  display()
Table 93
Parameter Coefficient SE 95% CI t(27) p
(Intercept) 57.13 0.98 (55.13, 59.14) 58.43 < .001
group (reward_vs_other) -12.40 1.88 (-16.26, -8.54) -6.59 < .001
group (indifferent_vs_punish) -6.00 2.60 (-11.33, -0.67) -2.31 0.029

Task 11.3

Children wearing superhero costumes are more likely to harm themselves because of the unrealistic impression of invincibility that these costumes could create. For example, children have reported to hospital with severe injuries because of trying ‘to initiate flight without having planned for landing strategies’ (Davies, Surridge, Hole, & Munro-Davies, 2007). I can relate to the imagined power that a costume bestows upon you; indeed, I have been known to dress up as Fisher by donning a beard and glasses and trailing a goat around on a lead in the hope that it might make me more knowledgeable about statistics. Imagine we had data (superhero.csv) about the severity of injury (on a scale from 0, no injury, to 100, death) for children reporting to the accident and emergency department at hospitals, and information on which superhero costume they were wearing (hero): Spiderman, Superman, the Hulk or a teenage mutant ninja turtle. Fit a model with planned contrasts to test the hypothesis that those wearing costumes of flying superheroes (Superman and Spiderman) have more severe injuries.

To get you into the mood for hulk-related data analysis, Figure 1 shows my wife and me on the Hulk rollercoaster at the Islands of Adventure in Orlando, on our honeymoon.

Andy and his wife on the Hulk rollercoaster.

The Fields on the Hulk

Not in the mood yet? OK, let’s ramp up the ‘weird’ with a photo (Figure 2) of some complete strangers reading one of my textbooks on the same Hulk roller-coaster that my wife and I rode (not at the same time as the book readers I should add). Nothing says ‘I love your textbook’ like taking it on a stomach-churning high speed ride. I dearly wish that reading my books on roller coasters would become a ‘thing’.

Random American students reading my book on the Hulk rollercoaster.

Random American students reading my book on the Hulk

Load the data using (see Tip 2):

hero_tib <- discovr::superhero

Let’s start with a violin plot of the data with some error bars (Figure 27):

ggplot(hero_tib, aes(hero, injury)) +
  geom_violin(colour = "#3E9BFEFF", fill = "#3E9BFEFF", alpha = 0.1) +
  stat_summary(fun.data = "mean_cl_boot", colour = "#3E9BFEFF") +
  coord_cartesian(ylim = c(0, 90)) +
  scale_y_continuous(breaks = seq(0, 90, 10)) +
  labs(x = "Costume worn", y = "Injury severity (0-100)") +
  theme_minimal()
Figure 27

To test our main hypotheses we need to first work out some contrasts. In the book, we learned rules to generate contrasts. Figure 3 shows how we would apply these to the Superhero example to plan the contrasts that we want. We’re told that we want to compare flying superheroes (i.e. Superman and Spiderman) against non-flying ones (the Hulk and Ninja Turtles) in the first instance. That will be contrast 1. However, because each of these chunks is made up of two groups (e.g., the flying superheroes chunk comprises both children wearing Spiderman and those wearing Superman costumes), we need a second and third contrast that breaks each of these chunks down into their constituent parts.

Contrast 1 compares flying (Superman, Spiderman) to non-flying (Hulk, Turtle) superheroes. Contrast 2 compares Superman to Spiderman ignoring the non-flying (Hulk, Turtle) superheroes. Contrast 3 compares the Hulk to Ninja Turtles ignoring the flying (Superman, Spiderman) superheroes.

Planned contrasts for the superhero example

Contrast 1 compares flying (Superman, Spiderman) to non-flying (Hulk, Turtle) superheroes. Each chunk contains two groups, so the weights for the opposite chunks are both 2 divided by the number of groups with non-zero weights. This gives us \(\frac{2}{4} = \frac{1}{2}\). We assign one chunk positive weights and the other negative weights (in Table 94 I’ve chosen the flying superheroes to have positive weights, but you could do it the other way around).

Contrast two then compares the two flying superheroes to each other. First we assign both non-flying superheroes a 0 weight to remove them from the contrast. We’re left with two chunks: one containing the Superman group and the other containing the Spiderman group. Each chunk contains one group, so the weights for the opposite chunks are both 1 divided by the number of groups with non-zero weights (i.e. 1/2). We assign one chunk positive weights and the other negative weights (in Table 94 I’ve given Superman the positive weight, but you could do it the other way around).

Finally, Contrast three compares the two non-flying superheroes to each other. First we assign both flying superheroes a 0 weight to remove them from the contrast. We’re left with two chunks: one containing the Hulk group and the other containing the Turtle group. Each chunk contains one group, so the weights for the opposite chunks are both 1/2 (for the same reason as contrast 2). We assign one chunk positive weights and the other negative weights (in Table 94 I’ve given the Hulk the positive weight, but you could do it the other way around).

Table 94: Contrast codes for the superhero data
Group Flying vs. non Superman vs. Spiderman Hulk vs. Ninja turtle
Superman 1/2 1/2 0
Spiderman 1/2 -1/2 0
Hulk -1/2 0 1/2
Ninja Turtle -1/2 0 -1/2

We set these contrasts for the variable hero as follows:

flying_vs_not <- c(1/2, 1/2, -1/2, -1/2)
super_vs_spider <- c(1/2, -1/2, 0, 0)
hulk_vs_ninja <- c(0, 0, 1/2, -1/2)

contrasts(hero_tib$hero) <- cbind(flying_vs_not, super_vs_spider, hulk_vs_ninja)
contrasts(hero_tib$hero) # check the contrasts are set correctly
             flying_vs_not super_vs_spider hulk_vs_ninja
Superman               0.5             0.5           0.0
Spiderman              0.5            -0.5           0.0
Hulk                  -0.5             0.0           0.5
Ninja Turtle          -0.5             0.0          -0.5
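As in Task 11.1, it's worth checking that each set of weights sums to zero and that the contrasts are orthogonal. With three contrasts, crossprod() does all the pairwise checks at once: zeros off the diagonal of the result mean every pair of contrasts is orthogonal:

```r
flying_vs_not   <- c(1/2, 1/2, -1/2, -1/2)
super_vs_spider <- c(1/2, -1/2, 0, 0)
hulk_vs_ninja   <- c(0, 0, 1/2, -1/2)
weights <- cbind(flying_vs_not, super_vs_spider, hulk_vs_ninja)

colSums(weights)    # all 0: each set of weights is a valid contrast
crossprod(weights)  # off-diagonal entries all 0: the contrasts are orthogonal
```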

Next we fit the model using this code:

hero_lm <- lm(injury ~ hero, data = hero_tib)

anova(hero_lm) |>
  model_parameters(es_type = "omega", ci = 0.95) |>
  display()
Table 95
Parameter Sum_Squares df Mean_Square F p Omega2 Omega2 95% CI
hero 4180.62 3 1393.54 8.32 < .001 0.42 (0.14, 1.00)
Residuals 4356.58 26 167.56

Anova Table (Type 1 tests)

Table 95 tells us that there was a significant effect of the costume worn on injury severity, F(3, 26) = 8.32, p < 0.001, \(\hat{\omega}^2\) = 0.42 (0.14, 1.00).

We can evaluate the model by looking at Figure 28. There are no large Cook’s distances, but the Q-Q plot suggests non-normal residuals, and the residual vs. fitted and scale-location plots suggest heterogeneity of variance (the columns of dots are different lengths and the line is not flat, especially in the scale-location plot).

check_model(hero_lm)
Figure 28

Let’s fit a robust model as a sensitivity check:

oneway.test(injury ~ hero, data = hero_tib) |> 
  model_parameters() |> 
  display()
One-way analysis of means (not assuming equal variances)
F df df (error) p
7.10 3 13.02 0.005

The Welch F is still highly significant. Now view the robust parameter estimates using:

model_parameters(hero_lm, vcov = "HC4") |>
  display()
Table 96
Parameter Coefficient SE 95% CI t(26) p
(Intercept) 40.90 2.56 (35.64, 46.15) 16.00 < .001
hero (flying_vs_not) 20.17 5.11 (9.66, 30.67) 3.95 < .001
hero (super_vs_spider) 18.71 8.61 (1.02, 36.40) 2.17 0.039
hero (hulk_vs_ninja) 9.12 5.52 (-2.22, 20.47) 1.65 0.110

Table 96 shows that hypothesis 1 is supported: flying costumes were associated with more severe injuries than non-flying ones, \(\hat{b}\) = 20.17 (9.66, 30.67), t(26) = 3.95, p < 0.001. The other contrasts tell us that Superman costumes were associated with more severe injuries than Spiderman costumes, \(\hat{b}\) = 18.71 (1.02, 36.40), t(26) = 2.17, p = 0.039, but that injury severity was not significantly different between Hulk and Ninja Turtle costumes, \(\hat{b}\) = 9.12 (-2.22, 20.47), t(26) = 1.65, p = 0.110.

Task 11.4

I read a story in a newspaper (yes, back when they existed) claiming that the chemical genistein, which is naturally occurring in soya, was linked to lowered sperm counts in Western males. When you read the actual study, it had been conducted on rats, and it found no link to lowered sperm counts, although there was evidence of abnormal sexual development in male rats (probably because genistein acts like oestrogen). As journalists tend to do, a study showing no link between soya and sperm counts was used as the scientific basis for an article about soya being the cause of declining sperm counts in Western males. Imagine the rat study was enough for us to want to test this idea in humans. We recruit 80 males and split them into four groups that vary in the number of soya ‘meals’ (a dinner containing 75g of soya) they ate per week over a year: no soya meals (i.e., none in the whole year), one per week (52 over the year), four per week (208 over the year), and seven per week (364 over the year). At the end of the year, participants produced some sperm that I could count (when I say ‘I’, I mean someone else in a laboratory as far away from me as humanly possible). The data are in soya.csv.

Load the data using (see Tip 2):

soya_tib <- discovr::soya

Let’s start with a boxplot of the data (Figure 29):

ggplot(soya_tib, aes(soya, sperm)) +
  geom_boxplot(colour = "#F05B12FF", fill = "#F05B12FF", alpha = 0.3) +
  coord_cartesian(ylim = c(0, 130)) +
  scale_y_continuous(breaks = seq(0, 130, 10)) +
  labs(x = "Number of soya meals per week", y = "Sperm count (Millions)") +
  theme_minimal()
Figure 29

A boxplot of the data suggests that (1) scores within conditions are skewed; and (2) variability in scores is different across groups. Let’s verify this using some residual plots (Figure 30):

sperm_lm <- lm(sperm ~ soya, data = soya_tib)
check_model(sperm_lm)
Figure 30

Yep, just as I suspected, the residuals are properly fucked. They’re hideously non-normal and the non-flat green line in the middle-left plot suggests heteroscedasticity. No problem, we’ll fit a robust model. By default R will dummy code the predictor with the first category (no soya meals) as the reference, which makes sense, so we don’t need to set contrasts.
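If you want to see what this default dummy (treatment) coding looks like, here’s a minimal base-R sketch on a toy factor (not the real data):

```r
# Toy factor with the same structure as the soya groups.
# R's default treatment coding makes the first level the reference.
soya <- factor(c("none", "1 meal", "4 meals", "7 meals"),
               levels = c("none", "1 meal", "4 meals", "7 meals"))
contrasts(soya)
# Each column compares one group to 'none'; the 'none' row is all zeros.
```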

Let’s fit a robust model. Table 97 shows that the Welch F is highly significant, suggesting that sperm counts are significantly affected by the number of soya meals eaten per week, F(3, 34.66) = 6.28, p = 0.002.

sperm_rob <- oneway.test(sperm ~ soya, data = soya_tib)
model_parameters(sperm_rob) |>
  display()
Table 97: One-way analysis of means (not assuming equal variances)
F df df (error) p
6.28 3 34.66 0.002

Now the robust parameters:

model_parameters(model = sperm_lm, vcov = "HC4") |> 
  display()
Table 98
Parameter Coefficient SE 95% CI t(76) p
(Intercept) 49.87 11.37 (27.22, 72.51) 4.39 < .001
soya (1 soya meal per week) -3.82 15.44 (-34.57, 26.94) -0.25 0.806
soya (4 soya meals per week) -8.77 15.05 (-38.74, 21.21) -0.58 0.562
soya (7 soya meals per week) -33.34 11.64 (-56.51, -10.16) -2.86 0.005
Important Write it up!

Sperm counts in people who had 1, \(\hat{b}\) = -3.82 (-34.57, 26.94), t(76) = -0.25, p = 0.806, or 4 soya meals, \(\hat{b}\) = -8.77 (-38.74, 21.21), t(76) = -0.58, p = 0.562, per week were not significantly different to having no soya in the diet. However, people who ate soya every day had significantly lower sperm counts than those with no soya in their diet, \(\hat{b}\) = -33.34 (-56.51, -10.16), t(76) = -2.86, p = 0.005.
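For the curious, the HC4 tests that model_parameters() produces via vcov = "HC4" can (I believe) be reproduced directly with the sandwich and lmtest packages, which model_parameters() uses behind the scenes. A sketch on simulated data (so the numbers won’t match Table 98), assuming both packages are installed:

```r
# Simulated stand-in for the soya data: 4 groups of 20.
set.seed(42)
dat <- data.frame(
  soya  = gl(4, 20, labels = c("none", "1 meal", "4 meals", "7 meals")),
  sperm = rnorm(80, mean = 45, sd = 12)
)
fit <- lm(sperm ~ soya, data = dat)

# t-tests of the parameters using HC4 heteroscedasticity-consistent SEs
lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC4"))
```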

Task 11.5

Mobile phones emit microwaves, and so holding one next to your brain for large parts of the day is a bit like sticking your brain in a microwave oven and pushing the ‘cook until well done’ button. If we wanted to test this experimentally, we could get six groups of people and strap a mobile phone on their heads, then by remote control turn the phones on for a certain amount of time each day. After six months, we measure the size of any tumour (in mm3) close to the site of the phone antenna (just behind the ear). The six groups experienced 0, 1, 2, 3, 4 or 5 hours per day of phone microwaves for six months. Do tumours significantly increase with greater daily exposure? The data are in tumour.csv.

Load the data using (see Tip 2):

tumour_tib <- discovr::tumour

Plot the data

Let’s start with an error bar plot of the data (Figure 31):

ggplot(tumour_tib, aes(usage, tumour)) +
  stat_summary(fun.data = "mean_cl_boot", colour = "#3E9BFEFF") +
  coord_cartesian(ylim = c(0, 7)) +
  scale_y_continuous(breaks = 0:7) +
  labs(x = "Hours on phone per day", y = bquote(Size~of~tumour~(mm^3))) +
  theme_minimal()
Figure 31

Figure 31 shows the mean size of brain tumour in each condition, and the funny ‘I’ shapes show the confidence interval of these means. Note that in the control group (0 hours), the mean size of the tumour is virtually zero (we wouldn’t expect them to have a tumour) and the error bar shows that there was very little variance across samples - this almost certainly means we cannot assume equal variances. Let’s verify this by fitting a model and viewing the residual plots (Figure 32).

phone_lm <- lm(tumour ~ usage, data = tumour_tib)
check_model(phone_lm)
Figure 32

Holy shit, the residuals were bad in the last task, but these ones be like

These residuals are hideously non-normal and the non-flat green lines on the relevant plots suggest heteroscedasticity. However, we can have the last laugh by fitting a robust model. By default R will dummy code the predictor with the first category (no hours on the phone) as the reference, which makes sense, so we don’t need to set contrasts.

Table 99 shows that the Welch F is highly significant, suggesting that tumour size is significantly affected by the number of hours the phone was active per day, F(5, 44.39) = 414.93, p < 0.001.

phone_rob <- oneway.test(tumour ~ usage, data = tumour_tib)
model_parameters(phone_rob) |>
  display()
Table 99: One-way analysis of means (not assuming equal variances)
F df df (error) p
414.93 5 44.39 < .001

Table 100 shows the parameter estimates and robust confidence intervals and tests.

model_parameters(model = phone_lm, vcov = "HC4") |> 
  display()
Table 100
Parameter Coefficient SE 95% CI t(114) p
(Intercept) 0.02 2.71e-03 (0.01, 0.02) 6.47 < .001
usage (1 hour) 0.50 0.06 (0.37, 0.62) 7.82 < .001
usage (2 hours) 1.24 0.11 (1.03, 1.46) 11.30 < .001
usage (3 hours) 3.00 0.17 (2.66, 3.34) 17.55 < .001
usage (4 hours) 4.87 0.16 (4.56, 5.18) 31.28 < .001
usage (5 hours) 4.71 0.17 (4.37, 5.06) 26.96 < .001
Important Write it up!

Table 100 shows the parameter estimates and confidence intervals and tests based on heteroscedasticity consistent standard errors (HC4). Tumour sizes were significantly larger in every group compared to the control group. As such, people with a phone active next to their head for 1, 2, 3, 4 or 5 hours per day had, on average, significantly larger tumours than people who did not have an active phone strapped to their head.

Task 11.6

Using the Glastonbury data from Chapter 8 (glastonbury.csv), fit a model to see if the change in hygiene (change) is significant across people with different musical tastes (subculture). Use a simple contrast to compare each group against the no affiliation group.

Load the data using (see Tip 2):

glasters_tib <- discovr::glastonbury

Let’s start with a boxplot of the data (Figure 33):

glasters_tib |>  
  filter(!is.na(change)) |>  # filter out cases that have NA (missing values) on the outcome variable
  ggplot(data = _, aes(subculture, change)) +
  geom_boxplot(colour = "#F05B12FF", fill = "#F05B12FF", alpha = 0.3) +
  coord_cartesian(ylim = c(-3, 1.5)) +
  scale_y_continuous(breaks = seq(-3, 1.5, 0.5)) +
  labs(x = "Subculture", y = "Change in hygiene scores across the festival") +
  theme_minimal()
Figure 33

Remember that negative scores mean that hygiene declines across the festival. Figure 33 suggests that there was the least decline in hygiene for people with no subculture and metalheads, and greater hygiene decline for ravers, who had a similar change to hipsters. There looks to be a bit of skew and different spreads of scores, which might suggest the knock-on effects of non-normal residuals and heteroscedasticity, but let’s verify this by fitting a model and using it to view residual plots (Figure 34).

stink_lm <- lm(change ~ subculture, data = glasters_tib)
check_model(stink_lm)
Figure 34

The Q-Q plot suggests normal residuals but the linearity and homoscedasticity plots look weird (Figure 34). This is because check_model() sometimes scales the y-axis oddly. Figure 35 and Figure 36 use base R’s native function for plotting residuals, which scales them better and shows lovely flat reference lines. Happy days.

plot(stink_lm, which = 1)
Figure 35
plot(stink_lm, which = 3)
Figure 36

Let’s fit the model. By default R will dummy code the predictor with the first category (no subculture) as the reference, which makes sense, so we don’t need to set contrasts.

Next we fit the model using this code:

stink_lm <- lm(change ~ subculture, data = glasters_tib) # not really necessary because we already did this to get the residual plots

anova(stink_lm) |>
  model_parameters(es_type = "omega", ci = 0.95) |>
  display()
Table 101
Parameter Sum_Squares df Mean_Square F p Omega2 Omega2 95% CI
subculture 4.65 3 1.55 3.27 0.024 0.05 (0.00, 1.00)
Residuals 56.36 119 0.47

Anova Table (Type 1 tests)

Table 101 tells us that there was a significant effect of subculture membership on the change in hygiene scores, F(3, 119) = 3.27, p = 0.024, \(\hat{\omega}^2\) = 0.05 (0.00, 1.00). View the parameters using:

model_parameters(stink_lm) |>
  display()
Table 102
Parameter Coefficient SE 95% CI t(119) p
(Intercept) -0.55 0.09 (-0.73, -0.38) -6.13 < .001
subculture (Raver) -0.41 0.17 (-0.74, -0.08) -2.46 0.015
subculture (Metalhead) 0.03 0.16 (-0.29, 0.35) 0.18 0.860
subculture (Hipster) -0.41 0.20 (-0.82, -4.21e-03) -2.00 0.048

Table 102 shows the parameter estimates, each comparing one subculture against the no-affiliation reference group: metalheads did not differ significantly from those with no subculture, whereas both ravers and hipsters showed significantly greater declines in hygiene than the no-affiliation group.

Important Write it up!

Results showed that the change in hygiene scores was not statistically different between those with no subculture and metalheads, \(\hat{b}\) = 0.03 (-0.29, 0.35), t(119) = 0.18, p = 0.860. However, ravers became significantly more smelly (the negative change in hygiene scores was greater in magnitude) than those with no subculture,\(\hat{b}\) = -0.41 (-0.74, -0.08), t(119) = -2.46, p = 0.015, the same was true of hipsters compared to those with no subculture, \(\hat{b}\) = -0.41 (-0.82, -0.00), t(119) = -2.00, p = 0.048.

Task 11.7

Repeat the analysis in Labcoat Leni’s Real Research 11.2 but using copulatory efficiency as the outcome.

Load the data using (see Tip 2):

eggs_tib <- discovr::cetinkaya_2006

Let’s first look at some boxplots (Figure 37):

ggplot(eggs_tib, aes(groups, efficiency)) +
  geom_boxplot(colour = "#3E9BFEFF", fill = "#3E9BFEFF", alpha = 0.4) +
  coord_cartesian(ylim = c(0, 35)) +
  scale_y_continuous(breaks = seq(0, 35, 5)) +
  labs(x = "Fetish group", y = "Copulatory efficiency (out of 33)") +
  theme_minimal()
Figure 37

There are outliers and skew in the non-fetishistic group and skew in the control group also. This is confirmed by diagnostic plots from a standard linear model (Figure 38).1

1 Again, the residual plots are a little off here so we can also look at the ones from base R in Figure 39 and Figure 40

egg_lm <- lm(efficiency ~ groups, data = eggs_tib)
check_model(egg_lm)
Figure 38
plot(egg_lm, which = 1)
Figure 39
plot(egg_lm, which = 3)
Figure 40

The authors were wise to use a nonparametric test. We’ll use a 20% trimmed-mean test with post hoc tests.

Table 103 tells us that there was a significant effect, F(2, 19.64) = 20.47, p < 0.001. Although we’ve applied a robust test rather than a nonparametric one the results of the study are confirmed.

egg_rob <- WRS2::t1way(efficiency ~ groups, data = eggs_tib, nboot = 1000)
model_parameters(egg_rob) |>
  display()
Table 103: A heteroscedastic one-way ANOVA for trimmed means
F df df (error) p Estimate 95% CI Effectsize
20.47 2 19.64 < .001 0.67 (0.48, 0.88) Explanatory measure of effect size

Let’s look at the post hoc tests in Table 104. There was no significant difference between the control group and the fetishistic group, \(\hat{\Psi}\) = -0.03 (-3.33, 3.27), p = 0.983, but significant differences were found between the control group and the non-fetishistic group, \(\hat{\Psi}\) = 8.04 (4.49, 11.59), p < 0.001, and between the fetishistic group and the non-fetishistic group, \(\hat{\Psi}\) = -8.07 (-11.86, -4.28), p < 0.001. We know by looking at the boxplot (the medians in particular) that the non-fetishistic males yielded significantly higher efficiency scores than both the fetishistic male quail and the control male quail. These results confirm the findings reported from the nonparametric tests in the paper (page 430):

A one-way ANOVA yielded a significant main effect of groups, \(F(2, 56) = 6.04\), \(p < .05\), \(\eta_p^2 = 0.18\). Paired comparisons (with the Bonferroni correction) indicated that the nonfetishistic male quail copulated with the live female quail (US) more efficiently than both the fetishistic male quail (mean difference = 6.61; 95% CI = 1.41, 11.82; p < .05) and the control male quail (mean difference = 5.83; 95% CI = 1.11, 10.56; p < .05). The difference between the efficiency scores of the fetishistic and the control male quail was not significant (mean difference = 0.78; 95% CI = -5.33, 3.77; p > .05).

egg_ph <- WRS2::lincon(efficiency ~ groups, data = eggs_tib) 

model_parameters(egg_ph) |> 
  format_table() |> 
  display()
Table 104
Group1 Group2 Psihat 95% CI p
Fetishistics NonFetishistics -8.07 [-11.86, -4.28] < .001
Fetishistics Control -0.03 [ -3.33, 3.27] 0.983
NonFetishistics Control 8.04 [ 4.49, 11.59] < .001

Task 11.8

A sociologist wanted to compare murder rates (murder) recorded in each month in a year at three high-profile locations in London (street): Ruskin Avenue, Acacia Avenue and Rue Morgue. Fit a robust model with bootstrapping to see in which streets the most murders happened. The data are in murder.csv.

Load the data using (see Tip 2):

murder_tib <- discovr::murder

Let’s first look at some boxplots (Figure 41):

ggplot(murder_tib, aes(street, murder)) +
  geom_boxplot(colour = "#3E9BFEFF", fill = "#3E9BFEFF", alpha = 0.4) +
  coord_cartesian(ylim = c(0, 7)) +
  scale_y_continuous(breaks = 0:7) +
  labs(x = "Street name", y = "Number of murders per month") +
  theme_minimal()
Figure 41

It looks like most murders were in the Rue Morgue, but there is skew in all of the groups. This is confirmed by diagnostic plots from a standard linear model (Figure 42):

murder_lm <- lm(murder ~ street, data = murder_tib)
check_model(murder_lm)
Figure 42

The residuals are both non-normal and heteroscedastic. We’ll use a bootstrapped test of 20% trimmed means.
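In case trimmed means are new to you, here’s a tiny base-R illustration of what trimming does (toy numbers, not the murder data):

```r
# A 20% trimmed mean discards the top and bottom 20% of the ordered
# scores before averaging, which tames outliers and skew.
x <- c(1, 2, 2, 3, 3, 3, 4, 50)
mean(x)             # 8.5: the ordinary mean, dragged up by the outlier
mean(x, trim = 0.2) # 2.83: the mean of the middle six scores
```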

set.seed(1980) # Because it's using a bootstrap I'm setting a seed so the results match the main text
WRS2::t1waybt(murder ~ street, data = murder_tib, nboot = 1000)
Call:
WRS2::t1waybt(formula = murder ~ street, data = murder_tib, nboot = 1000)

Effective number of bootstrap samples was 901.

Test statistic: 2.6011 
p-value: 0.16315 
Variance explained: 0.384 
Effect size: 0.62 

The output tells us that there was not a significant effect, F = 2.60, p = 0.163, although the effect size is large (38% of variance explained). We don’t need to look at the post hoc tests because the main effect is not significant.

Chapter 12

Task 12.1

A few years back I was stalked. You’d think they could have found someone a bit more interesting to stalk, but apparently times were hard. It wasn’t particularly pleasant, but could have been a lot worse. I imagined a world in which a psychologist tried two different therapies on different groups of stalkers (25 stalkers in each group – this variable is called therapy). To the first group he gave cruel-to-be-kind therapy (every time the stalkers followed him around, or sent him a letter, the psychologist attacked them with a cattle prod). The second therapy was psychodyshamic therapy, in which stalkers were hypnotized and regressed into their childhood to discuss their penis (or lack of penis), their dog’s penis, the seventh penis of a seventh penis and any other penis that sprang to mind. The psychologist measured the number of hours stalking in one week both before (stalk_pre) and after (stalk_post) treatment (stalker.csv). Analyse the effect of therapy on stalking behaviour after therapy, adjusting for the amount of stalking behaviour before therapy.

Load the data using (see Tip 2):

stalk_tib <- discovr::stalker

The design of this study warrants an ANCOVA approach to the model. First visualise the data (Figure 43 and Figure 44)

ggplot(stalk_tib, aes(x = therapy, y = stalk_post, colour = therapy)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.6) +
  geom_violin(alpha = 0.2) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  coord_cartesian(ylim = c(0, 100)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.8) +
  labs(x = "Therapy group", y = "Post-therapy stalking (hours)", colour = "Therapy group") +
  theme_minimal() +
  theme(legend.position = "none")
Figure 43
ggplot(stalk_tib, aes(x = stalk_pre, y = stalk_post)) +
  geom_smooth(method = "lm", colour = "#CC6677", fill = "#CC6677", alpha = 0.2) +
  geom_point(colour = "#882255") +
  coord_cartesian(ylim = c(0, 100)) +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  labs(x = "Pre-treatment stalking (hrs)", y = "Post-treatment stalking (hrs)") +
  facet_wrap(~therapy) +
  theme_minimal()
Figure 44
Table 105

With only two therapy groups we don’t need to set contrasts. First let’s test the assumption of parallel slopes/homogeneity of regression slopes. Table 106 shows a non-significant interaction between the therapy type and pre-therapy stalking, F(1, 46) = 0.88, p = 0.353, suggesting that it’s reasonable to fit the parallel slopes model in stalk_lm.

stalk_lm <- lm(stalk_post ~ stalk_pre + therapy, data = stalk_tib)
full_lm <- update(stalk_lm, .~. + therapy:stalk_pre)

test_wald(stalk_lm, full_lm) |>
  display()
Table 106
Name Model df df_diff F p
stalk_lm lm 47
full_lm lm 46 1 0.88 0.353

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.
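If you prefer base R, the nested-model comparison that test_wald() performs can also be run with anova(). A sketch on simulated data (the names mirror the stalker example, but the numbers are made up):

```r
# Simulated stand-in for the stalker data: a covariate plus two groups.
set.seed(1)
d <- data.frame(pre = rnorm(50, 60, 10),
                therapy = gl(2, 25, labels = c("A", "B")))
d$post <- 5 + 0.8 * d$pre + rnorm(50, 0, 8)

# Parallel-slopes model vs model that adds the covariate-by-group interaction
slopes_lm <- lm(post ~ pre + therapy, data = d)
full_lm   <- update(slopes_lm, . ~ . + pre:therapy)
anova(slopes_lm, full_lm)  # F test of the interaction term
```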

Table 107 summarizes the main model. The covariate significantly predicted the outcome variable, F(1, 47) = 50.46, p < 0.001, so the hours spent stalking after therapy depend on the extent of the initial problem (i.e. the hours spent stalking before therapy). More interesting is that after adjusting for the effect of initial stalking behaviour, the effect of therapy is significant, F(1, 47) = 5.49, p = 0.023.

car::Anova(stalk_lm, type = 3) |>
  model_parameters(es_type = "omega", ci = 0.95) |>
  display(use_symbols = TRUE)
Table 107
Parameter Sum_Squares df Mean_Square F p ω² (partial) ω² 95% CI
stalk_pre 4414.60 1 4414.60 50.46 < .001 0.50 (0.33, 1.00)
therapy 480.26 1 480.26 5.49 0.023 0.08 (9.16e-04, 1.00)
Residuals 4111.72 47 87.48

Anova Table (Type 3 tests)

To interpret the results of the main effect of therapy we look at the adjusted means in Table 108. The adjusted means tell us that stalking behaviour was lower after the therapy involving the cattle prod than after Psychodyshamic therapy (after adjusting for baseline stalking).

estimate_means(stalk_lm, by = c("therapy")) |>
  display()
Table 108: Estimated Marginal Means
therapy Mean SE 95% CI t(47)
Cruel to be kind 55.30 1.87 (51.53, 59.06) 29.55
Psychodyshamic 61.50 1.87 (57.74, 65.27) 32.87

Variable predicted: stalk_post; Predictors modulated: therapy; Predictors averaged: stalk_pre (65)
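Adjusted means like those from estimate_means() can also be computed by hand with predict(), fixing the covariate at its mean. A sketch on simulated data (variable names are illustrative, not the real stalker data):

```r
# Simulated covariate + two groups with a true group effect of 6.
set.seed(1)
d <- data.frame(pre = rnorm(50, 60, 10),
                therapy = gl(2, 25, labels = c("A", "B")))
d$post <- 5 + 0.8 * d$pre + (d$therapy == "B") * 6 + rnorm(50, 0, 8)
fit <- lm(post ~ pre + therapy, data = d)

# Predict each group's outcome at the mean of the covariate:
new_d <- data.frame(therapy = factor(c("A", "B")), pre = mean(d$pre))
predict(fit, newdata = new_d)  # the covariate-adjusted group means
```

The difference between these two predictions is exactly the model’s coefficient for the group dummy, which is why the parameter estimate and the adjusted means tell a consistent story.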

To interpret the covariate look at the parameter estimates (Table 109). The parameter estimate shows that there is a positive relationship between stalking pre-therapy and post-therapy, \(\hat{b}\) = 0.89 (0.64, 1.14), t(47) = 7.10, p < 0.001. That is, higher stalking levels pre-therapy corresponded with higher levels post therapy.

model_parameters(stalk_lm) |> 
  display()
Table 109
Parameter Coefficient SE 95% CI t(47) p
(Intercept) -2.84 8.35 (-19.64, 13.96) -0.34 0.735
stalk pre 0.89 0.13 (0.64, 1.14) 7.10 < .001
therapy (Psychodyshamic) 6.20 2.65 (0.88, 11.53) 2.34 0.023

Evaluate the model assumptions using Figure 45. The plots suggest non-normal residuals and heterogeneity of variance. Which brings us to Task 2.

check_model(stalk_lm)
Figure 45

Task 12.2

Fit a robust model for Task 1

We should fit a robust model. We typically deal with the problem that matters most, which is heteroscedasticity. However, the Q-Q plot showed some fairly extreme deviations (especially in the lower tail) so we’ll re-fit the model using robust::lmRob(), which will down-weight these cases. The robust model, contrary to the non-robust model, suggests that after adjusting for pre-therapy levels of stalking there is no significant difference between the types of treatment. Note also that the parameter estimate for the type of therapy has fallen from 6.2 (non-robust) to 1.6 (robust), suggesting a much smaller effect when robust estimation is used.

Table 110
stalk_rob <- robust::lmRob(stalk_post ~ stalk_pre + therapy, data = stalk_tib)
summary(stalk_rob)

Call:
robust::lmRob(formula = stalk_post ~ stalk_pre + therapy, data = stalk_tib)

Residuals:
       Min         1Q     Median         3Q        Max 
-34.783005  -2.426286   0.003723   1.894398  21.643705 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            9.60233    3.20285   2.998  0.00433 ** 
stalk_pre              0.76178    0.04596  16.575  < 2e-16 ***
therapyPsychodyshamic  1.61503    0.92560   1.745  0.08755 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.207 on 47 degrees of freedom
Multiple R-Squared: 0.5159 

Test for Bias:
            statistic p-value
M-estimate      6.588 0.08627
LS-estimate     9.291 0.02567
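If you don’t have the robust package installed, MASS::rlm() (MASS ships with R) is another M-estimator that down-weights outlying cases. A toy sketch with one wild observation, not the model fitted above:

```r
# Simulated line y = 2 + 0.5x with one gross outlier at the end.
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 2 + 0.5 * d$x + rnorm(20, 0, 0.5)
d$y[20] <- 40  # the outlier

coef(lm(y ~ x, data = d))        # OLS slope, pulled towards the outlier
coef(MASS::rlm(y ~ x, data = d)) # robust slope, closer to the true 0.5
```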

Task 12.3

A marketing manager tested the benefit of soft drinks for curing hangovers. He took 15 people and got them drunk. The next morning as they awoke, dehydrated and feeling as though they’d licked a camel’s sandy feet clean with their tongue, he gave five of them water to drink, five of them Lucozade (a very nice glucose-based UK drink) and the remaining five a leading brand of cola (this variable is called drink). He measured how well they felt (on a scale from 0 = I feel like death to 10 = I feel really full of beans and healthy) two hours later (this variable is called well). He measured how drunk the person got the night before on a scale of 0 = straight edge to 10 = flapping about like a haddock out of water (hangover.csv). Fit a model to see whether people felt better after different drinks when adjusting for how drunk they were the night before.

Load the data using (see Tip 2):

cure_tib <- discovr::hangover

The design of this study warrants an ANCOVA approach to the model. First visualise the data (Figure 46 and Figure 47). Notice that the violin plots look like different drinking receptacles (cup, bottle, bowl).

ggplot(cure_tib, aes(x = drink, y = well, colour = drink)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.6) +
  geom_violin(alpha = 0.2) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  coord_cartesian(ylim = c(0, 10)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.8) +
  labs(x = "Drink", y = "Wellness (0-10)", colour = "Drink") +
  theme_minimal() +
  theme(legend.position = "none")
Figure 46
ggplot(cure_tib, aes(x = drunk, y = well)) +
  geom_smooth(method = "lm", colour = "#CC6677", fill = "#CC6677", alpha = 0.2) +
  geom_point(colour = "#882255", position = position_jitter()) +
  coord_cartesian(ylim = c(0, 10)) +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  labs(x = "Drunkness (0-10)", y = "Wellness (0-10)") +
  facet_wrap(~drink) +
  theme_minimal()
Figure 47
Table 111

We have three levels of drink (Water, Lucozade, Cola). Two of these drinks contain sugar/glucose, so water acts as a control. Therefore, we could reasonably set a contrast that compares the ‘sugary drinks’ to water, and then a second contrast that compares the two sugary drinks to each other. The order of the levels is Water, Lucozade, Cola, which we can confirm by executing:

levels(cure_tib$drink)
[1] "Water"    "Lucozade" "Cola"    

Therefore, we’d set these contrasts as:

sugar_vs_none <- c(-2/3, 1/3, 1/3)
lucoz_vs_cola <- c(0, -1/2, 1/2)

contrasts(cure_tib$drink) <- cbind(sugar_vs_none, lucoz_vs_cola)
contrasts(cure_tib$drink) # execute to check/view the contrast
         sugar_vs_none lucoz_vs_cola
Water       -0.6666667           0.0
Lucozade     0.3333333          -0.5
Cola         0.3333333           0.5
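As a quick sanity check, these two contrasts are orthogonal: the products of their weights sum to zero, so each contrast tests an independent chunk of the group differences.

```r
sugar_vs_none <- c(-2/3, 1/3, 1/3)
lucoz_vs_cola <- c(0, -1/2, 1/2)
sum(sugar_vs_none * lucoz_vs_cola)  # 0: the contrasts are orthogonal
```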

First let’s test the assumption of parallel slopes/homogeneity of regression slopes. Table 112 shows a non-significant interaction between the drink type and how drunk someone was the night before, F(2, 9) = 2.73, p = 0.119, suggesting that it’s reasonable to fit the parallel slopes model in cure_lm.

cure_lm <- lm(well ~ drunk + drink, data = cure_tib)
full_lm <- update(cure_lm, .~. + drink:drunk)

test_wald(cure_lm, full_lm) |>
  display()
Table 112
Name Model df df_diff F p
cure_lm lm 11
full_lm lm 9 2 2.73 0.119

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

Table 113 summarizes the main model. The covariate significantly predicted the outcome variable, F(1, 11) = 27.89, p < 0.001, so the drunkenness of the person influenced how well they felt the next day. More interesting is that at average levels of drunkenness the effect of the type of drink used as a hangover cure was significant, F(2, 11) = 4.32, p = 0.041.

car::Anova(cure_lm, type = 3) |>
  model_parameters() |>
  display(use_symbols = TRUE)
Table 113
Parameter Sum_Squares df Mean_Square F p
drunk 11.19 1 11.19 27.89 < .001
drink 3.46 2 1.73 4.32 0.041
Residuals 4.41 11 0.40

Anova Table (Type 3 tests)

Let’s look at the parameter estimates (Table 114). The parameter estimate shows that there was a negative relationship between how drunk a person got and their wellness the following day, \(\hat{b}\) = -0.55 (-0.78, -0.32), t(11) = -5.28, p < 0.001. That is, the more drunk someone got, the less well they felt the following day. We can also see that at average levels of drunkenness, the mean wellness the following day was statistically comparable between the sugary drinks and water, \(\hat{b}\) = 0.64 (-0.13, 1.40), t(11) = 1.82, p = 0.095; however, the Cola and Lucozade groups had significantly different mean wellness, \(\hat{b}\) = -0.99 (-1.96, -0.01), t(11) = -2.23, p = 0.047.

model_parameters(cure_lm) |> 
  display()
Table 114
Parameter Coefficient SE 95% CI t(11) p
(Intercept) 7.40 0.39 (6.54, 8.25) 19.01 < .001
drunk -0.55 0.10 (-0.78, -0.32) -5.28 < .001
drink (sugar_vs_none) 0.64 0.35 (-0.13, 1.40) 1.82 0.095
drink (lucoz_vs_cola) -0.99 0.44 (-1.96, -0.01) -2.23 0.047

The adjusted means (Table 115) tell us that people felt most well in the Lucozade group. Wellness was similar in the Cola and Water groups. This finding explains the pattern of significance for the contrasts: because the Cola group has a very similar mean to the Water group, the combined mean of the Lucozade and Cola groups is also quite similar to the mean for water. The Cola mean ‘drags down’ the mean for ‘sugary drinks’. This is evident from the second contrast that shows that Lucozade drinkers had a significantly higher mean than Cola drinkers.

estimate_means(cure_lm, by = c("drink")) |>
  display()
Table 115: Estimated Marginal Means
drink Mean SE 95% CI t(11)
Water 5.11 0.28 (4.48, 5.73) 17.99
Lucozade 6.24 0.30 (5.59, 6.89) 21.13
Cola 5.25 0.30 (4.59, 5.92) 17.41

Variable predicted: well; Predictors modulated: drink; Predictors averaged: drunk (3.4)

Evaluate the model assumptions using Figure 48. The plots suggest heterogeneity of variance, so a wise researcher would abandon this analysis and fit a robust model. Because the problem was heteroscedasticity we can use a model that applies heteroscedasticity-consistent standard errors, as in Table 116. Contrary to the non-robust model, these robust tests suggest that after adjusting for how much was drunk there are no significant differences at all in wellness between the groups who had different drinks as hangover cures.

check_model(cure_lm)
Figure 48
model_parameters(cure_lm, vcov = "HC4") |> 
  display()
Table 116
Parameter Coefficient SE 95% CI t(11) p
(Intercept) 7.40 0.44 (6.44, 8.36) 16.93 < .001
drunk -0.55 0.11 (-0.79, -0.31) -4.98 < .001
drink (sugar_vs_none) 0.64 0.31 (-0.06, 1.33) 2.02 0.069
drink (lucoz_vs_cola) -0.99 0.52 (-2.13, 0.16) -1.90 0.084

Task 12.4

Compute effect sizes for Task 3 and report the results.

The effect sizes for the main effect of drink were obtained in the previous task by including es_type = "omega", ci = 0.95 on model_parameters() as in Table 117.

car::Anova(cure_lm, type = 3) |>
  model_parameters(es_type = "omega", ci = 0.95) |>
  display(use_symbols = TRUE)
Table 117
Parameter Sum_Squares df Mean_Square F p ω² (partial) ω² 95% CI
drunk 11.19 1 11.19 27.89 < .001 0.64 (0.29, 1.00)
drink 3.46 2 1.73 4.32 0.041 0.31 (0.00, 1.00)
Residuals 4.41 11 0.40

Anova Table (Type 3 tests)

Important Write it up!

We could report the results as follows:

  • The covariate, drunkenness, was significantly related to the how ill the person felt the next day, F(1, 11) = 27.89, p < 0.001, \(\hat{\omega}_p\) = 0.64 [0.29, 1.00]. There was also a significant effect of the type of drink on how well the person felt after adjusting for how drunk they were the night before, F(2, 11) = 4.32, p = 0.041, \(\hat{\omega}_p\) = 0.31 [0.00, 1.00].
  • Robust planned contrasts using HC4 heteroscedasticity-consistent standard errors showed that having a sugary drink did not significantly affect wellness compared to water, \(\hat{b}\) = 0.64 (-0.06, 1.33), t(11) = 2.02, p = 0.069. Drinking Lucozade also did not significantly affect wellness compared to Cola, \(\hat{b}\) = -0.99 (-2.13, 0.16), t(11) = -1.90, p = 0.084.

Task 12.5

The highlight of the elephant calendar is the annual elephant football event in Nepal (Google it). A heated argument burns between the African and Asian elephants. In 2010, the president of the Asian Elephant Football Association, an elephant named Boji, claimed that Asian elephants were more talented than their African counterparts. The head of the African Elephant Football Association, an elephant called Tunc, issued a press statement that read, ‘I make it a matter of personal pride never to take seriously any remark made by something that looks like an enormous scrotum.’ I was called in to settle things. I collected data from the two types of elephant (elephant) over a season and recorded how many goals each elephant scored (goals) and how many years of experience the elephant had (experience). Analyse the effect of the type of elephant on goal scoring, covarying for the amount of football experience the elephant had (elephooty.csv).

Load the data using (see Tip 2):

ele_tib <- discovr::elephooty

The design of this study warrants an ANCOVA approach to the model. First visualise the data (Figure 49 and Figure 50).

ggplot(ele_tib, aes(x = elephant, y = goals, colour = elephant)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.6) +
  geom_violin(alpha = 0.2) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  coord_cartesian(ylim = c(0, 10)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.8) +
  labs(x = "Elephant", y = "Goals", colour = "Elephant") +
  theme_minimal() +
  theme(legend.position = "none")
Figure 49
ggplot(ele_tib, aes(x = experience, y = goals)) +
  geom_smooth(method = "lm", colour = "#CC6677", fill = "#CC6677", alpha = 0.2) +
  geom_point(colour = "#882255", position = position_jitter()) +
  coord_cartesian(ylim = c(0, 9)) +
  scale_x_continuous(breaks = seq(0, 9, 1)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  labs(x = "Experience (Years)", y = "Goals scored") +
  facet_wrap(~elephant) +
  theme_minimal()
Figure 50
Table 118

The categorical predictor (elephant) only has two levels/categories so we don’t need to set contrasts. First let’s test the assumption of parallel slopes/homogeneity of regression slopes. Table 119 shows a non-significant interaction between the type of elephant and their previous experience, F(1, 116) = 0.05, p = 0.828, suggesting that it’s reasonable to fit the parallel slopes model in elephant_lm.

elephant_lm <- lm(goals ~ experience + elephant, data = ele_tib)
full_lm <- update(elephant_lm, .~. + experience:elephant)

test_wald(elephant_lm, full_lm) |>
  display()
Table 119
Name Model df df_diff F p
elephant_lm lm 117
full_lm lm 116 1 0.05 0.828

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

Table 120 shows that the experience of the elephant significantly predicted how many goals they scored, F(1, 117) = 9.93, p = 0.002. After adjusting for the effect of experience, the effect of elephant is also significant. In other words, African and Asian elephants differed significantly in the number of goals they scored, F(1, 117) = 8.59, p = 0.004.

car::Anova(elephant_lm, type = 3) |>
  model_parameters() |>
  display(use_symbols = TRUE)
Table 120
Parameter Sum_Squares df Mean_Square F p
experience 32.32 1 32.32 9.93 0.002
elephant 27.95 1 27.95 8.59 0.004
Residuals 380.80 117 3.25

Anova Table (Type 3 tests)

The adjusted means (Table 121) tell us, specifically, that African elephants scored significantly more goals than Asian elephants after adjusting for prior experience.

estimate_means(elephant_lm, by = c("elephant")) |>
  display()
Table 121: Estimated Marginal Means
elephant Mean SE 95% CI t(117)
Asian 3.59 0.23 (3.13, 4.05) 15.37
African 4.56 0.23 (4.10, 5.02) 19.52

Variable predicted: goals; Predictors modulated: elephant; Predictors averaged: experience (4.2)

Let’s look at the parameter estimates (Table 122). The parameter estimate shows that there is a positive relationship between the two variables: the more prior football experience the elephant had, the more goals they scored in the season, \(\hat{b}\) = 0.31 (0.11, 0.50), t(117) = 3.15, p = 0.002.

model_parameters(elephant_lm) |> 
  display()
Table 122
Parameter Coefficient SE 95% CI t(117) p
(Intercept) 2.31 0.45 (1.41, 3.21) 5.09 < .001
experience 0.31 0.10 (0.11, 0.50) 3.15 0.002
elephant (African) 0.97 0.33 (0.31, 1.63) 2.93 0.004
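As a quick arithmetic check (not part of the original analysis), the adjusted means in Table 121 are just predictions from the parallel-slopes model at the mean experience of 4.2 years, using the rounded coefficients from Table 122:

```r
# Adjusted means are model predictions at the covariate mean.
# Rounded coefficients taken from Table 122:
b0 <- 2.31         # intercept (Asian elephants at experience = 0)
b_exp <- 0.31      # slope for experience
b_african <- 0.97  # shift for African relative to Asian elephants

asian_adj   <- b0 + b_exp * 4.2               # ~3.61
african_adj <- b0 + b_exp * 4.2 + b_african   # ~4.58
c(Asian = asian_adj, African = african_adj)
```

These reproduce the adjusted means in Table 121 (3.59 and 4.56) up to rounding of the coefficients.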

Evaluate the model assumptions using Figure 51. The Q-Q plot suggests normal residuals, but the scale-location plot hints at possible heteroscedasticity. We could do a sensitivity analysis by applying heteroscedasticity-consistent standard errors to the tests of parameters (Table 123). Both effects are still highly significant using robust standard errors. Happy days.

check_model(elephant_lm)
Figure 51
model_parameters(elephant_lm, vcov = "HC4") |> 
  display()
Table 123
Parameter Coefficient SE 95% CI t(117) p
(Intercept) 2.31 0.45 (1.41, 3.21) 5.07 < .001
experience 0.31 0.09 (0.13, 0.48) 3.46 < .001
elephant (African) 0.97 0.33 (0.33, 1.62) 2.98 0.004

Task 12.6

In Chapter 4 (Task 6) we looked at data from people who had fish or cats as pets and measured their life satisfaction and also how much they like animals (pets.csv). Fit a model predicting life satisfaction from the type of pet a person had and their animal liking score (covariate).

Load the data using (see Tip 2):

pet_tib <- discovr::pets

The design of this study warrants an ANCOVA approach to the model. First visualise the data (Figure 52 and Figure 53).

ggplot(pet_tib, aes(x = pet, y = life_satisfaction, colour = pet)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.6) +
  geom_violin(alpha = 0.2) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  coord_cartesian(ylim = c(0, 80)) +
  scale_y_continuous(breaks = seq(0, 80, 10)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.8) +
  labs(x = "Pet", y = "Life satisfaction", colour = "Pet") +
  theme_minimal() +
  theme(legend.position = "none")
Figure 52
ggplot(pet_tib, aes(x = animal, y = life_satisfaction)) +
  geom_smooth(method = "lm", colour = "#CC6677", fill = "#CC6677", alpha = 0.2) +
  geom_point(colour = "#882255", position = position_jitter()) +
  coord_cartesian(ylim = c(0, 80), xlim = c(0, 80)) +
  scale_x_continuous(breaks = seq(0, 80, 10)) +
  scale_y_continuous(breaks = seq(0, 80, 10)) +
  labs(x = "Love of animals", y = "Life satisfaction") +
  facet_wrap(~pet) +
  theme_minimal()
Figure 53
Table 124

The categorical predictor (pet) only has two levels/categories so we don’t need to set contrasts. First let’s test the assumption of parallel slopes/homogeneity of regression slopes. Table 125 shows a non-significant interaction between the type of pet and their love of animals, F(1, 16) = 0.52, p = 0.481, suggesting that it’s reasonable to fit the parallel slopes model in pet_lm.

pet_lm <- lm(life_satisfaction ~ animal + pet, data = pet_tib)
full_lm <- update(pet_lm, .~. + animal:pet)

test_wald(pet_lm, full_lm) |>
  display()
Table 125
Name Model df df_diff F p
pet_lm lm 17
full_lm lm 16 1 0.52 0.481

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

Table 126 shows that the love of animals significantly predicted life satisfaction, F(1, 17) = 10.32, p = 0.005. After adjusting for the love of animals, the type of pet had a significant effect on life satisfaction. In other words, at average levels of liking animals, whether someone owned a cat or fish significantly affected their life satisfaction, F(1, 17) = 16.45, p < 0.001.

car::Anova(pet_lm, type = 3) |>
  model_parameters() |>
  display(use_symbols = TRUE)
Table 126
Parameter Sum_Squares df Mean_Square F p
animal 1325.40 1 1325.40 10.32 0.005
pet 2112.10 1 2112.10 16.45 < .001
Residuals 2183.14 17 128.42

Anova Table (Type 3 tests)

The adjusted means (Table 127) tell us specifically that life satisfaction was higher in cat owners than fish owners at average levels of liking animals.

estimate_means(pet_lm, by = c("pet")) |>
  display()
Table 127: Estimated Marginal Means
pet Mean SE 95% CI t(17)
Fish 38.55 3.27 (31.64, 45.45) 11.78
Cat 59.56 4.01 (51.10, 68.02) 14.85

Variable predicted: life_satisfaction; Predictors modulated: pet; Predictors averaged: animal (36)

Let’s look at the parameter estimates (Table 128). The parameter estimate shows that there is a positive relationship between liking animals and life satisfaction, \(\hat{b}\) = 0.54 (0.19, 0.90), t(17) = 3.21, p = 0.005: the more you like animals the higher your life satisfaction.

model_parameters(pet_lm) |> 
  display()
Table 128
Parameter Coefficient SE 95% CI t(17) p
(Intercept) 18.94 6.82 (4.56, 33.33) 2.78 0.013
animal 0.54 0.17 (0.19, 0.90) 3.21 0.005
pet (Cat) 21.01 5.18 (10.08, 31.94) 4.06 < .001
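It's also worth noting (an arithmetic aside, not extra output) that in a parallel-slopes model the coefficient for pet (Cat) is exactly the difference between the two adjusted means in Table 127, because both groups are evaluated at the same covariate value:

```r
# Difference between adjusted means (Table 127) reproduces the
# pet (Cat) coefficient from Table 128.
cat_adj  <- 59.56  # adjusted mean for cat owners
fish_adj <- 38.55  # adjusted mean for fish owners
cat_adj - fish_adj # 21.01, the pet (Cat) coefficient
```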

Evaluate the model assumptions using Figure 54. The Q-Q plot suggests non-normal residuals and the homogeneity of variance plot suggests heteroscedasticity. We should deal with the most problematic issue by applying heteroscedasticity-consistent standard errors to the tests of parameters (Table 129). Both effects are still highly significant using robust standard errors so our conclusions don’t change.

check_model(pet_lm)
Figure 54
model_parameters(pet_lm, vcov = "HC4") |> 
  display()
Table 129
Parameter Coefficient SE 95% CI t(17) p
(Intercept) 18.94 7.99 (2.08, 35.81) 2.37 0.030
animal 0.54 0.19 (0.15, 0.93) 2.92 0.010
pet (Cat) 21.01 5.00 (10.46, 31.56) 4.20 < .001

Task 12.7

Fit a linear model predicting life satisfaction from the type of pet and the effect of love of animals using what you learnt in Chapter 9. Compare this model to your results for Task 6. What differences do you notice and why?

The conclusions are the same as from the linear model (Table 130); all that has changed is that we haven't assessed overall model fit using the F-statistic. Basically, this task is all about reiterating the false distinction between 'ANCOVA' and 'regression/linear model': in both cases you are using the general linear model.

pet_lm <- lm(life_satisfaction ~ animal + pet, data = pet_tib)
model_parameters(pet_lm, vcov = "HC4") |> 
  display()
Table 130
Parameter Coefficient SE 95% CI t(17) p
(Intercept) 18.94 7.99 (2.08, 35.81) 2.37 0.030
animal 0.54 0.19 (0.15, 0.93) 2.92 0.010
pet (Cat) 21.01 5.00 (10.46, 31.56) 4.20 < .001

Task 12.8

In Chapter 9 we compared the number of mischievous acts in people who had invisibility cloaks to those without (cloak). Imagine we replicated that study, but changed the design so that we recorded the number of mischievous acts in these participants before the study began (mischief_pre) as well as during the study (mischief). Fit a model to see whether people with invisibility cloaks get up to more mischief than those without when factoring in their baseline level of mischief (invisibility_base.csv).

Load the data using (see Tip 2):

cloak_tib <- discovr::invisibility_base

The design of this study warrants an ANCOVA approach to the model. First visualise the data (Figure 55 and Figure 56).

ggplot(cloak_tib, aes(x = cloak, y = mischief_post, colour = cloak)) +
  geom_point(position = position_jitter(width = 0.1), alpha = 0.6) +
  geom_violin(alpha = 0.2) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.9)) +
  coord_cartesian(ylim = c(0, 15)) +
  scale_y_continuous(breaks = seq(0, 15, 1)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.8) +
  labs(x = "Group", y = "Number of mischievous acts", colour = "Group") +
  theme_minimal() +
  theme(legend.position = "none")
Figure 55
ggplot(cloak_tib, aes(x = mischief_pre, y = mischief_post)) +
  geom_smooth(method = "lm", colour = "#CC6677", fill = "#CC6677", alpha = 0.2) +
  geom_point(colour = "#882255", position = position_jitter()) +
  coord_cartesian(ylim = c(0, 15), xlim = c(0, 10)) +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 15, 1)) +
  labs(x = "Number of mischievous acts (baseline)", y = "Number of mischievous acts (post)") +
  facet_wrap(~cloak) +
  theme_minimal()
Figure 56
Table 131

The categorical predictor (cloak) only has two levels/categories so we don’t need to set contrasts. First let’s test the assumption of parallel slopes/homogeneity of regression slopes. Table 132 shows a non-significant interaction between the group to which people were assigned and their pre-experiment mischief, F(1, 76) = 0.34, p = 0.562, suggesting that it’s reasonable to fit the parallel slopes model in cloak_lm.

cloak_lm <- lm(mischief_post ~ mischief_pre + cloak, data = cloak_tib)
full_lm <- update(cloak_lm, .~. + mischief_pre:cloak)

test_wald(cloak_lm, full_lm) |>
  display()
Table 132
Name Model df df_diff F p
cloak_lm lm 77
full_lm lm 76 1 0.34 0.562

Models were detected as nested (in terms of fixed parameters) and are compared in sequential order.

Table 133 shows that baseline mischief significantly predicted mischief during the study, F(1, 77) = 7.40, p = 0.008. After adjusting for baseline mischief, whether or not people were given an invisibility cloak significantly affected mischief during the study. In other words, at average levels of baseline mischief, giving someone an invisibility cloak (or not) significantly affected the number of mischievous acts committed, F(1, 77) = 11.33, p = 0.001.

car::Anova(cloak_lm, type = 3) |>
  model_parameters() |>
  display(use_symbols = TRUE)
Table 133
Parameter Sum_Squares df Mean_Square F p
mischief_pre 22.97 1 22.97 7.40 0.008
cloak 35.17 1 35.17 11.33 0.001
Residuals 239.08 77 3.10

Anova Table (Type 3 tests)

The adjusted means (Table 134) tell us, specifically, that those with an invisibility cloak committed more mischievous acts than those without, after adjusting for baseline mischief.

estimate_means(cloak_lm, by = c("cloak")) |>
  display()
Table 134: Estimated Marginal Means
cloak Mean SE 95% CI t(77)
No cloak 8.79 0.30 (8.19, 9.39) 29.07
Cloak 10.13 0.26 (9.62, 10.65) 38.99

Variable predicted: mischief_post; Predictors modulated: cloak; Predictors averaged: mischief_pre (4.5)

Let’s look at the parameter estimates (Table 135). The parameter estimate shows that there is a positive relationship between pre-experiment mischief and subsequent mischief: the more mischievous you are, the more mischief you get up to regardless of whether you are given an invisibility cloak, \(\hat{b}\) = 0.25 (0.07, 0.43), t(77) = 2.72, p = 0.008.

model_parameters(cloak_lm) |> 
  display()
Table 135
Parameter Coefficient SE 95% CI t(77) p
(Intercept) 7.68 0.50 (6.69, 8.68) 15.39 < .001
mischief pre 0.25 0.09 (0.07, 0.43) 2.72 0.008
cloak (Cloak) 1.34 0.40 (0.55, 2.14) 3.37 0.001

Evaluate the model assumptions using Figure 57. The Q-Q plot suggests non-normal residuals and the homogeneity of variance plot suggests heteroscedasticity. We deal with the most problematic issue by applying heteroscedasticity-consistent standard errors to the tests of parameters (Table 136). Both effects are still highly significant using robust standard errors so our conclusions don’t change.

check_model(cloak_lm)
Figure 57
model_parameters(cloak_lm, vcov = "HC4") |> 
  display()
Table 136
Parameter Coefficient SE 95% CI t(77) p
(Intercept) 7.68 0.46 (6.77, 8.60) 16.74 < .001
mischief pre 0.25 0.08 (0.08, 0.42) 2.91 0.005
cloak (Cloak) 1.34 0.39 (0.56, 2.13) 3.41 0.001

Chapter 13

Task 13.1

People have claimed that listening to heavy metal, because of its aggressive sonic palette and often violent or emotionally negative lyrics, leads to angry and aggressive behaviour (Selfhout et al., 2008). As a very non-violent metal fan this accusation bugs me. Imagine I designed a study to test this possibility. I took groups of self-classifying metalheads and non-metalheads (fan) and assigned them randomly to listen to 15 minutes of either the sound of an angle grinder scraping a sheet of metal (control noise), metal music, or pop music (soundtrack). Each person rated their anger on a scale ranging from 0 (“All you need is love, da, da, da-da-da”) to 100 (“All you wanna do is drag me down, all I wanna do is stamp you out”). Fit a model to test my idea (metal.csv).

Load the data using (see Tip 2):

metal_tib <- discovr::metal

Figure 58 shows that after listening to an angle grinder both groups are angry. After listening to metal, the metalheads score low on anger but the pop fans score high. The reverse is true after listening to pop music.

ggplot(metal_tib, aes(x = soundtrack, y = anger, colour = fan)) +
  geom_violin(alpha = 0.5) +
  stat_summary(fun.data = "mean_cl_normal", position = position_dodge(width = 0.9)) +
  scale_y_continuous(breaks = seq(0, 100, 10)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.8) +
  labs(y = "Anger (out of 100)", x = "Soundtrack", colour = "Fan group") +
  theme_minimal()
Figure 58

Table 137 shows the model fitted using afex::aov_4(). The soundtrack × fan interaction was significant, F(2, 84) = 433.28, p < 0.001, indicating that the effect of the soundtrack depended on the type of fan. This effect supersedes the main effects, so we’ll ignore them. Figure 58 shows the pattern behind the interaction, and we can test it formally with simple effects analysis. Table 138 shows (in combination with the means) that:

  • Anger levels did not differ in pop and metal fans when listening to an angle grinder, F(1, 84) = 0.30, p = 1.000.
  • When listening to metal music, anger was significantly higher in pop fans than metal fans, F(1, 84) = 431.55, p < 0.001.
  • When listening to pop music, anger was significantly higher in metal fans than pop fans, F(1, 84) = 434.80, p < 0.001.
metal_afx <- afex::aov_4(anger ~ soundtrack*fan + (1|id), data = metal_tib)
model_parameters(metal_afx) |> 
  display()
Table 137: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p
soundtrack 20469.76 2 10234.88 116.82 < .001
fan 6.40 1 6.40 0.07 0.788
soundtrack:fan 75919.27 2 37959.63 433.28 < .001
Residuals 7359.20 84 87.61

Anova Table (Type 3 tests)

metal_se <- estimate_contrasts(model = metal_afx,
                               contrast = "fan",
                               by = "soundtrack",
                               comparison = "joint",
                               p_adjust = "bonferroni")

model_parameters(metal_se) |> 
  display()
Table 138: Fixed Effects
Contrast soundtrack df1 df2 F p
fan Angle grinder 1 84 0.30 > .999
fan Metal 1 84 431.55 < .001
fan Pop 1 84 434.80 < .001

p-value adjustment method: Bonferroni
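If you're wondering where p > .999 comes from: Bonferroni adjustment multiplies each raw p-value by the number of comparisons (three simple effects here) and caps the result at 1. A minimal base-R sketch with hypothetical raw p-values:

```r
# Bonferroni: p_adj = min(1, k * p_raw), where k = number of comparisons.
raw_p <- c(0.585, 1e-12, 1e-12)  # hypothetical raw p-values for the 3 contrasts
p.adjust(raw_p, method = "bonferroni")
```

Any raw p-value above 1/3 gets capped at 1, which is displayed as > .999.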

Task 13.2

Compute omega squared for the effects in Task 1 and report the results of the analysis.

The simplest way to add effect sizes is by including es_type = "omega", ci = 0.95 in model_parameters() as in

model_parameters(metal_afx, es_type = "omega", ci = 0.95) |> 
  display()
Table 139: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p Omega2 (partial) Omega2 95% CI
soundtrack 20469.76 2 10234.88 116.82 < .001 0.72 (0.64, 1.00)
fan 6.40 1 6.40 0.07 0.788 0.00 (0.00, 1.00)
soundtrack:fan 75919.27 2 37959.63 433.28 < .001 0.91 (0.88, 1.00)
Residuals 7359.20 84 87.61

Anova Table (Type 3 tests)
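To demystify these values (this is a check, not part of the original output), partial omega squared can be computed by hand from the sums of squares in Table 139 using \(\hat{\omega}_p^2 = (SS_\text{effect} - df_\text{effect} \times MS_\text{error}) / (SS_\text{effect} + (N - df_\text{effect}) \times MS_\text{error})\), where N = 90 (residual df of 84, plus 5 model df, plus 1):

```r
# Partial omega squared from the ANOVA table values in Table 139.
omega_sq_p <- function(ss_eff, df_eff, ms_err, n) {
  (ss_eff - df_eff * ms_err) / (ss_eff + (n - df_eff) * ms_err)
}
n <- 90  # residual df (84) + model df (2 + 1 + 2) + 1
omega_sq_p(20469.76, 2, 87.61, n)  # soundtrack:     ~0.72
omega_sq_p(75919.27, 2, 87.61, n)  # soundtrack:fan: ~0.91
```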

We could report (remember if you’re using APA format to drop the leading zeros before p-values and ω2, for example report p = .035 instead of p = 0.035).
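For example, a minimal base-R helper (hypothetical, not from any package) for APA-style p-values might look like:

```r
# apa_p() is a hypothetical helper: it formats a p-value to 3 decimal
# places with no leading zero, APA style.
apa_p <- function(p) {
  if (p < .001) return("p < .001")
  paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
}
apa_p(0.035)   # "p = .035"
apa_p(0.0004)  # "p < .001"
```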

Important Write it up!

The main effect of soundtrack was significant, F(2, 84) = 116.82, p < 0.001, \(\hat{\omega}_p\) = 0.72 [0.64, 1.00], indicating that anger scores were significantly different across the three soundtracks. It seems that after listening to the angle grinder anger was higher than after both pop and metal. The main effect of fan was not significant, F(1, 84) = 0.07, p = 0.788, \(\hat{\omega}_p\) = 0.00 [0.00, 1.00], indicating that anger scores were not significantly different overall between metal and pop fans. The effect was basically zero. The soundtrack × fan interaction was significant, F(2, 84) = 433.28, p < 0.001, \(\hat{\omega}_p\) = 0.91 [0.88, 1.00], indicating that the effect of the soundtrack depended on the type of fan. Simple effects analysis showed that anger levels did not differ between pop and metal fans when listening to an angle grinder, F(1, 84) = 0.30, p = 1.000, but when listening to metal music, anger was significantly higher in pop fans than metal fans, F(1, 84) = 431.55, p < 0.001, and when listening to pop music, anger was significantly higher in metal fans than pop fans, F(1, 84) = 434.80, p < 0.001.

Task 13.3

In Chapter 5 we used some data that related to male and female arousal levels when watching The Notebook or a documentary about notebooks (notebook.csv). Fit a model to test whether men and women differ in their reactions to different types of films.

Load the data using (see Tip 2):

notebook_tib <- discovr::notebook

Table 140 shows the model fitted using afex::aov_4(). The main effect of gender identity is significant, F(1, 36) = 7.29, p = 0.011, as is the main effect of film, F(1, 36) = 141.87, p < 0.001 and the interaction, F(1, 36) = 4.64, p = 0.038. The interaction supersedes the main effects, so we’ll ignore them. Figure 59 shows that psychological arousal is very similar for self-identifying men and women during the documentary about notebooks (it is low for both). However, for The Notebook, self-identifying men experienced greater psychological arousal than self-identifying women. The interaction is likely to reflect that there is a difference between men and women for one type of film (The Notebook) but not the other (the documentary about notebooks). This is confirmed by the simple effects analysis, which shows (Table 141) that

  • Arousal levels were significantly higher in self-identifying men than in self-identifying women when watching the film ‘The Notebook’, F(1, 36) = 11.78, p = 0.003.
  • Arousal levels did not significantly differ between men and women when watching the documentary about notebooks, F(1, 36) = 0.15, p = 1.000.
notebook_afx <- afex::aov_4(arousal ~ film*gender_identity + (1|id), data = notebook_tib)
model_parameters(notebook_afx) |> 
  display()
Table 140: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p
film 5784.02 1 5784.02 141.87 < .001
gender_identity 297.02 1 297.02 7.29 0.011
film:gender_identity 189.22 1 189.22 4.64 0.038
Residuals 1467.70 36 40.77

Anova Table (Type 3 tests)

ggplot(notebook_tib, aes(x = film, y = arousal, colour = gender_identity, shape = gender_identity)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.2, jitter.width = 0.2), alpha = 0.5) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.2)) +
  coord_cartesian(ylim = c(0, 50)) +
  scale_y_continuous(breaks = seq(0, 50, 10)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.85) +
  labs(x = "Film", y = "Arousal", colour = "Gender identity", shape = "Gender identity") +
  theme_minimal()
Figure 59
notebook_se <- estimate_contrasts(model = notebook_afx,
                               contrast = "gender_identity",
                               by = "film",
                               comparison = "joint",
                               p_adjust = "bonferroni")

model_parameters(notebook_se) |> 
  display()
Table 141: Fixed Effects
Contrast film df1 df2 F p
gender_identity The Notebook 1 36 11.78 0.003
gender_identity A documentary about notebooks 1 36 0.15 > .999

p-value adjustment method: Bonferroni

Task 13.4

Compute omega squared for the effects in Task 3 and report the results of the analysis.

The simplest way to add effect sizes is by including es_type = "omega", ci = 0.95 in model_parameters() as in

model_parameters(notebook_afx, es_type = "omega", ci = 0.95) |> 
  display()
Table 142: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p Omega2 (partial) Omega2 95% CI
film 5784.02 1 5784.02 141.87 < .001 0.78 (0.67, 1.00)
gender_identity 297.02 1 297.02 7.29 0.011 0.14 (0.01, 1.00)
film:gender_identity 189.22 1 189.22 4.64 0.038 0.08 (0.00, 1.00)
Residuals 1467.70 36 40.77

Anova Table (Type 3 tests)

We could report (remember if you’re using APA format to drop the leading zeros before p-values and ω2, for example report p = .035 instead of p = 0.035).

Important Write it up!

The results show that the psychological arousal during the films was significantly higher for males than females, F(1, 36) = 7.29, p = 0.011, \(\hat{\omega}_p\) = 0.14 [0.01, 1.00]. Psychological arousal was also significantly higher during The Notebook than during a documentary about notebooks, F(1, 36) = 141.87, p < 0.001, \(\hat{\omega}_p\) = 0.78 [0.67, 1.00]. The interaction was also significant, F(1, 36) = 4.64, p = 0.038, \(\hat{\omega}_p\) = 0.08 [0.00, 1.00]. Simple effects analysis revealed that arousal levels were significantly higher in self-identifying men than women when watching the film ‘The Notebook’, F(1, 36) = 11.78, p = 0.003, but not when watching the documentary about notebooks, F(1, 36) = 0.15, p = 1.000.

Task 13.5

In Chapter 5 we used some data that related to learning in men and women when either reinforcement or punishment was used in teaching (teach_method.csv). Analyse these data to see whether men and women’s learning differs according to the teaching method used.

Load the data using (see Tip 2). (Note: I rename method to teaching_method because, at the time of writing, a variable named method seems to really mess up estimate_contrasts().)

teach_tib <- discovr::teaching |> 
  rename(
    teaching_method = method
  )

Table 143 shows the model fitted using afex::aov_4(). The main effect of sex was significant, F(1, 16) = 12.25, p = 0.003, but the main effect of the teaching method was not, F(1, 16) = 2.25, p = 0.153. The interaction was significant, F(1, 16) = 30.25, p < 0.001, and supersedes the main effects, so we’ll ignore them. Figure 60 and the simple effects analysis (Table 144) show that

  • When the teacher used punishment (an electric shock), marks differed significantly between men and women, F(1, 16) = 40.50, p < 0.001, with men scoring higher.
  • When the teacher was nice, marks did not differ significantly between men and women, F(1, 16) = 2.00, p = 0.353.
teach_afx <- afex::aov_4(mark ~ teaching_method*sex + (1|id), data = teach_tib)
model_parameters(teach_afx) |> 
  display()
Table 143: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p
teaching_method 11.25 1 11.25 2.25 0.153
sex 61.25 1 61.25 12.25 0.003
teaching_method:sex 151.25 1 151.25 30.25 < .001
Residuals 80.00 16 5.00

Anova Table (Type 3 tests)

ggplot(teach_tib, aes(x = sex, y = mark, colour = teaching_method, shape = teaching_method)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.2, jitter.width = 0.2), alpha = 0.5) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.2)) +
  coord_cartesian(ylim = c(0, 22)) +
  scale_y_continuous(breaks = seq(0, 22, 2)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.85) +
  labs(x = "Sex", y = "Mark (%)", colour = "Teaching method", shape = "Teaching method") +
  theme_minimal()
Figure 60
teach_se <- estimate_contrasts(model = teach_afx,
                               contrast = "sex",
                               by = "teaching_method",
                               comparison = "joint",
                               p_adjust = "bonferroni")

model_parameters(teach_se) |> 
  display()
Table 144: Fixed Effects
Contrast teaching_method df1 df2 F p
sex Electric shock 1 16 40.50 < .001
sex Being nice 1 16 2.00 0.353

p-value adjustment method: Bonferroni

Task 13.6

At the start of this Chapter I described a way of empirically researching whether I wrote better songs than my old bandmate Malcolm, and whether this depended on the type of song (a symphony or song about flies). The outcome variable was the number of screams elicited by audience members during the songs. Plot the data and fit a model to test my hypothesis that the type of song moderates which songwriter is preferred (escape.csv).

Load the data using (see Tip 2):

escape_tib <- discovr::escape

Table 145 shows the model fitted using afex::aov_4(). The main effect of the type of song was significant, F(1, 64) = 20.87, p < 0.001, as was the main effect of the songwriter, F(1, 64) = 9.94, p = 0.002. The interaction was also significant, F(1, 64) = 5.07, p = 0.028. The interaction supersedes the main effects, so we’ll ignore them. Figure 61 and the simple effects analysis (Table 146) show that for songs about a fly the number of screams was comparable between songwriters, F(1, 64) = 0.41, p = 1.000, but for symphonies there were significantly more screams when Andy wrote the song than when Malcolm did, F(1, 64) = 14.61, p < 0.001.

escape_afx <- afex::aov_4(screams ~ song_type*songwriter + (1|id), data = escape_tib)
model_parameters(escape_afx) |> 
  display()
Table 145: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p
song_type 74.13 1 74.13 20.87 < .001
songwriter 35.31 1 35.31 9.94 0.002
song_type:songwriter 18.01 1 18.01 5.07 0.028
Residuals 227.29 64 3.55

Anova Table (Type 3 tests)

ggplot(escape_tib, aes(x = song_type, y = screams, colour = songwriter, shape = songwriter)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.2, jitter.width = 0.2), alpha = 0.5) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.2)) +
  coord_cartesian(ylim = c(0, 15)) +
  scale_y_continuous(breaks = seq(0, 15, 1)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.85) +
  labs(x = "Type of song", y = "Number of screams", colour = "Songwriter", shape = "Songwriter") +
  theme_minimal()
Figure 61
escape_se <- estimate_contrasts(model = escape_afx,
                               contrast = "songwriter",
                               by = "song_type",
                               comparison = "joint",
                               p_adjust = "bonferroni")

model_parameters(escape_se) |> 
  display()
Table 146: Fixed Effects
Contrast song_type df1 df2 F p
songwriter Symphony 1 64 14.61 < .001
songwriter Fly song 1 64 0.41 > .999

p-value adjustment method: Bonferroni

Task 13.7

Compute omega squared for the effects in Task 6 and report the results of the analysis.

The simplest way to add effect sizes is by including es_type = "omega", ci = 0.95 in model_parameters() as in

model_parameters(escape_afx, es_type = "omega", ci = 0.95) |> 
  display()
Table 147: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p Omega2 (partial) Omega2 95% CI
song_type 74.13 1 74.13 20.87 < .001 0.23 (0.09, 1.00)
songwriter 35.31 1 35.31 9.94 0.002 0.12 (0.02, 1.00)
song_type:songwriter 18.01 1 18.01 5.07 0.028 0.06 (0.00, 1.00)
Residuals 227.29 64 3.55

Anova Table (Type 3 tests)

We could report (remember if you’re using APA format to drop the leading zeros before p-values and ω2, for example report p = .035 instead of p = 0.035).

Important Write it up!

The main effect of the type of song significantly affected screams elicited during that song, F(1, 64) = 20.87, p < 0.001, \(\hat{\omega}_p\) = 0.23 [0.09, 1.00]; the two symphonies elicited significantly more screams of agony than the two songs about flies. The main effect of the songwriter significantly affected screams elicited during that song, F(1, 64) = 9.94, p = 0.002, \(\hat{\omega}_p\) = 0.12 [0.02, 1.00]; Andy’s songs elicited significantly more screams of torment from the audience than Malcolm’s songs. The song type \(\times\) songwriter interaction was significant, F(1, 64) = 5.07, p = 0.028, \(\hat{\omega}_p\) = 0.06 [0.00, 1.00]. Simple effects analysis showed that although reactions to Malcolm’s and Andy’s songs were similar for songs about a fly, F(1, 64) = 0.41, p = 1.000, Andy’s symphony elicited significantly more screams of torment than Malcolm’s, F(1, 64) = 14.61, p < 0.001.

Task 13.8

There are reports of increases in injuries related to playing games consoles. These injuries were attributed mainly to muscle and tendon strains. A researcher hypothesized that a stretching warm-up before playing would help lower injuries, and that athletes would be less susceptible to injuries because their regular activity makes them more flexible. She took 60 athletes and 60 non-athletes (athlete); half of them played on a Nintendo Switch and half watched others playing as a control (switch), and within these groups half did a 5-minute stretch routine before playing/watching whereas the other half did not (stretch). The outcome was a pain score out of 10 (where 0 is no pain, and 10 is severe pain) after playing for 4 hours (injury). Fit a model to test whether athletes are less prone to injury, and whether the prevention programme worked (switch.csv).

Load the data using (see Tip 2):

switch_tib <- discovr::switch

Table 148 shows the model fitted using afex::aov_4().

Important Write it up!

There were significant main effects of being an athlete, F(1, 112) = 64.82, p < 0.001, engaging in stretching, F(1, 112) = 11.05, p = 0.001, and of playing or watching the Switch game, F(1, 112) = 55.66, p < 0.001. These main effects were qualified by significant athlete \(\times\) switch, F(1, 112) = 45.18, p < 0.001, and switch \(\times\) stretch, F(1, 112) = 14.19, p < 0.001, interactions. The athlete \(\times\) stretch interaction was not significant, F(1, 112) = 1.23, p = 0.270. All of these effects were superseded by a significant athlete \(\times\) switch \(\times\) stretch interaction, F(1, 112) = 5.94, p = 0.016. This means that the effect of stretching and of playing on the Switch on injury scores was different for athletes than for non-athletes. Figure 62 shows that for athletes, stretching and playing on the Switch had very little effect: their mean injury score was quite stable across conditions (whether they played on the Switch or watched other people playing, stretched or did no stretching). For non-athletes, however, both watching rather than playing and stretching rather than not stretching markedly lowered the mean injury score. The interaction tells us that stretching and watching rather than playing on the Switch both result in a lower injury score, and that this is true only for non-athletes. In short, the results show that athletes are able to minimize their injury level regardless of whether they stretch before exercise or not, whereas non-athletes only have to bend slightly and they get injured!

switch_afx <- afex::aov_4(injury ~ athlete*switch*stretch + (1|id), data = switch_tib)
model_parameters(switch_afx) |> 
  display()
Table 148: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p
athlete 99.01 1 99.01 64.82 < .001
switch 85.01 1 85.01 55.66 < .001
stretch 16.88 1 16.88 11.05 0.001
athlete:switch 69.01 1 69.01 45.18 < .001
athlete:stretch 1.88 1 1.88 1.23 0.270
switch:stretch 21.68 1 21.68 14.19 < .001
athlete:switch:stretch 9.08 1 9.08 5.94 0.016
Residuals 171.07 112 1.53

Anova Table (Type 3 tests)

ggplot(switch_tib, aes(switch, injury, colour = stretch)) + 
  geom_point(position = position_jitterdodge(dodge.width = 0.75, jitter.width = 0.2), size = 1, alpha = 0.2) +
  stat_summary(fun.data = "mean_cl_normal", position = position_dodge(width = 0.75)) + 
  facet_wrap(~athlete) +
  labs(x = "Playing or watching", y = "Injury score (out of 10)", colour = "Stretching before play") +
  discovr::scale_color_prayer() +
  coord_cartesian(ylim = c(0, 10)) + 
  scale_y_continuous(breaks = seq(0, 10, 2)) + 
  theme_minimal() +
  theme(axis.text.x = element_text(size = rel(0.8)))
Figure 62

Task 13.9

A researcher was interested in what factors contributed to injuries resulting from game console use. She tested 40 participants who were randomly assigned to either an active or static game played on either a Nintendo Switch or Xbox One Kinect. At the end of the session their physical condition was evaluated on an injury severity scale. The data are in the file xbox.csv, which contains the variables game (0 = static, 1 = active), console (0 = Switch, 1 = Xbox), and injury (a score ranging from 0 (no injury) to 20 (severe injury)). Fit a model to see whether injury severity is significantly predicted from the type of game, the type of console and their interaction.

Load the data using (see Tip 2):

xbox_tib <- discovr::xbox

Table 149 shows the model fitted using afex::aov_4(). The main effect of the type of game was significant, F(1, 36) = 25.86, p < 0.001, but the main effect of the console was not, F(1, 36) = 3.58, p = 0.067. The interaction was significant, F(1, 36) = 5.05, p = 0.031, and supersedes the main effects, so we’ll focus on it. Figure 63 and the simple effects analysis (Table 150) show that for active games there were significantly more injuries on the Switch than on the Xbox, F(1, 36) = 8.57, p = 0.012, but for static games the number of injuries was comparable on the two consoles, F(1, 36) = 0.06, p > 0.999.

xbox_afx <- afex::aov_4(injury ~ game*console + (1|id), data = xbox_tib)
model_parameters(xbox_afx) |> 
  display()
Table 149: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares df Mean_Square F p
game 184.90 1 184.90 25.86 < .001
console 25.60 1 25.60 3.58 0.067
game:console 36.10 1 36.10 5.05 0.031
Residuals 257.40 36 7.15

Anova Table (Type 3 tests)

ggplot(xbox_tib, aes(x = game, y = injury, colour = console, shape = console)) +
  geom_point(position = position_jitterdodge(dodge.width = 0.2, jitter.width = 0.2), alpha = 0.5) +
  stat_summary(fun.data = "mean_cl_normal", geom = "pointrange", position = position_dodge(width = 0.2)) +
  coord_cartesian(ylim = c(0, 20)) +
  scale_y_continuous(breaks = seq(0, 20, 1)) +
  scale_colour_viridis_d(begin = 0.3, end = 0.85) +
  labs(x = "Type of game", y = "Number of injuries", colour = "Console", shape = "Console") +
  theme_minimal()
Figure 63
xbox_se <- estimate_contrasts(model = xbox_afx,
                               contrast = "console",
                               by = "game",
                               comparison = "joint",
                               p_adjust = "bonferroni")

model_parameters(xbox_se) |> 
  display()
Table 150: Fixed Effects
Contrast game df1 df2 F p
console Active 1 36 8.57 0.012
console Static 1 36 0.06 > .999

p-value adjustment method: Bonferroni

Chapter 14

Task 14.1

Using the cosmetic surgery example, run the analysis described in Section 14.6.6 but also including BDI as a fixed effect predictor. What differences does including this predictor make?

Load the data using (see Tip 2):

cosmetic_tib <- discovr::cosmetic

To fit the model follow the instructions in section 14.6 of the book, except that the model itself should include months, base_qol, reason and bdi in the list of predictors. If you haven’t done so already, create the variable months as described in Section 14.6.5.

cosmetic_tib <- cosmetic_tib |>
  mutate(
    months = days*12/365
    )

Fit the model by adapting the code in Section 14.6.7. In that section we built the model up sequentially and compared the models at each step. We can therefore take the final model (cosmetic_mlm) as our starting point (because we’ve already seen the steps leading up to this model) and update it to also include bdi as a predictor. We can then use test_lrt() to compare the models. Table 151 shows that adding bdi significantly improves the fit of the model, \(\chi^2\)(1) = 55.49, p < 0.001. Table 152 also shows that including bdi explains an additional 1.02% of the variance.

Looking at the fixed effects (Table 153), bdi significantly predicted quality of life after surgery, \(\hat{b}\) = 0.18 (0.13, 0.23), z = 7.52, p < 0.001. Including bdi has not affected the interpretation of the other variables in the model. Obviously the parameter estimates themselves have changed, but how we’d interpret them is unchanged, except that we’re now interpreting each one at average levels of depression. Crucially, the months by reason interaction is still significant. We can follow this up using the same code as in the chapter to obtain the simple slopes in Table 154. For those who had surgery to change their appearance, quality of life increased over time but not significantly so, \(\hat{b}\) = 0.47 (-0.29, 1.23), z = 1.21, p = 0.227. In contrast, for those who had surgery to help with a physical problem, quality of life significantly increased over time, \(\hat{b}\) = 0.91 (0.15, 1.68), z = 2.33, p = 0.020. In essence, including BDI has made very little difference to the conclusions or to the parameter estimates for the substantive interaction.

cosmetic_mlm <- glmmTMB::glmmTMB(post_qol ~ months*reason + base_qol + (months|clinic), data = cosmetic_tib)
bdi_mlm <- update(cosmetic_mlm, .~. + bdi)

test_lrt(cosmetic_mlm, bdi_mlm) |> 
  display()
Table 151: Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)
Name Model df df_diff Chi2 p
cosmetic_mlm glmmTMB 9
bdi_mlm glmmTMB 10 1 55.49 < .001
compare_performance(cosmetic_mlm, bdi_mlm) |> 
  display()
Table 152
Name Model AIC (weights) AICc (weights) BIC (weights) R2 (cond.) R2 (marg.) ICC RMSE Sigma
cosmetic_mlm glmmTMB 11667.1 (<.001) 11667.2 (<.001) 11715.4 (<.001) 0.71 0.09 0.68 9.13 9.25
bdi_mlm glmmTMB 11613.6 (>.999) 11613.8 (>.999) 11667.3 (>.999) 0.72 0.10 0.69 8.97 9.08
model_parameters(bdi_mlm) |> 
  display()
Table 153: Model parameters (fixed and random effects)
Parameter Coefficient SE 95% CI z p
(Intercept) 22.01 3.94 (14.28, 29.73) 5.58 < .001
months 0.47 0.39 (-0.29, 1.23) 1.21 0.227
reason [Physical reason] -1.87 0.96 (-3.75, 0.01) -1.95 0.051
base qol 0.47 0.02 (0.43, 0.52) 19.99 < .001
bdi 0.18 0.02 (0.13, 0.23) 7.52 < .001
months × reason [Physical reason] 0.45 0.13 (0.20, 0.70) 3.49 < .001
Parameter Coefficient
SD (Intercept: clinic) 16.63
SD (months: clinic) 1.69
Cor (Intercept~months: clinic) -0.71
SD (Residual) 9.08
estimate_slopes(bdi_mlm,
                trend = "months",
                by = "reason",
                ci = 0.95) |> 
  display()
Table 154: Estimated Marginal Effects
reason Slope SE 95% CI z p
Change appearance 0.47 0.39 (-0.29, 1.23) 1.21 0.227
Physical reason 0.91 0.39 ( 0.15, 1.68) 2.33 0.020

Marginal effects estimated for months; Type of slope was dY/dX

Task 14.2

Miller et al. (2007) tested the ‘hidden-estrus’ theory, which suggests that unlike other female mammals, humans do not experience an ‘estrus’ phase during which they are more sexually receptive, proceptive, selective and attractive. If this theory is wrong then human heterosexual men should find women most attractive during the fertile phase of their menstrual cycle compared to the pre-fertile (menstrual) and post-fertile (luteal) phase. Miller used the tips obtained by dancers (id) at a lap dancing club as a proxy for their sexual attractiveness and also recorded the phase of the dancer’s menstrual cycle during a given shift (cyclephase), and whether they were using hormonal contraceptives (contraceptive). Dancers provided data from between 9 and 29 of their shifts. Fit a multilevel model using these data (miller_2007.csv) to see whether tips can be predicted from cyclephase, contraceptive and their interaction, allowing the overall level of tips to vary across dancers. Is the ‘hidden-estrus’ hypothesis supported?

Load the data using (see Tip 2):

miller_tib <- discovr::miller_2007

In the model that Miller et al. (2007) fitted, they did not assume that there would be random slopes (i.e., the relationship between each predictor and tips was not assumed to vary within lap dancers). This decision is appropriate for contraceptive because this variable didn’t vary at level 2 (the lap dancer was either taking contraceptives or not, so this could not be set up as a random effect because it doesn’t vary over our level 2 variable of participant). Also, because cyclephase is a categorical variable with three unordered categories we could not expect a linear relationship with tips: we expect tips to vary over categories but the categories themselves have no meaningful order. However, we might expect tips to vary over participants (some lap dancers will naturally get more money than others) and we can factor this variability in by allowing the intercept to be random. As such, we’re fitting a random intercept model to the data.

The substantive hypothesis is around an interaction between cyclephase and contraceptive and to get at this we need the main effects in the model, so let’s fit the full model (rather than building up sequentially) and look at the F tests based on Type III sums of squares, as shown in Table 155. A few things to note:

  • The authors report in the paper that they used restricted maximum-likelihood estimation (REML), so I have included REML = TRUE to override the default of maximum likelihood estimation.
  • If using Type III sums of squares we need categorical variables to have orthogonal contrasts. We can achieve this by manually setting a contrast for cyclephase and contraceptive using contrasts() (which is covered in the book). However, we can also do this directly within glmmTMB() using the contrasts argument. Note that I have included contrasts = list(contraceptive = "contr.sum", cyclephase = "contr.sum"). This sets a contrast for each variable included within list(); we could use any built-in contrast, but I have chosen "contr.sum" for both variables, which is an orthogonal contrast.
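As a sketch of the manual alternative (assuming, as in the chapter, that contraceptive and cyclephase are stored as factors), the same sum-to-zero contrasts could be set with contrasts() before fitting:

```r
miller_tib <- discovr::miller_2007

# Sum-to-zero (orthogonal) contrasts, as required for Type III sums of squares;
# contr.sum() takes the number of categories of each factor
contrasts(miller_tib$contraceptive) <- contr.sum(2)  # two categories
contrasts(miller_tib$cyclephase) <- contr.sum(3)     # three categories
```

With the contrasts set this way, the contrasts argument can be dropped from the glmmTMB() call.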
Important Write it up!

Miller et al. reported these results as follows:

“Main effects of cycle phase [F(2, 236) = 27.46, p < .001] and contraception use [F(1, 17) = 6.76, p = .019] were moderated by an interaction between cycle phase and pill use [F(2, 236) = 5.32, p = .005] (p. 378).”

Sadly, because R uses a chi-square statistic (not F) we can’t directly map Table 155 to these values.

Table 156 shows that about 51% of the variance in tips is explained by the model. Table 155 shows that the phase of the dancer’s cycle significantly predicted tip income, \(\chi^2\)(2) = 54.90, p < 0.001, and this effect interacted with whether the dancer was having natural cycles or was on the contraceptive pill, \(\chi^2\)(2) = 10.60, p = 0.005. Table 157 also tells us that the standard deviation in tips across dancers was 59.76, 95% CI [39.51, 90.39]. In other words, the average tip per dancer varied substantially, confirming the decision to treat the intercept as a random variable.

To tease the interaction apart we can use the contrasts in Table 158. A few things to note about the code:

  • The built-in contrast "trt.vs.ctrlk" compares each condition to the kth condition (e.g. the last). So cyclephase = "trt.vs.ctrlk" asks for each category of cyclephase to be compared to the last category. This variable is coded such that Fertile is the last category, which means this contrast compares every category to Fertile. I have done this because the authors’ predictions stated that the fertile phase should differ from the other phases: “human heterosexual men should find women most attractive during the fertile phase of their menstrual cycle compared to the pre-fertile (menstrual) and post-fertile (luteal) phase”
  • The built-in contrast "trt.vs.ctrl1" compares each condition to the first condition. It’s the same as setting ref = 1. So contraceptive = "trt.vs.ctrl1" asks for each category of contraceptive to be compared to the first category. In fact there are only two categories, so this is as good a comparison as any.

Table 158 shows these contrasts. The first contrast addresses the question: if we work out the difference in tips between the luteal and fertile phases, how much larger is that difference for dancers on the contraceptive pill than for those in a natural cycle? The answer is about $86 (the \(\hat{b}\)). In other words, there is a combined effect of cycle type (pill vs. natural) and phase (fertile vs. luteal), and this is significant, \(\bar{X}_\text{Diff}\) = 86.09 (26.81, 145.36), t(247) = 2.86, p = 0.005. The second contrast asks the same question for the menstrual phase: the difference in tips between the menstrual and fertile phases is about $90 larger for dancers on the pill than for those in a natural cycle. In other words, there is a combined effect of cycle type (pill vs. natural) and phase (fertile vs. menstrual), and this too is significant, \(\bar{X}_\text{Diff}\) = 89.94 (22.63, 157.25), t(247) = 2.63, p = 0.009.

To conclude, then, this study showed that the ‘hidden-estrus’ hypothesis is wrong: men did find women more attractive (as indexed by how many lap dances a woman performed and therefore how much she earned) during the fertile phase of their cycle compared to the other phases.

miller_mlm <- glmmTMB::glmmTMB(tips ~ contraceptive*cyclephase + (1|id),
                               data = miller_tib,
                               contrasts = list(contraceptive = "contr.sum", cyclephase = "contr.sum"),
                               REML = TRUE) 

car::Anova(miller_mlm, type = 3) |>
  model_parameters() |>
  display(use_symbols = TRUE)
Table 155
Parameter χ² df p
contraceptive 6.75 1 0.009
cyclephase 54.90 2 < .001
contraceptive:cyclephase 10.60 2 0.005

Anova Table (Type 3 tests)

model_performance(miller_mlm) |> 
  display()
Table 156
AIC AICc BIC R2 (cond.) R2 (marg.) ICC RMSE Sigma
3039.8 3040.4 3068.2 0.51 0.30 0.29 89.28 92.90
model_parameters(miller_mlm) |> 
  display()
Table 157: Model parameters (fixed and random effects)
Parameter Coefficient SE 95% CI z p
(Intercept) 224.76 15.96 (193.48, 256.03) 14.08 < .001
contraceptive1 41.47 15.96 (10.19, 72.76) 2.60 0.009
cyclephase1 -64.80 10.29 (-84.98, -44.63) -6.30 < .001
cyclephase2 3.72 9.17 (-14.25, 21.69) 0.41 0.685
contraceptive1 × cyclephase1 -15.63 10.29 (-35.81, 4.55) -1.52 0.129
contraceptive1 × cyclephase2 -13.71 9.17 (-31.68, 4.27) -1.49 0.135
Parameter Coefficient 95% CI
SD (Intercept: id) 59.76 (39.51, 90.39)
SD (Residual) 92.90 (84.84, 101.74)
estimate_contrasts(miller_mlm,
                   contrast = c("cyclephase", "contraceptive"),
                   interaction = c(cyclephase = "trt.vs.ctrlk", contraceptive = "trt.vs.ctrl1"),
                   p_adjust = "none",
                   backend = "emmeans") |> 
  display()
Table 158: Marginal Contrasts Analysis
cyclephase_trt.vs.ctrlk contraceptive_trt.vs.ctrl1 Difference 95% CI SE t(247) p
Luteal - Fertile Using contraceptives - Not using contraceptives 86.09 (26.81, 145.36) 30.09 2.86 0.005
Menstrual - Fertile Using contraceptives - Not using contraceptives 89.94 (22.63, 157.25) 34.17 2.63 0.009

Variable predicted: tips; Predictors contrasted: cyclephase, contraceptive; p-values are uncorrected.

Task 14.3

Hill et al. (2007) examined whether providing children with a leaflet based on the ‘theory of planned behaviour’ increased their exercise. There were four different interventions (intervention): a control group, a leaflet, a leaflet and quiz, and a leaflet and a plan. A total of 503 children from 22 different classrooms were sampled (classroom). The 22 classrooms were randomly assigned to the four different conditions. Children were asked ‘On average over the last three weeks, I have exercised energetically for at least 30 minutes ______ times per week’ after the intervention (post_exercise). Run a multilevel model analysis on these data (hill_2007.csv) to see whether the intervention affected the children’s exercise levels (the hierarchy is children within classrooms within interventions).

Load the data using (see Tip 2):

hill_tib <- discovr::hill_2007

Table 160 shows that about 8% of the variance in post-intervention exercise is explained by the model. However, Table 159 shows that post-intervention exercise was not significantly different across intervention groups, \(\chi^2\)(3) = 5.50, p = 0.139. In terms of whether these effects varied by classroom, the ICC (Table 160) was 5.6%, which is fairly small, and the standard deviation of post-intervention exercise across classrooms was 0.13, 95% CI [0.08, 0.22].

The result from these data could be that the intervention failed to affect exercise. However, there is a lot of individual variability in the amount of exercise people get. A better approach would be to take into account the amount of self-reported exercise prior to the study as a covariate, which leads us to the next task.

hill_mlm <- glmmTMB::glmmTMB(post_exercise ~ intervention + (1|classroom), data = hill_tib) 
test_lrt(hill_mlm) |> 
  display()
Table 159: Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)
Name Model df df_diff Chi2 p
Null model glmmTMB 3
Full model glmmTMB 6 3 5.50 0.139
model_performance(hill_mlm) |> 
  display()
Table 160
AIC AICc BIC R2 (cond.) R2 (marg.) ICC RMSE Sigma
836.9 837.0 862.2 0.08 0.03 0.06 0.53 0.54
model_parameters(hill_mlm) |> 
  display()
Table 161: Model parameters (fixed and random effects)
Parameter Coefficient SE 95% CI z p
(Intercept) 1.65 0.07 (1.51, 1.79) 22.99 < .001
intervention [Leaflet] 0.19 0.10 (-0.02, 0.39) 1.80 0.071
intervention [Leaflet + Quiz] 0.14 0.10 (-0.06, 0.33) 1.34 0.181
intervention [Leaflet + Plan] 0.25 0.11 (0.04, 0.46) 2.37 0.018
Parameter Coefficient 95% CI
SD (Intercept: classroom) 0.13 (0.08, 0.22)
SD (Residual) 0.54 (0.51, 0.57)
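As a sanity check, the ICC reported above can be recovered from the random-effects standard deviations in Table 161 (the between-classroom variance as a proportion of the total variance):

```r
sd_classroom <- 0.13  # SD (Intercept: classroom) from Table 161
sd_residual <- 0.54   # SD (Residual) from Table 161

# ICC = between-classroom variance / (between-classroom + residual variance)
icc <- sd_classroom^2 / (sd_classroom^2 + sd_residual^2)
icc  # roughly 0.055, i.e. about 5.6%
```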

Task 14.4

Repeat the analysis in Task 3 but include the pre-intervention exercise scores (pre_exercise) as a covariate. What difference does this make to the results?

Given we have already fitted a model for intervention that we now want to extend to include pre_exercise as a predictor, we can use update() to create the new model. Table 162 shows that adding pre_exercise to the model significantly improves the fit, \(\chi^2\)(1) = 449.20, p < 0.001. Table 163 shows that about 61% of the variance in post-intervention exercise is explained by the model, about 53 percentage points more than the previous model, a huge increase!

In terms of whether these effects varied by classroom, the ICC (Table 163) was 1.4%, which is fairly small, and the standard deviation of post-intervention exercise across classrooms was 0.04, 95% CI [0.01, 0.14].

Looking at the parameter estimates (Table 164) shows that the effect of pre-intervention exercise level is a significant predictor of post-intervention exercise level, \(\hat{b}\) = 0.72 (0.66, 0.77), z = 27.12, p < 0.001, and, most interestingly, the effects of intervention are all now significant. Specifically, at average levels of pre-intervention exercise compared to no intervention, the post-intervention exercise was significantly greater after a leaflet, \(\hat{b}\) = 0.16 (0.06, 0.26), z = 3.13, p = 0.002, after a leaflet and quiz, \(\hat{b}\) = 0.21 (0.11, 0.30), z = 4.16, p < 0.001, and after a leaflet and plan, \(\hat{b}\) = 0.22 (0.12, 0.32), z = 4.25, p < 0.001. These results show that when we adjust for the amount of self-reported exercise prior to the study, the intervention group becomes a significant predictor of post-intervention exercise levels.

hill_pre_mlm <- update(hill_mlm, .~. + pre_exercise)
test_lrt(hill_mlm, hill_pre_mlm) |> 
  display()
Table 162: Likelihood-Ratio-Test (LRT) for Model Comparison (ML-estimator)
Name Model df df_diff Chi2 p
hill_mlm glmmTMB 6
hill_pre_mlm glmmTMB 7 1 449.20 < .001
model_performance(hill_pre_mlm) |> 
  display()
Table 163
AIC AICc BIC R2 (cond.) R2 (marg.) ICC RMSE Sigma
… … … 0.61 … 0.01 … …
model_parameters(hill_pre_mlm) |> 
  display()
Table 164: Model parameters (fixed and random effects)
Parameter Coefficient SE 95% CI z p
intervention [Leaflet] 0.16 0.05 (0.06, 0.26) 3.13 0.002
intervention [Leaflet + Quiz] 0.21 0.05 (0.11, 0.30) 4.16 < .001
intervention [Leaflet + Plan] 0.22 0.05 (0.12, 0.32) 4.25 < .001
pre exercise 0.72 0.03 (0.66, 0.77) 27.12 < .001
Parameter Coefficient 95% CI
SD (Intercept: classroom) 0.04 (0.01, 0.14)

Chapter 15

Task 15.1

It is common that lecturers obtain reputations for being ‘hard’ or ‘light’ markers, but there is often little to substantiate these reputations. A group of students investigated the consistency of marking by submitting the same essays to four different lecturers. The outcome was the percentage mark given by each lecturer and the predictor was the lecturer who marked the report (tutor_marks.csv). Compute the F-statistic for the effect of marker by hand.

Load the data using (see Tip 2):

tutor_tib <- discovr::tutor_marks

There were eight essays, each marked by the same four lecturers. The data are in Table 165, which also shows the mean mark that each essay received and the variance of the marks for each essay. Now, the total variance within essay marks will be due in part to different lecturers marking (some are more critical and some more lenient), and in part to the fact that the essays themselves differ in quality (individual differences). Our job is to tease apart these sources.

Table 165: Mean and variance of each teacher’s marks
Essay id Tutor 1 Tutor 2 Tutor 3 Tutor 4 \(\overline{X}_\text{essay}\) \(s_\text{essay}^2\)
qegggt 62 58 63 64 61.75 6.92
nghnol 63 60 68 65 64.00 11.33
gcomqu 65 61 72 65 65.75 20.92
hphjhp 68 64 58 61 62.75 18.25
jrcbpi 69 65 54 59 61.75 43.58
acnlxu 71 67 65 50 63.25 84.25
tenwth 78 66 67 50 65.25 132.92
unwyka 75 73 75 45 67.00 216.00

The total sum of squares

The \(\text{SS}_\text{T}\) is calculated as:

\[ \text{SS}_\text{T} = \sum_{i=1}^{N} (x_i-\overline{X})^2 \]

To use this equation we need the overall mean of all marks (regardless of the essay marked or who marked it). Table 166 shows descriptive statistics for all marks. The grand mean (the mean of all scores) is 63.94.

Table 166: Descriptive statistics for all scores
Mean SD IQR Min Max n
63.94 7.42 7.75 45 78 32

To get the total sum of squares, we take each mark, subtract from it the mean of all scores (63.94) and square this difference (that’s the \((x_i-\overline{X})^2\) in the equation) to get the squared errors. Table 167 shows this process. We then add these squared differences to get the sum of squared error:

\[ \begin{aligned} \text{SS}_\text{T} &= 3.76 + 35.28 + 0.88 + 0.00 + 0.88 + 15.52 + 16.48 + 1.12 + \\ &\quad 1.12 + 8.64 + 64.96 + 1.12 + 16.48 + 0.00 + 35.28 + 8.64 +\\ &\quad 25.60 + 1.12 + 98.80 + 24.40 + 49.84 + 9.36 + 1.12 + 194.32 +\\ &\quad 197.68 + 4.24 + 9.36 + 194.32 + 122.32 + 82.08 + 122.32 + 358.72\\ &= 1705.76 \end{aligned} \]

The degrees of freedom for this sum of squares are \(N-1\), or 31.

Table 167: Total sum of squared errors
Essay id Tutor Mark Mean Error (score - mean) Error squared
qegggt Tutor 1 62 63.94 -1.94 3.76
qegggt Tutor 2 58 63.94 -5.94 35.28
qegggt Tutor 3 63 63.94 -0.94 0.88
qegggt Tutor 4 64 63.94 0.06 0.00
nghnol Tutor 1 63 63.94 -0.94 0.88
nghnol Tutor 2 60 63.94 -3.94 15.52
nghnol Tutor 3 68 63.94 4.06 16.48
nghnol Tutor 4 65 63.94 1.06 1.12
gcomqu Tutor 1 65 63.94 1.06 1.12
gcomqu Tutor 2 61 63.94 -2.94 8.64
gcomqu Tutor 3 72 63.94 8.06 64.96
gcomqu Tutor 4 65 63.94 1.06 1.12
hphjhp Tutor 1 68 63.94 4.06 16.48
hphjhp Tutor 2 64 63.94 0.06 0.00
hphjhp Tutor 3 58 63.94 -5.94 35.28
hphjhp Tutor 4 61 63.94 -2.94 8.64
jrcbpi Tutor 1 69 63.94 5.06 25.60
jrcbpi Tutor 2 65 63.94 1.06 1.12
jrcbpi Tutor 3 54 63.94 -9.94 98.80
jrcbpi Tutor 4 59 63.94 -4.94 24.40
acnlxu Tutor 1 71 63.94 7.06 49.84
acnlxu Tutor 2 67 63.94 3.06 9.36
acnlxu Tutor 3 65 63.94 1.06 1.12
acnlxu Tutor 4 50 63.94 -13.94 194.32
tenwth Tutor 1 78 63.94 14.06 197.68
tenwth Tutor 2 66 63.94 2.06 4.24
tenwth Tutor 3 67 63.94 3.06 9.36
tenwth Tutor 4 50 63.94 -13.94 194.32
unwyka Tutor 1 75 63.94 11.06 122.32
unwyka Tutor 2 73 63.94 9.06 82.08
unwyka Tutor 3 75 63.94 11.06 122.32
unwyka Tutor 4 45 63.94 -18.94 358.72
Total 1705.76

The within-participant sum of squares

The within-participant sum of squares, \(\text{SS}_\text{W}\), is calculated using the variance in marks for each essay, which are shown in Table 165. The ns are the number of scores on which the variances are based (i.e. in this case the number of marks each essay received, which was 4).

\[ \text{SS}_\text{W} = s_\text{essay 1}^2(n_1-1)+s_\text{essay 2}^2(n_2-1) + s_\text{essay 3}^2(n_3-1) +\ldots+ s_\text{essay 8}^2(n_8-1) \]

Using the values in Table 165 we get

\[ \begin{aligned} \text{SS}_\text{W} &= s_\text{essay 1}^2(n_1-1)+s_\text{essay 2}^2(n_2-1) + s_\text{essay 3}^2(n_3-1) +\ldots+ s_\text{essay 8}^2(n_8-1) \\ &= 6.92(4-1) + 11.33(4-1) + 20.92(4-1) + 18.25(4-1) + \\ &\quad 43.58(4-1) + 84.25(4-1) + 132.92(4-1) + 216.00(4-1)\\ &= 1602.51. \end{aligned} \]

The degrees of freedom for each essay are \(n-1\) (i.e. the number of marks per essay minus 1). To get the total degrees of freedom we add the df for each essay

\[ \begin{aligned} \text{df}_\text{W} &= df_\text{essay 1}+df_\text{essay 2} + df_\text{essay 3} +\ldots+ df_\text{essay 8} \\ &= (4-1) + (4-1) + (4-1) + (4-1) + (4-1) + (4-1) + (4-1) + (4-1)\\ &= 24 \end{aligned} \]

A shortcut would be to multiply the degrees of freedom per essay (3) by the number of essays (8): \(3 \times 8 = 24\)

The model sum of squares

We calculate the model sum of squares \(\text{SS}_\text{M}\) as:

\[ \text{SS}_\text{M} = \sum_{g = 1}^{k}n_g(\overline{x}_g-\overline{x}_\text{grand})^2 \]

Therefore, we need to subtract the mean of all marks (in Table 166) from the mean mark awarded by each tutor (in Table 168), then square these differences, multiply each by the number of essays marked and sum the results.

Table 168: Mean mark (and variance) awarded by each tutor
tutor Mean Variance
Tutor 1 68.88 31.84
Tutor 2 64.25 22.21
Tutor 3 65.25 47.93
Tutor 4 57.38 62.55

Using the values in Table 168, \(\text{SS}_\text{M}\) is

\[ \begin{aligned} \text{SS}_\text{M} &= 8(68.88 - 63.94)^2 + 8(64.25 - 63.94)^2 + 8(65.25 - 63.94)^2 + 8(57.38 - 63.94)^2\\ &= 554 \end{aligned} \] The degrees of freedom are the number of conditions (in this case the number of markers) minus 1, \(df_\text{M} = k-1 = 3\)

The residual sum of squares

We now know that there are about 1706 units of variation to be explained in our data, and that the variation within essays accounts for 1602.51 of those units. Of these 1602.51 units, our experimental manipulation can explain 554 units. The final sum of squares is the residual sum of squares (\(\text{SS}_\text{R}\)), which tells us how much of the variation cannot be explained by the model. Knowing \(\text{SS}_\text{W}\) and \(\text{SS}_\text{M}\) already, the simplest way to calculate \(\text{SS}_\text{R}\) is through subtraction

\[ \begin{aligned} \text{SS}_\text{R} &= \text{SS}_\text{W}-\text{SS}_\text{M}\\ &=1602.51-554\\ &=1048.51. \end{aligned} \]

The degrees of freedom are calculated in a similar way

\[ \begin{aligned} df_\text{R} &= df_\text{W}-df_\text{M}\\ &= 24-3\\ &= 21. \end{aligned} \]

The mean squares

Next, convert the sums of squares to mean squares by dividing by their degrees of freedom

\[ \begin{aligned} \text{MS}_\text{M} &= \frac{\text{SS}_\text{M}}{df_\text{M}} = \frac{554}{3} = 184.67 \\ \text{MS}_\text{R} &= \frac{\text{SS}_\text{R}}{df_\text{R}} = \frac{1048.51}{21} = 49.93. \\ \end{aligned} \]

The F-statistic

The F-statistic is calculated by dividing the model mean squares by the residual mean squares:

\[ F = \frac{\text{MS}_\text{M}}{\text{MS}_\text{R}} = \frac{184.67}{49.93} = 3.70. \]

This value of F can be compared against a critical value based on its degrees of freedom (which are 3 and 21 in this case).
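These hand calculations can be checked in R using the raw marks from Table 165 (a quick sketch; the sums of squares come out marginally different from the worked example, which rounds the grand mean to 63.94):

```r
# Marks for the eight essays (rows of Table 165), with tutors 1-4 in columns
marks <- matrix(c(62, 58, 63, 64,
                  63, 60, 68, 65,
                  65, 61, 72, 65,
                  68, 64, 58, 61,
                  69, 65, 54, 59,
                  71, 67, 65, 50,
                  78, 66, 67, 50,
                  75, 73, 75, 45),
                nrow = 8, byrow = TRUE)

ss_t <- sum((marks - mean(marks))^2)                # total SS, ~1705.88
ss_w <- sum(apply(marks, 1, var) * (4 - 1))         # within-essay SS, 1602.5
ss_m <- sum(8 * (colMeans(marks) - mean(marks))^2)  # model SS, ~554.13
ss_r <- ss_w - ss_m                                 # residual SS, ~1048.38

f_stat <- (ss_m / 3) / (ss_r / 21)                  # F, ~3.70
```

Reassuringly, the model and residual sums of squares match those in Table 169, and the F-statistic matches the hand calculation.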

Task 2

Repeat the analysis for Task 1 using R and interpret the results.

We already loaded the data in Task 1.

Figure 64 shows the mean marks for each tutor as well as the distribution of marks. Tutors 1 to 3 have a lot of overlap in their distributions and have similar average marks, whereas Tutor 4 seems a little lower on average (although the distribution still overlaps a lot with the other tutors).

ggplot(tutor_tib, aes(x = tutor, y = mark)) +
  geom_violin(colour = "#28BBECFF", fill = "#28BBECFF", alpha = 0.1) +
  stat_summary(fun.data = "mean_cl_normal", colour = "#28BBECFF") +
  coord_cartesian(ylim = c(0, 80)) +
  scale_y_continuous(breaks = seq(0, 80, 10)) +
  labs(y = "Mark (%)", x = "Marker") +
  theme_minimal()
Figure 64: Violin plot of essay marks by tutor

Table 169 shows the main summary table from the model. We would conclude that tutors did not significantly differ in the marks they awarded, F(1.67, 11.71) = 3.70, p = 0.063. In light of this non-significance, no follow-up tests are necessary.

Caution Write it up!

Using Greenhouse-Geisser corrected degrees of freedom, there was no significant difference in the marks awarded by different tutors to the essays, F(1.67, 11.71) = 3.70, p = 0.063. However, this lack of significance most likely reflects the small sample size because the effect of markers on the marks awarded was relatively strong, \(\eta_p^2\) = 0.35.

tutor_afx <- afex::aov_4(mark ~ tutor + (tutor|id), data = tutor_tib)
model_parameters(tutor_afx) |>
  display()
Table 169: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares Sum_Squares_Error df df (error) Mean_Square F p
tutor 554.13 1048.38 1.67 11.71 89.53 3.70 0.063

Anova Table (Type 3 tests)
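For reference, the partial eta-squared quoted in the write-up can be recovered from the sums of squares in Table 169, using the standard formula \(\eta_p^2 = \text{SS}_\text{M}/(\text{SS}_\text{M} + \text{SS}_\text{R})\):

```r
ss_model <- 554.13    # tutor sum of squares from Table 169
ss_error <- 1048.38   # error sum of squares from Table 169

ss_model / (ss_model + ss_error)  # ~0.35, the partial eta-squared
```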

Task 3

Calculate the effect sizes for the analysis in Task 1

The easiest way to do this is to add es_type = "omega" into model_parameters() so that the effect sizes are added to the summary table as in Table 170.

model_parameters(tutor_afx, es_type = "omega") |>
  display(use_symbols = TRUE)
Table 170: ANOVA estimation for factorial designs using ‘afex’
Parameter Sum_Squares Sum_Squares_Error df df (error) Mean_Square F p ω² (partial)
tutor 554.13 1048.38 1.67 11.71 89.53 3.70 0.063 0.24

Anova Table (Type 3 tests)

Alternatively, we could pass the model (tutor_afx) into omega_squared() using the code below

omega_squared(tutor_afx) |> 
  display()
Effect Size for ANOVA (Type III)
Parameter Omega2 (partial) 95% CI
tutor 0.24 [0.00, 1.00]

One-sided CIs: upper bound fixed at [1.00].

Either way we get the same result, which shows a fairly substantial effect size, \(\hat{\omega}_p^2\) = 0.24 [0.00, 1.00]. In light of this, we might amend how we report this model.

Caution Write it up!

Using Greenhouse-Geisser corrected degrees of freedom, there was no significant difference in the marks awarded by different tutors to the essays, F(1.67, 11.71) = 3.70, p = 0.063. However, this lack of significance most likely reflects the small sample size because the effect of markers on the marks awarded was relatively strong, \(\hat{\omega}_p^2\) = 0.24 [0.00, 1.00].

Task 4

In the previous chapter we came across the beer-goggles effect. In that chapter, we saw that the beer-goggles effect was stronger for unattractive faces. We took a follow-up sample of 26 people and gave them doses of alcohol (0 pints, 2 pints, 4 pints and 6 pints of lager) over four different weeks. We asked them to rate a bunch of photos of unattractive faces in either dim or bright lighting. The outcome measure was the mean attractiveness rating (out of 100) of the faces and the predictors were the dose of alcohol and the lighting conditions (goggles_lighting.csv). Do alcohol dose and lighting interact to magnify the beer goggles effect?

Load the data using (see Tip 2):

goggle_tib <- discovr::goggles_lighting

Table 171 shows the main summary table from the model and Figure 65 shows the interaction plot. The plot shows a similar profile of ratings after 0 and 2 pints in the dim and bright environments. However, at 4 and 6 pints the attractiveness ratings decline faster in dim lighting than in bright lighting.

From Table 171, the main effect of lighting shows that the attractiveness ratings of photos were significantly lower when the lighting was dim compared to when it was bright, F(1, 25) = 23.42, p < 0.001, \(\hat{\omega}^2_p\) = 0.25. The main effect of alcohol shows that the attractiveness ratings of photos of faces were significantly affected by how much alcohol was consumed, F(2.62, 65.47) = 104.39, p < 0.001, \(\hat{\omega}^2_p\) = 0.75. However, both of these effects are superseded by the interaction, which shows that the effect that alcohol had on ratings of attractiveness was significantly moderated by the brightness of the lighting, F(2.81, 70.23) = 22.22, p < 0.001, \(\hat{\omega}^2_p\) = 0.35. To interpret this effect let’s move on to the next task.

goggles_afx <- afex::aov_4(rating ~ lighting*alcohol + (lighting*alcohol|id), data = goggle_tib)
model_parameters(goggles_afx, es_type = "omega") |>
  display(use_symbols = TRUE)
Table 171: ANOVA estimation for factorial designs using ‘afex’

| Parameter | Sum_Squares | Sum_Squares_Error | df | df (error) | Mean_Square | F | p | ω² (partial) |
|---|---|---|---|---|---|---|---|---|
| lighting | 1993.92 | 2128.33 | 1.00 | 25.00 | 85.13 | 23.42 | < .001 | 0.25 |
| alcohol | 38591.65 | 9242.60 | 2.62 | 65.47 | 141.18 | 104.39 | < .001 | 0.75 |
| lighting:alcohol | 5765.42 | 6487.33 | 2.81 | 70.23 | 92.37 | 22.22 | < .001 | 0.35 |

Anova Table (Type 3 tests)

afex::afex_plot(object = goggles_afx,
                x = "alcohol",
                trace = "lighting",
                legend_title = "Lighting condition",
                error = "within",
                mapping = c("shape", "color")) + 
  labs(x = "Dose of alcohol", y = "Attractiveness rating (0-100)") +
  scale_color_viridis_d(begin = 0.2, end = 0.85, option = "turbo") +
  theme_minimal()
Figure 65: Interaction plot of attractiveness ratings by dose of alcohol and lighting condition

Task 5

Interpret the simple effect of alcohol at different levels of lighting.

The interaction effect is shown in Figure 65. Table 172 shows the simple effect of alcohol at different levels of lighting. These analyses are not particularly helpful because they show that alcohol significantly affected attractiveness ratings both when lights were dim, F(3, 25) = 169.88, p < 0.001, and when they were bright, F(3, 25) = 39.12, p < 0.001. This is an example where it might be worth looking at the alternative simple effects, that is, the simple effect of lighting within each dose of alcohol. These effects are shown in Table 173 and they are somewhat more useful because they show that lighting did not have a significant effect on attractiveness ratings after no alcohol, F(1, 25) = 1.49, p = 0.937, or after 2 pints of lager, F(1, 25) = 4.56, p = 0.171, but did have a significant effect after 4, F(1, 25) = 27.17, p < 0.001, and 6, F(1, 25) = 55.59, p < 0.001, pints. Basically the effect of lighting got stronger as the alcohol dose increased: you can see this in Figure 65 by the circle and triangle getting further apart as we move up the doses of alcohol.

estimate_contrasts(model = goggles_afx,
                   contrast = "alcohol",
                   by = "lighting",
                   comparison = "joint",
                   p_adjust = "bonferroni") |> 
  display()
Table 172: Marginal Joint Test

| Contrast | lighting | df1 | df2 | F | p |
|---|---|---|---|---|---|
| alcohol | Dim | 3 | 25 | 169.88 | < .001 |
| alcohol | Bright | 3 | 25 | 39.12 | < .001 |

Predictors averaged: id; p-value adjustment method: Bonferroni

estimate_contrasts(model = goggles_afx,
                   contrast = "lighting",
                   by = "alcohol",
                   comparison = "joint",
                   p_adjust = "bonferroni") |> 
  display()
Table 173: Marginal Joint Test

| Contrast | alcohol | df1 | df2 | F | p |
|---|---|---|---|---|---|
| lighting | None | 1 | 25 | 1.49 | 0.937 |
| lighting | X2.pints | 1 | 25 | 4.56 | 0.171 |
| lighting | X4.pints | 1 | 25 | 27.17 | < .001 |
| lighting | X6.pints | 1 | 25 | 55.59 | < .001 |

Predictors averaged: id; p-value adjustment method: Bonferroni
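For reference, the Bonferroni method requested in `estimate_contrasts()` simply multiplies each p-value by the number of comparisons (capping the result at 1). Base R’s `p.adjust()` does the same thing; here with some hypothetical p-values:

```r
# Bonferroni adjustment: each p-value is multiplied by the number of
# tests, capped at 1. The p-values below are hypothetical.
p_raw <- c(0.01, 0.02, 0.04)
p.adjust(p_raw, method = "bonferroni")  # 0.03 0.06 0.12
```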

Caution Write it up!

The lighting by alcohol interaction was significant, F(2.81, 70.23) = 22.22, p < 0.001, \(\hat{\omega}^2_p\) = 0.35, indicating that the effect of alcohol on the ratings of the attractiveness of faces differed when lighting was dim compared to when it was bright. The simple effects of lighting within alcohol dose revealed that the effect of lighting on attractiveness ratings got stronger with alcohol dose. Specifically, lighting did not have a significant effect on attractiveness ratings after no alcohol, F(1, 25) = 1.49, p = 0.937, or 2 pints of lager, F(1, 25) = 4.56, p = 0.171, but did have a significant effect after 4, F(1, 25) = 27.17, p < 0.001, and 6, F(1, 25) = 55.59, p < 0.001, pints.

Task 6

Early in my career I looked at the effect of giving children information about entities. In one study (Field, 2006), I used three novel entities (the quoll, quokka and cuscus) and children were told negative things about one of the entities, positive things about another, and given no information about the third (our control). After the information I asked the children to place their hands in three wooden boxes each of which they believed contained one of the aforementioned entities (field_2006.csv). Draw and interpret error bar and Q-Q plots of the data.

Load the data using (see Tip 2):

field_tib <- discovr::field_2006

Figure 66 shows the mean latency to approach each box. Children were slower to approach the box they believed contained an animal about which they’d heard threat information. The Q-Q plots (Figure 67) show that latencies are very non-normal for all three boxes. This will be, in part, because if a child hadn’t approached the box within 15 seconds we (for ethical reasons) assumed that they did not want to complete the task, so we assigned a score of 15 and asked them to approach the next box. These days, given these data, I’d use a robust test (Task 8!), but back when I conducted this research these tests were not so readily available, so I log-transformed the scores to reduce the skew (a practice I’m not keen on these days). This brings us onto Task 7!

ggplot(field_tib, aes(x = info_type, y = latency)) +
  stat_summary(fun.data = "mean_cl_normal", colour = "#28BBECFF") +
  coord_cartesian(ylim = c(0, 10)) +
  scale_y_continuous(breaks = seq(0, 10, 1)) +
  labs(y = "Latency to approach (s)", x = "Information type") +
  theme_minimal()
Figure 66: Error bar plot of latency to approach the box by information condition
ggplot(field_tib, aes(sample = latency)) +
  qqplotr::stat_qq_band(fill = "#C4F134FF", alpha = 0.3) +
  qqplotr::stat_qq_line(colour = "#C4F134FF") +
  qqplotr::stat_qq_point(colour = "#3E9BFEFF", alpha = 0.5, size = 1) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles") +
  facet_wrap(~info_type, scales = "free") +
  theme_minimal()
Figure 67: Q-Q plots of latency to approach the box by information condition

Task 7

Log-transform the scores in Task 6, make a Q-Q plot of the transformed scores and interpret it.

First we use mutate() to transform the latency scores.

field_tib <- field_tib |> 
  mutate(
    ln_latency = log(latency)
  )

We obtain the Q-Q plots using the same code except that we change the variable to be plotted from latency to ln_latency. The Q-Q plots (Figure 68) show that log latencies for all boxes are a lot more normally distributed than the raw latencies but the positive and no information latencies still show a lack of normality. This brings us onto Task 8!

ggplot(field_tib, aes(sample = ln_latency)) +
  qqplotr::stat_qq_band(fill = "#C4F134FF", alpha = 0.3) +
  qqplotr::stat_qq_line(colour = "#C4F134FF") +
  qqplotr::stat_qq_point(colour = "#3E9BFEFF", alpha = 0.5, size = 1) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles") +
  facet_wrap(~info_type, scales = "free") +
  theme_minimal()
Figure 68: Q-Q plots of the log latency to approach the box by information condition
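To see why the log transform helps, here’s a minimal simulation (not the chapter’s data): scores with a long right tail, like latencies, become much more symmetric once logged.

```r
# Simulated positively skewed 'latencies' (log-normal), not the real data.
set.seed(666)
latency    <- rlnorm(200, meanlog = 1, sdlog = 0.8)
ln_latency <- log(latency)

# With a long right tail the mean sits well above the median;
# after logging, the two are much closer together.
mean(latency) - median(latency)
mean(ln_latency) - median(ln_latency)
```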

Task 8

Analyse the data in Task 7 with a robust model. Do children take longer to put their hands in a box that they believe contains an entity about which they have been told nasty things?

The results are shown below and they mirror the analysis that I conducted on the log-transformed values in the paper itself (in case you want to check). The main effect of the type of information was significant, F(1.24, 94.32) = 78.15, p < 0.001. The post hoc tests show a significantly longer time to approach the box containing the negative information animal compared to the positive information animal, \(\hat{\psi}\) = 2.42, p_observed < 0.001, p_crit = 0.017, and compared to the no information box, \(\hat{\psi}\) = 2.07, p_observed < 0.001, p_crit = 0.025. Children also approached the box containing the positive information animal significantly faster than the no information animal, \(\hat{\psi}\) = -0.21, p_observed = 0.014, p_crit = 0.050.

WRS2::rmanova(y = field_tib$latency, groups = field_tib$info_type, blocks = field_tib$id)
Call:
WRS2::rmanova(y = field_tib$latency, groups = field_tib$info_type, 
    blocks = field_tib$id)

Test statistic: F = 78.1521 
Degrees of freedom 1: 1.24 
Degrees of freedom 2: 94.32 
p-value: 0 
WRS2::rmmcp(y = field_tib$latency, groups = field_tib$info_type, blocks = field_tib$id)
Call:
WRS2::rmmcp(y = field_tib$latency, groups = field_tib$info_type, 
    blocks = field_tib$id)

                              psihat ci.lower ci.upper p.value p.crit  sig
Threat vs. Positive          2.41558  1.71695  3.11421 0.00000 0.0169 TRUE
Threat vs. No information    2.07013  1.35313  2.78713 0.00000 0.0250 TRUE
Positive vs. No information -0.20597 -0.40537 -0.00658 0.01351 0.0500 TRUE
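A side note on what ‘robust’ means here: `rmanova()` and `rmmcp()` compare 20% trimmed means by default (their `tr` argument is 0.2), which tames the influence of the ceiling scores of 15. Base R’s `mean()` illustrates the idea on some hypothetical latencies:

```r
# A 20% trimmed mean discards the most extreme 20% of scores in each
# tail before averaging. Hypothetical latencies with two ceiling scores:
x <- c(1, 2, 2, 3, 3, 4, 4, 5, 15, 15)
mean(x)              # 5.4 -- dragged up by the two 15s
mean(x, trim = 0.2)  # 3.5 -- the trimmed mean barely notices them
```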

Chapter 16

Coming soon

Chapter 17

Coming soon

Chapter 18

Coming soon

Chapter 19

Coming soon

References

Coldwell, J., Pike, A., & Dunn, J. (2006). Household chaos – links with parenting and child behaviour. Journal of Child Psychology and Psychiatry, 47(11), 1116–1122. https://doi.org/10.1111/j.1469-7610.2006.01655.x
Feng, L., Gwee, X., Kua, E. H., & Ng, T. P. (2010). Cognitive function and tea consumption in community dwelling older Chinese in Singapore. Journal of Nutrition Health & Aging, 14, 433–438.
Field, A. P. (2006). The behavioral inhibition system and the verbal information pathway to children’s fears. Journal of Abnormal Psychology, 115, 742–752. https://doi.org/10.1037/0021-843x.115.4.742
Field, A. P. (2010). Non-sadistical methods for teaching statistics. In D. Upton & A. Trapp (Eds.), Teaching psychology in higher education (pp. 134–163). John Wiley & Sons, Ltd. https://doi.org/10.1002/9781444320732.ch6
Field, A. P. (2026). Discovering statistics using R and RStudio (2nd ed.). Sage.
Hill, C., Abraham, C., & Wright, D. B. (2007). Can theory-based messages in combination with cognitive prompts promote exercise in classroom settings? Social Science & Medicine, 65, 1049–1058.
Lambert, N. M., Negash, S., Stillman, T. F., Olmstead, S. B., & Fincham, F. D. (2012). A love that doesn’t last: pornography consumption and weakened commitment to one’s romantic partner. Journal of Social and Clinical Psychology, 31, 410–438. https://doi.org/10.1521/jscp.2012.31.4.410
McNulty, J. K., Neff, L. A., & Karney, B. R. (2008). Beyond initial attraction: physical attractiveness in newlywed marriage. Journal of Family Psychology, 22, 135–143. https://doi.org/10.1037/0893-3200.22.1.135
Miller, G., Tybur, J. M., & Jordan, B. D. (2007). Ovulatory cycle effects on tip earnings by lap dancers: economic evidence for human estrus? Evolution and Human Behavior, 28, 375–381. https://doi.org/10.1016/j.evolhumbehav.2007.06.002
Ong, E. Y. L., Ang, R. P., Ho, J. C. M., Lim, J. C. Y., Goh, D. H., Lee, C. S., & Chua, A. Y. K. (2011). Narcissism, extraversion and adolescents’ self-presentation on Facebook. Personality and Individual Differences, 50, 180–185. https://doi.org/10.1016/j.paid.2010.09.022
Oxoby, R. J. (2008). On the efficiency of AC/DC: Bon Scott versus Brian Johnson. Economic Inquiry, 47, 598–602. https://doi.org/10.1111/j.1465-7295.2008.00138.x
Payne-Carter, S., Greenberg, K., & Walker, M. (2016). The impact of computer usage on academic performance: evidence from a randomized trial at the united states military academy. MIT Department of Economics.
Selfhout, M. H. W., Delsing, M. J. M. H., ter Bogt, T. F. M., & Meeus, W. H. J. (2008). Heavy metal and hip-hop style preferences and externalizing problem behavior: a two-wave longitudinal study. Youth & Society, 39(4), 435–452. https://doi.org/10.1177/0044118X07308069
Zhang, S., Schmader, T., & Hall, W. M. (2013). L’eggo my ego: reducing the gender gap in math by unlinking the self from performance. Self and Identity, 12, 400–412. https://doi.org/10.1080/15298868.2012.687012