R code Chapter 7

Mae Jemstone character from Discovering Statistics using R and RStudio

This document contains abridged sections from Discovering Statistics Using R and RStudio by Andy Field so there are some copyright considerations. You can use this material for teaching and non-profit activities but please do not meddle with it or claim it as your own work. See the full license terms at the bottom of the page.

Load the data

Remember to install the package for the book with library(discovr). After which you can load data into a tibble by executing:

name_of_tib <- discovr::name_of_data

For example, execute these lines to create the tibbles referred to in the chapter:

cats_tib <- discovr::roaming_cats
exam_tib <- discovr::exam_anxiety
liar_tib <- discovr::biggest_liar

If you want to read the file from the CSV and you have set up your project folder as I suggest in Chapter 1, then the general format of the code you would use is:

tibble_name <- here::here("../data/name_of_file.csv") %>%
  readr::read_csv() %>%
  dplyr::mutate(
    ...
    code to convert character variables to factors
    ...
  )

In which you’d replace tibble_name with the name you want to assign to the tibble and change name_of_file.csv to the name of the file that you’re trying to load. You can use mutate to convert categorical variables to factors. For example, for the exam_anxiety data you would load it by executing:

exam_tib <- here::here("data/exam_anxiety.csv") %>%
  readr::read_csv() %>%
  dplyr::mutate(
    sex = forcats::as_factor(sex)
  )

Bivariate correlations

Plot the data

GGally::ggscatmat(exam_tib, columns = c("exam_grade", "revise", "anxiety")) +
  theme_minimal()

Pearson’s r

A single correlation:

exam_tib %>% 
  dplyr::select(exam_grade, revise) %>% 
  correlation::correlation()

## Parameter1 | Parameter2 |    r |       95% CI |    t |  df |      p |  Method | n_Obs
## -------------------------------------------------------------------------------------
## exam_grade |     revise | 0.40 | [0.22, 0.55] | 4.34 | 101 | < .001 | Pearson |   103

Let’s scale this up to include exam anxiety:

exam_cor <- exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  correlation::correlation()

exam_cor

## Parameter1 | Parameter2 |     r |         95% CI |      t |  df |      p |  Method | n_Obs
## ------------------------------------------------------------------------------------------
## exam_grade |     revise |  0.40 | [ 0.22,  0.55] |   4.34 | 101 | < .001 | Pearson |   103
## exam_grade |    anxiety | -0.44 | [-0.58, -0.27] |  -4.94 | 101 | < .001 | Pearson |   103
## revise     |    anxiety | -0.71 | [-0.79, -0.60] | -10.11 | 101 | < .001 | Pearson |   103

Change the number of decimal places

exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  correlation::correlation(digits = 3, ci_digits = 3)

## Parameter1 | Parameter2 |      r |           95% CI |       t |  df |      p |  Method | n_Obs
## ----------------------------------------------------------------------------------------------
## exam_grade |     revise |  0.397 | [ 0.220,  0.548] |   4.343 | 101 | < .001 | Pearson |   103
## exam_grade |    anxiety | -0.441 | [-0.585, -0.271] |  -4.938 | 101 | < .001 | Pearson |   103
## revise     |    anxiety | -0.709 | [-0.794, -0.598] | -10.111 | 101 | < .001 | Pearson |   103

Matrix format

exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  correlation::correlation() %>% 
  summary()

## Parameter  |  anxiety |  revise
## -------------------------------
## exam_grade | -0.44*** | 0.40***
## revise     | -0.71*** |

Coefficient of determination

exam_cor <- exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  correlation::correlation()

(exam_cor$r)^2

## [1] 0.1573873 0.1944752 0.5030345

Robust correlations

exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  WRS2::winall()

## Call:
## WRS2::winall(x = .)
## 
## Robust correlation matrix:
##            exam_grade  revise anxiety
## exam_grade     1.0000  0.3085 -0.3913
## revise         0.3085  1.0000 -0.6015
## anxiety       -0.3913 -0.6015  1.0000
## 
## p-values:
##            exam_grade  revise anxiety
## exam_grade         NA 0.00183   7e-05
## revise        0.00183      NA   0e+00
## anxiety       0.00007 0.00000      NA

exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  correlation::correlation(., method = "percentage")

## Parameter1 | Parameter2 |     r |         95% CI |     t |  df |      p |          Method | n_Obs
## -------------------------------------------------------------------------------------------------
## exam_grade |     revise |  0.34 | [ 0.15,  0.50] |  3.60 | 101 | < .001 | Percentage Bend |   103
## exam_grade |    anxiety | -0.40 | [-0.55, -0.23] | -4.41 | 101 | < .001 | Percentage Bend |   103
## revise     |    anxiety | -0.61 | [-0.72, -0.47] | -7.66 | 101 | < .001 | Percentage Bend |   103

Spearman’s rho

Load the Biggest Liar data from the CSV file (assuming you’ve set up a project as described in chapter 1)

liar_tib <- readr::read_csv("../data/biggest_liar.csv") %>% 
  dplyr::mutate(
    novice = forcats::as_factor(novice)
  )

liar_tib %>%
  dplyr::select(position, creativity) %>% 
  correlation::correlation(method = "spearman")

## Parameter1 | Parameter2 |   rho |         95% CI |        S |     p |   Method | n_Obs
## --------------------------------------------------------------------------------------
## position   | creativity | -0.38 | [-0.56, -0.15] | 72123.51 | 0.002 | Spearman |    68

Kendall’s tau

liar_tib %>% 
  dplyr::select(position, creativity) %>% 
  correlation::correlation(method = "kendall")

## Parameter1 | Parameter2 |         95% CI |   tau |     z |     p |  Method | n_Obs
## ----------------------------------------------------------------------------------
## position   | creativity | [-0.50, -0.07] | -0.30 | -3.24 | 0.001 | Kendall |    68

Bootstrapped confidence intervals

A quick aside, let’s look at the code for writing a function to print the mean:

print_mean <- function(variable){
	mean <- sum(variable)/nrow(variable)
       cat("Mean = ", mean)
}

exam_tib %>% 
  dplyr::select(exam_grade) %>%
  print_mean()

## Mean =  56.57282

Or a sillier version:

print_mean <- function(harry_the_hungy_hippo){
	mean <- sum(harry_the_hungy_hippo)/ nrow(harry_the_hungy_hippo)
       cat("Harry the hungry hippo say that the mean = ", mean)
}

exam_tib %>% 
  dplyr::select(exam_grade) %>%
  print_mean()

## Harry the hungry hippo say that the mean =  56.57282

Here’s a function to bootstrap Pearson’s r:

boot_r <- function(data, i){
  cor(data[i, "exam_grade"], data[i, "revise"])
}

grade_revise_bs <- boot::boot(exam_tib, boot_r, R = 2000)

grade_revise_bs

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = exam_tib, statistic = boot_r, R = 2000)
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.3967207 -0.001476077  0.06983568

boot::boot.ci(grade_revise_bs)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot::boot.ci(boot.out = grade_revise_bs)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.2613,  0.5351 )   ( 0.2673,  0.5424 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.2510,  0.5261 )   ( 0.2530,  0.5263 )  
## Calculations and Intervals on Original Scale

A general bootstrap function

boot_r <- function(data, var1, var2, i){
  cor(data[i, var1], data[i, var2])
}

grade_revise_bs <- boot::boot(exam_tib, boot_r, var1 = "exam_grade", var2 = "revise", R = 2000)
grade_revise_bs

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = exam_tib, statistic = boot_r, R = 2000, var1 = "exam_grade", 
##     var2 = "revise")
## 
## 
## Bootstrap Statistics :
##      original       bias    std. error
## t1* 0.3967207 -0.001593843  0.06788372

boot::boot.ci(grade_revise_bs)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot::boot.ci(boot.out = grade_revise_bs)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.2653,  0.5314 )   ( 0.2732,  0.5405 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.2529,  0.5202 )   ( 0.2518,  0.5198 )  
## Calculations and Intervals on Original Scale

grade_anx_bs <- boot::boot(exam_tib, boot_r, var1 = "exam_grade", var2 = "anxiety", R = 2000)
grade_anx_bs

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = exam_tib, statistic = boot_r, R = 2000, var1 = "exam_grade", 
##     var2 = "anxiety")
## 
## 
## Bootstrap Statistics :
##       original      bias    std. error
## t1* -0.4409934 0.002270827   0.0644994

boot::boot.ci(grade_anx_bs)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot::boot.ci(boot.out = grade_anx_bs)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-0.5697, -0.3168 )   (-0.5767, -0.3235 )  
## 
## Level     Percentile            BCa          
## 95%   (-0.5585, -0.3053 )   (-0.5609, -0.3088 )  
## Calculations and Intervals on Original Scale

revise_anx_bs <- boot::boot(exam_tib, boot_r, var1 = "revise", var2 = "anxiety", R = 2000)
revise_anx_bs

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot::boot(data = exam_tib, statistic = boot_r, R = 2000, var1 = "revise", 
##     var2 = "anxiety")
## 
## 
## Bootstrap Statistics :
##       original      bias    std. error
## t1* -0.7092493 0.001022998    0.108373

boot::boot.ci(revise_anx_bs)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot::boot.ci(boot.out = revise_anx_bs)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-0.9227, -0.4979 )   (-0.9525, -0.5483 )  
## 
## Level     Percentile            BCa          
## 95%   (-0.8702, -0.4660 )   (-0.8530, -0.3760 )  
## Calculations and Intervals on Original Scale

Bootrapping Spearman’s rho and Kendall’s tau

boot_r <- function(data, var1, var2, method = "pearson", i){
  cor(data[i, var1], data[i, var2], method = method)
}

liar_bs <- boot::boot(liar_tib, boot_r, var1 = "position", var2 = "creativity", method = "spearman", R = 2000)
boot::boot.ci(liar_bs)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot::boot.ci(boot.out = liar_bs)
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   (-0.6204, -0.1376 )   (-0.6294, -0.1555 )  
## 
## Level     Percentile            BCa          
## 95%   (-0.5976, -0.1238 )   (-0.5818, -0.1042 )  
## Calculations and Intervals on Original Scale

Point-biserial correlation

If you need to read the data in from the csv file:

cats_tib <- readr::read_csv("../data/roaming_cats.csv") %>% 
  dplyr::mutate(
    sex = forcats::as_factor(sex)
  )

cats_tib <- cats_tib %>% 
  dplyr::mutate(
    sex_bin = ifelse(sex == "Male", 0, 1),
    sex_bin_recode = ifelse(sex == "Male", 1, 0)
  )

cats_tib %>% 
  dplyr::select(time, sex_bin) %>%
  correlation::correlation()

## Parameter1 | Parameter2 |     r |         95% CI |     t | df |     p |  Method | n_Obs
## ---------------------------------------------------------------------------------------
## time       |    sex_bin | -0.38 | [-0.58, -0.14] | -3.11 | 58 | 0.003 | Pearson |    60

coefficient of determination

r_cat <- cats_tib %>% 
  dplyr::select(time, sex_bin) %>%
  correlation::correlation()

(r_cat$r)^2

## [1] 0.1432276

cats_tib %>% 
  dplyr::select(time, sex_bin_recode) %>%
  correlation::correlation()

## Parameter1 |     Parameter2 |    r |       95% CI |    t | df |     p |  Method | n_Obs
## ---------------------------------------------------------------------------------------
## time       | sex_bin_recode | 0.38 | [0.14, 0.58] | 3.11 | 58 | 0.003 | Pearson |    60

cats_tib %>% 
  dplyr::select(time, sex_bin) %>%
  correlation::correlation(method = "biserial")

## Parameter1 | Parameter2 |   rho |         95% CI |     t | df |      p |   Method | n_Obs
## -----------------------------------------------------------------------------------------
## time       |    sex_bin | -0.47 | [-0.65, -0.25] | -4.03 | 58 | < .001 | Biserial |    60

Partial correlations

exam_tib %>% 
  dplyr::select(exam_grade, revise, anxiety) %>% 
  correlation::correlation(., partial = TRUE)

## Parameter1 | Parameter2 |     r |         95% CI |     t |  df |      p |  Method | n_Obs
## -----------------------------------------------------------------------------------------
## exam_grade |     revise |  0.13 | [-0.06,  0.32] |  1.35 | 101 | 0.182  | Pearson |   103
## exam_grade |    anxiety | -0.25 | [-0.42, -0.06] | -2.56 | 101 | 0.024  | Pearson |   103
## revise     |    anxiety | -0.65 | [-0.75, -0.52] | -8.56 | 101 | < .001 | Pearson |   103

Comparing rs

The correlation between exam grade and anxiety for self-identifying males:

exam_tib %>% 
  dplyr::filter(sex == "Male") %>% 
  dplyr::select(exam_grade, anxiety) %>% 
  correlation::correlation()

## Parameter1 | Parameter2 |     r |         95% CI |     t | df |      p |  Method | n_Obs
## ----------------------------------------------------------------------------------------
## exam_grade |    anxiety | -0.51 | [-0.68, -0.27] | -4.14 | 50 | < .001 | Pearson |    52

The correlation between exam grade and anxiety for self-identifying females:

exam_tib %>% 
  dplyr::filter(sex == "Female") %>% 
  dplyr::select(exam_grade, anxiety) %>% 
  correlation::correlation()

## Parameter1 | Parameter2 |     r |         95% CI |     t | df |     p |  Method | n_Obs
## ---------------------------------------------------------------------------------------
## exam_grade |    anxiety | -0.38 | [-0.59, -0.12] | -2.89 | 49 | 0.006 | Pearson |    51

exam_male <- exam_tib %>% 
  dplyr::filter(sex == "Male")

exam_female <- exam_tib %>% 
  dplyr::filter(sex == "Female")

WRS2::twopcor(x1 = exam_male$exam_grade, y1 = exam_male$anxiety, x2 = exam_female$exam_grade, y2 = exam_female$anxiety)

## Call:
## WRS2::twopcor(x1 = exam_male$exam_grade, y1 = exam_male$anxiety, 
##     x2 = exam_female$exam_grade, y2 = exam_female$anxiety)
## 
## First correlation coefficient: -0.5057
## Second correlation coefficient: -0.3814
## Confidence interval (difference): -0.4574 0.1315

Last updated on May 29, 2025