--- title: "R code in Chapter 8" author: "Andy P. Field" date: 'Last updated `r format(Sys.Date(), "%d %B %Y")`' output: html_document --- ```{r, include = FALSE} knitr::opts_chunk$set( warning = FALSE, message = FALSE ) library(magrittr) library(dplyr) library(ggplot2) ``` *** ## Usage This file contains code relevant to chapter 1 of [Discovering Statistics Using R and RStudio](https://www.discovr.rocks) by [Andy Field](https://www.discoveringstatistics.com/about). These files contain abridged sections from the book so there are some copyright considerations but I offer them under a [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License](http://creativecommons.org/licenses/by-nc-nd/4.0/).^[Basically you can use this material for teaching and non-profit activities but do not meddle with it or claim it as your own work.] *** ## Load packages Remember to load the tidyverse: ```{r, eval = F} library(tidyverse) ``` ## Load the data Remember to install the package for the book with `library(discovr)`. After which you can load data into a tibble by executing: `name_of_tib <- discovr::name_of_data` For example, execute these lines to create the tibbles referred to in the chapter: ```{r} album_tib <- discovr::album_sales df_tib <- discovr::df_beta pub_tib <- discovr::pubs ``` If you want to read the file from the CSV and you have set up your project folder as I suggest in Chapter 1, then the general format of the code you would use is: ```{r, eval = FALSE} tibble_name <- here::here("../data/name_of_file.csv") %>% readr::read_csv() %>% dplyr::mutate( ... code to convert character variables to factors ... ) ``` In which you'd replace `tibble_name` with the name you want to assign to the tibble and change `name_of_file.csv` to the name of the file that you're trying to load. You can use mutate to convert categorical variables to factors. For example, for the `album_sales` data you would load it by executing: ```{r, eval = FALSE} library(here) album_tib <- here::here("data/album_sales.csv") %>% readr::read_csv() ``` ## Self test on df_beta Load the data and fit the model with all cases. ```{r} df_tib <- discovr::df_beta df_lm <- df_tib %>% lm(y ~ x, data = .) df_lm %>% broom::tidy() ``` Fit the model with case 30 excluded. ```{r} df_tib %>% dplyr::filter(case != 30) %>% lm(y ~ x, data = .) %>% broom::tidy() ``` Other diagnostics (look at the values for case 30): ```{r} dfbeta(df_lm) ``` ## Jane superbrain influential cases Load the data and fit the model with all cases. ```{r} pub_tib <- discovr::pubs pub_lm <- pub_tib %>% lm(mortality ~ pubs, data = .) pub_lm %>% broom::tidy() ``` Save influential cases ```{r} pub_inf <- pub_lm %>% broom::augment() %>% dplyr::rename( Residual = .resid, `Cook's distance` = .cooksd, Leverage = .hat, ) %>% dplyr::mutate( District = 1:8, `DF beta (intercept)` = dfbeta(pub_lm)[, 1] %>% round(., 3), `DF beta (pubs)` = dfbeta(pub_lm)[, 2] %>% round(., 3), ) %>% dplyr::select(District, Residual, `Cook's distance`, Leverage, `DF beta (intercept)`, `DF beta (pubs)`) %>% round(., 3) pub_inf ``` ## Summary information for album sales ### Summary statistics ```{r} album_sum <- album_tib %>% tidyr::pivot_longer( cols = adverts:image, names_to = "Variable", values_to = "value" ) %>% dplyr::group_by(Variable) %>% dplyr::summarize( Mean = mean(value, na.rm = TRUE), SD = sd(value, na.rm = TRUE), `95% CI Lower` = ggplot2::mean_cl_normal(value)$ymin, `95% CI Upper` = ggplot2::mean_cl_normal(value)$ymax, Min = min(value, na.rm = TRUE), Max = max(value, na.rm = TRUE), ) album_sum ``` ### Scatterplot of album sales against advertising ```{r} ggplot2::ggplot(album_tib, aes(adverts, sales)) + geom_point(colour = "#5c97bf") + coord_cartesian(ylim = c(0, 400), xlim = c(0, 2500)) + scale_x_continuous(breaks = seq(0, 2500, 500)) + scale_y_continuous(breaks = seq(0, 400, 100)) + labs(x = "Advertising budget (thousands)", y = "Album sales (thousands)") + geom_smooth(method = "lm", colour = "#d47500", fill = "#d47500", alpha = 0.2) + theme_minimal() ``` ## Fit a model with one predictor ```{r} album_lm <- lm(sales ~ adverts, data = album_tib, na.action = na.exclude) ``` get text output for the model: ```{r} summary(album_lm) ``` Get the confidence intervals for the model parameters old-style: ```{r} confint(album_lm) ``` Or, better .... use `broom()` to get nicely tabulated fit statistics: ```{r} broom::glance(album_lm) ``` Finally, you can compute multiple *R*: ```{r} summary(album_lm)$r.squared %>% sqrt() ``` Also use `broom()` to get nicely tabulated parameter estimates (including confidence intervals): ```{r} broom::tidy(album_lm, conf.int = TRUE) ``` Impress your friends by rounding values to 3 decimal places using `mutate_if()` ```{r} broom::tidy(album_lm, conf.int = TRUE) %>% dplyr::mutate_if( vars(is.numeric(.)), list(~round(., 3)) ) ``` Or by using the `pixiedust` package: ```{r} broom::tidy(album_lm, conf.int = TRUE) %>% pixiedust::dust() %>% pixiedust::sprinkle(col = 2:7, round = 3) ``` ## Fit a model with several predictor ### Bivariate correlations and scatterplots Plot the data ```{r} GGally::ggscatmat(album_tib, columns = c("adverts", "airplay", "image", "sales")) + theme_minimal() ``` ### Build the model ```{r} album_full_lm <- lm(sales ~ adverts + airplay + image, data = album_tib, na.action = na.exclude) ``` Or, use the update function: ```{r} album_full_lm <- update(album_lm, .~. + airplay + image) ``` ### Fit statistics ```{r} broom::glance(album_full_lm) ``` multiple R-square: ```{r} summary(album_full_lm)$r.squared %>% sqrt() ``` ### Compare models ```{r} anova(album_lm, album_full_lm) %>% broom::tidy() ``` ### Parameter estimates As text: ```{r} summary(album_full_lm) confint(album_full_lm) ``` As a nicely formatted table: ```{r} broom::tidy(album_full_lm, conf.int = TRUE) ``` Restrict the output to 3 decimal places ```{r} broom::tidy(album_full_lm, conf.int = TRUE) %>% dplyr::mutate_if( vars(is.numeric(.)), list(~round(., 3)) ) ``` ### Standardized betas ```{r} parameters::model_parameters(album_full_lm, standardize = "refit", digits = 3) ``` ### Influence measures ```{r} album_full_rsd <- album_full_lm %>% broom::augment() %>% tibble::rowid_to_column(var = "case_no") album_full_rsd ``` ```{r} album_full_inf <- influence.measures(album_full_lm) album_full_inf <- album_full_inf$infmat %>% tibble::as_tibble() %>% tibble::rowid_to_column(var = "case_no") album_full_inf ``` If you want to keep all of your diagnostic stats in one object: ```{r} album_full_rsd <- album_full_rsd %>% dplyr::left_join(., album_full_inf, by = "case_no") %>% dplyr::select(-c(cook.d, hat)) ``` ### Variance inflation factor ```{r} car::vif(album_full_lm) 1/car::vif(album_full_lm) car::vif(album_full_lm) %>% mean() ``` ### Casewise diagnostics ```{r} get_cum_percent <- function(var, cut_off = 1.96){ ecdf_var <- abs(var) %>% ecdf() 100*(1 - ecdf_var(cut_off)) } album_full_rsd %>% dplyr::summarize( `z >= 1.96` = get_cum_percent(.std.resid), `z >= 2.58` = get_cum_percent(.std.resid, cut_off = 2.58), `z >= 3.29` = get_cum_percent(.std.resid, cut_off = 3.29) ) ``` ```{r} album_full_rsd %>% dplyr::filter(abs(.std.resid) >= 1.96) %>% dplyr::select(case_no, .std.resid, .resid) %>% dplyr::arrange(.std.resid) ``` ```{r} album_full_inf %>% dplyr::arrange(desc(cook.d)) %>% dplyr::mutate_if( vars(is.numeric(.)), list(~round(., 3)) ) ``` Show case with any df beta with an absolute value greater than 1. ```{r} album_full_inf %>% dplyr::filter_at( vars(starts_with("dfb")), any_vars(abs(.) > 1) ) %>% dplyr::select(case_no, starts_with("dfb")) %>% dplyr::mutate_if( vars(is.numeric(.)), list(~round(., 3)) ) ``` Show cases with problematic leverage or CVR: ```{r} leverage_thresh <- 3*mean(album_full_inf$hat, na.rm = TRUE) album_full_inf %>% dplyr::filter( hat > leverage_thresh | !between(cov.r, 1 - leverage_thresh, 1 + leverage_thresh) ) %>% dplyr::select(case_no, cov.r, hat, cook.d) %>% dplyr::mutate_if( vars(is.numeric(.)), list(~round(., 3)) ) ``` ### Diagnostic plots Standard influence plots ```{r} plot(album_full_lm, which = 4:6) ``` Residual plot: ```{r} plot(album_full_lm, which = c(1, 3)) ``` Q-Q plot: ```{r} plot(album_full_lm, which = 2) ``` ### Prettier diagnostic plots The `ggfortify` package allows us to produce prettier versions of the diagnostic plots. For example: influence plots ```{r} library(ggfortify) ggplot2::autoplot(album_full_lm, which = 4:6, colour = "#5c97bf", smooth.colour = "#ef4836", alpha = 0.5, size = 1) + theme_minimal() ``` Residual plot: ```{r} ggplot2::autoplot(album_full_lm, which = c(1, 3), colour = "#5c97bf", smooth.colour = "#ef4836", alpha = 0.5, size = 1) + theme_minimal() ``` Q-Q plot: ```{r} ggplot2::autoplot(album_full_lm, which = 2, colour = "#5c97bf", smooth.colour = "#ef4836", alpha = 0.5, size = 1) + theme_minimal() ``` ### Robust linear model Robust estimates: ```{r} album_full_rob <- robust::lmRob(sales ~ adverts + airplay + image, data = album_tib, na.action = na.exclude) summary(album_full_rob) ``` Robust standard errors and confidence intervals: ```{r} parameters::model_parameters( album_full_lm, robust = TRUE, vcov.type = "HC4", digits = 3 ) ``` Bootstrap ```{r} parameters::model_parameters( album_full_lm, bootstrap = TRUE, digits = 3 ) ``` ### Bayes factors ```{r} album_bf <- album_tib %>% BayesFactor::regressionBF(sales ~ adverts + airplay + image, rscaleCont = "medium", data = .) album_bf ``` ### Bayesian parameter estimates ```{r} album_full_bf <- album_tib %>% BayesFactor::lmBF(sales ~ adverts + airplay + image, rscaleCont = "medium", data = .) album_full_post <- BayesFactor::posterior(album_full_bf, iterations = 10000) summary(album_full_post) ```