# Smart Alex solutions Chapter 8

This document contains abridged sections from Discovering Statistics Using R and RStudio by Andy Field so there are some copyright considerations. You can use this material for teaching and non-profit activities but please do not meddle with it or claim it as your own work. See the full license terms at the bottom of the page.

We looked at data based on findings that the number of cups of tea drunk was related to cognitive functioning (Feng, Gwee, Kua, & Ng, 2010). Using a linear model that predicts cognitive functioning from tea drinking, what would cognitive functioning be if someone drank 10 cups of tea? Is there a significant effect? (Tea Makes You Brainy 716.csv)

Load the data and fit the model

tea716_tib <- discovr::tea_716

tea_lm <- lm(cog_fun ~ tea, data = tea716_tib, na.action = na.exclude)
tea_lm %>%
broom::tidy() %>%
knitr::kable(digits = 3)

termestimatestd.errorstatisticp.value
(Intercept)49.2180.76464.3820.000
tea0.4600.2212.0810.038

Looking at the output below, we can see that we have a model that significantly improves our ability to predict cognitive functioning. The positive beta value (0.46) indicates a positive relationship between number of cups of tea drunk per day and level of cognitive functioning, in that the more tea drunk, the higher your level of cognitive functioning. We can then use the model to predict level of cognitive functioning after drinking 10 cups of tea per day. The first stage is to define the model by replacing the b-values in the equation below with the values from the Coefficients output. In addition, we can replace the X and Y with the variable names so that the model becomes:

\begin{aligned} \text{Cognitive functioning}_i &= b_0 + b_1 \text{Tea drinking}_i \\ &= 49.22 +(0.460 \times \text{Tea drinking}_i) \end{aligned}

We can predict cognitive functioning, by replacing Tea drinking in the equation with the value 10:

\begin{aligned} \text{Cognitive functioning}_i &= 49.22 +(0.460 \times \text{Tea drinking}_i) \\ &= 49.22 +(0.460 \times 10) \\ &= 53.82 \end{aligned}

Therefore, if you drank 10 cups of tea per day, your level of cognitive functioning would be 53.82.

Estimate a linear model for the pubs.csv data predicting mortality from the number of pubs. Try repeating the analysis but bootstrapping the confidence intervals.

Load the data and fit the model

pubs_tib <- discovr::pubs

pub_lm <- lm(mortality ~ pubs, data = pubs_tib, na.action = na.exclude)

pub_lm %>%
broom::glance() %>%
knitr::kable(digits = 3)

0.6490.5911864.43111.1170.0161-70.446146.893147.1312085661168
pub_lm %>%
broom::tidy() %>%
knitr::kable(digits = 3)

termestimatestd.errorstatisticp.value
(Intercept)3351.955781.2364.2910.005
pubs14.3394.3013.3340.016

Looking at the output, we can see that the number of pubs significantly predicts mortality, t(6) = 3.33, p = 0.016. The positive beta value (14.34) indicates a positive relationship between number of pubs and death rate in that, the more pubs in an area, the higher the rate of mortality (as we would expect). The value of $R^2$ tells us that number of pubs accounts for 64.9% of the variance in mortality rate – that’s over half!

pub_lm %>%
parameters::model_parameters(bootstrap = TRUE) %>%
knitr::kable(digits = 3)

ParameterCoefficientCI_lowCI_highp
(Intercept)2777.6100.0004710.8240.523
pubs15.17511.161100.0000.000

The bootstrapped confidence intervals are both positive values – they do not cross zero (10.92, 100.00) – then assuming this interval is one of the 95% that contain the population value we can gain confidence that there is a positive and non-zero relationship between number of pubs in an area and its mortality rate.

We encountered data (honesty_lab.csv) relating to people’s ratings of dishonest acts and the likeableness of the perpetrator. Run a linear model with bootstrapping to predict ratings of dishonesty from the likeableness of the perpetrator.

Load the data and fit the model

honesty_tib <- discovr::honesty_lab

honest_lm <- lm(deed ~ likeableness, data = honesty_tib, na.action = na.exclude)

honest_lm %>%
broom::glance() %>%
knitr::kable(digits = 3)

0.6910.6881.203219.09701-159.406324.812332.627141.94198100
honest_lm %>%
broom::tidy() %>%
knitr::kable(digits = 3)

termestimatestd.errorstatisticp.value
(Intercept)-1.8610.328-5.6730
likeableness0.9390.06314.8020

Looking at the output we can see that the likeableness of the perpetrator significantly predicts ratings of dishonest acts, t(98) = 14.80, p < 0.001. The positive beta value (0.94) indicates a positive relationship between likeableness of the perpetrator and ratings of dishonesty, in that, the more likeable the perpetrator, the more positively their dishonest acts were viewed (remember that dishonest acts were measured on a scale from 0 = appalling behaviour to 10 = it’s OK really). The value of $$R^2$$ tells us that likeableness of the perpetrator accounts for 69.1% of the variance in the rating of dishonesty, which is over half.

honest_lm %>%
parameters::model_parameters(bootstrap = TRUE) %>%
knitr::kable(digits = 3)

ParameterCoefficientCI_lowCI_highp
(Intercept)-1.854-2.521-1.3530
likeableness0.9370.8131.0690

The bootstrapped confidence intervals do not cross zero (0.82, 1.08), then assuming this interval is one of the 95% that contain the population value we can gain confidence that there is a non-zero relationship between the likeableness of the perpetrator and ratings of dishonest acts.

A fashion student was interested in factors that predicted the salaries of male and female catwalk models. She collected data from 231 models (supermodel.csv). For each model she asked them their salary per day (salary), their age (age), their length of experience as models (years), and their industry status as a model as their percentile position rated by a panel of experts (status). Use a linear model to see which variables predict a model’s salary. How valid is the model?

### The model

Load the data and fit the model

model_tib <- discovr::supermodel

model_lm <- lm(salary ~ age + years + status, data = model_tib, na.action = na.exclude)

model_lm %>%
broom::glance() %>%
knitr::kable(digits = 3)

0.1840.17314.57217.06603-944.6321899.2641916.47648202.79227231
model_lm %>%
broom::tidy() %>%
knitr::kable(digits = 3)

termestimatestd.errorstatisticp.value
(Intercept)-60.89016.497-3.6910.000
age6.2341.4114.4180.000
years-5.5612.122-2.6210.009
status-0.1960.152-1.2890.199

To begin with, a sample size of 231 with three predictors seems reasonable because this would easily detect medium to large effects (see the diagram in the chapter). Overall, the model accounts for 18.4% of the variance in salaries and is a significant fit to the data, F(3, 227) = 17.07, p < .001. The adjusted $R^2$ (0.17) shows some shrinkage from the unadjusted value (0.184), indicating that the model may not generalize well.

It seems as though salaries are significantly predicted by the age of the model. This is a positive relationship (look at the sign of the beta), indicating that as age increases, salaries increase too. The number of years spent as a model also seems to significantly predict salaries, but this is a negative relationship indicating that the more years you’ve spent as a model, the lower your salary. This finding seems very counter-intuitive, but we’ll come back to it later. Finally, the status of the model doesn’t seem to predict salaries significantly. If we wanted to write the regression model, we could write it as:

The next part of the question asks whether this model is valid.

### Residuals

model_rsd <- model_lm %>%
broom::augment() %>%
tibble::rowid_to_column(var = "case_no")


The table shows about 5% of standardized residuals have absolute values above 1.96, which is what we would expect. About 3% of standardized residuals have absolute values above 2.58, which is more than we would expect. About 2% of standardized residuals have absolute values above 3.29, which is potentially problematic.

get_cum_percent <- function(var,  cut_off = 1.96){
ecdf_var <- abs(var) %>% ecdf()
100*(1 - ecdf_var(cut_off))
}

model_rsd %>%
dplyr::summarize(
z >= 1.96 = get_cum_percent(.std.resid),
z >= 2.58 = get_cum_percent(.std.resid, cut_off = 2.58),
z >= 3.29 = get_cum_percent(.std.resid, cut_off = 3.29)
) %>%
knitr::kable(digits = 3)

z >= 1.96z >= 2.58z >= 3.29
5.1953.032.165
model_rsd %>%
dplyr::filter(abs(.std.resid) >= 1.96) %>%
dplyr::select(case_no, .std.resid, .resid) %>%
dplyr::arrange(desc(.std.resid)) %>%
knitr::kable(digits = 3)

case_no.std.resid.resid
1354.717-68.085
54.697-67.073
1983.531-51.148
1163.440-49.865
1553.319-47.458
1913.178-45.939
1272.778-40.113
412.421-35.139
242.242-32.523
22.215-31.853
1702.200-31.625
912.099-30.046

There are six cases that have a standardized residual greater than 3, and two of these are fairly substantial (case 5 and 135). We have 5.19% of cases with standardized residuals above 2, so that’s as we expect, but 3% of cases with residuals above 2.5 (we’d expect only 1%), which indicates possible outliers.

### Normality of errors

plot(model_lm, which = 2)


The normal Q-Q plot shows a highly non-normal distribution of residuals because the dashed line deviates considerably from the straight line (which indicates what you’d get from normally distributed errors).

### Homoscedasticity and independence of errors

plot(model_lm, which = c(1, 3))


The scatterplots of fitted values vs residuals do not show a random pattern. There is a distinct funnelling, and the red trend line is not flat, indicating heteroscedasticity.

### Multicollinearity

car::vif(model_lm)

##       age     years    status
## 12.652841 12.156757  1.153364

1/car::vif(model_lm)

##        age      years     status
## 0.07903364 0.08225878 0.86702902

car::vif(model_lm) %>%
mean()

## [1] 8.65432


For the age and experience variables in the model, VIF values are above 10 (or alternatively, tolerance values are all well below 0.2), indicating multicollinearity in the data. In fact, the correlation between these two variables is around .9! So, these two variables are measuring very similar things. Of course, this makes perfect sense because the older a model is, the more years she would’ve spent modelling! So, it was fairly stupid to measure both of these things! This also explains the weird result that the number of years spent modelling negatively predicted salary (i.e. more experience = less salary!): in fact if you do a simple regression with experience as the only predictor of salary you’ll find it has the expected positive relationship. This hopefully demonstrates why multicollinearity can bias the regression model. All in all, several assumptions have not been met and so this model is probably fairly unreliable.

A study was carried out to explore the relationship between aggression and several potential predicting factors in 666 children who had an older sibling. Variables measured were parenting_style (high score = bad parenting practices), computer_games (high score = more time spent playing computer games), television (high score = more time spent watching television), diet (high score = the child has a good diet low in harmful additives), and sibling_aggression (high score = more aggression seen in their older sibling). Past research indicated that parenting style and sibling aggression were good predictors of the level of aggression in the younger child. All other variables were treated in an exploratory fashion. Analyse them with a linear model (child_aggression.csv).

Load the data and fit the model. We need to conduct this analysis hierarchically, entering parenting_style and sibling_aggression in the first step (forced entry) and the remaining variables in a second step using a stepwise method (applying the step() function:

aggress_tib <- discovr::child_aggression

agress_lm <- lm(aggression ~ parenting_style + sibling_aggression, data = aggress_tib, na.action = na.exclude)
agress_full_lm <- update(agress_lm, .~. + television + computer_games + diet) %>%
step()

## Start:  AIC=-1566.61
## aggression ~ parenting_style + sibling_aggression + television +
##     computer_games + diet
##
##                      Df Sum of Sq    RSS     AIC
## - television          1   0.04817 62.288 -1568.1
## <none>                            62.240 -1566.6
## - sibling_aggression  1   0.41840 62.658 -1564.2
## - diet                1   0.77357 63.014 -1560.4
## - computer_games      1   1.39822 63.638 -1553.8
## - parenting_style     1   1.42809 63.668 -1553.5
##
## Step:  AIC=-1568.1
## aggression ~ parenting_style + sibling_aggression + computer_games +
##     diet
##
##                      Df Sum of Sq    RSS     AIC
## <none>                            62.288 -1568.1
## - sibling_aggression  1   0.48054 62.769 -1565.0
## - diet                1   0.81815 63.106 -1561.4
## - computer_games      1   1.42689 63.715 -1555.0
## - parenting_style     1   2.28530 64.573 -1546.1

agress_full_lm %>%
broom::glance() %>%
knitr::kable(digits = 3)

0.0820.0760.30714.73504-155.963323.927350.93562.288661666
agress_full_lm %>%
broom::tidy() %>%
knitr::kable(digits = 3)

termestimatestd.errorstatisticp.value
(Intercept)-0.0060.012-0.4970.619
parenting_style0.0620.0134.9250.000
sibling_aggression0.0860.0382.2580.024
computer_games0.1430.0373.8910.000
diet-0.1120.038-2.9470.003
agress_full_lm %>%
parameters::model_parameters(standardize = "refit", digits = 3) %>%
knitr::kable()

ParameterCoefficientSECI_lowCI_hightdf_errorp
(Intercept)0.00000000.0372413-0.07312560.07312560.0000006611.0000000
parenting_style0.19377320.03934810.11651090.27103564.9245866610.0000011
sibling_aggression0.08831140.03910700.01152250.16510042.2581986610.0242586
computer_games0.15348580.03944340.07603630.23093533.8912916610.0001098
diet-0.11776250.0399661-0.1962384-0.0392867-2.9465606610.0033264
agress_full_lm$anova  ## Step Df Deviance Resid. Df Resid. Dev AIC ## 1 NA NA 660 62.23999 -1566.614 ## 2 - television 1 0.04816702 661 62.28816 -1568.099  Based on the final model (which is actually all we’re interested in) the following variables predict aggression: • Parenting style (b = 0.062,$ \beta $= 0.194, t = 4.93, p < 0.001) significantly predicted aggression. The beta value indicates that as parenting increases (i.e. as bad practices increase), aggression increases also. • Sibling aggression (b = 0.086,$ \beta $= 0.088, t = 2.26, p = 0.024) significantly predicted aggression. The beta value indicates that as sibling aggression increases (became more aggressive), aggression increases also. • Computer games (b = 0.143,$ \beta $= 0.037, t= 3.89, p < .001) significantly predicted aggression. The beta value indicates that as the time spent playing computer games increases, aggression increases also. • Good diet (b = –0.112,$ \beta $= –0.118, t = –2.95, p = 0.003) significantly predicted aggression. The beta value indicates that as the diet improved, aggression decreased. The only factor not to predict aggression significantly was: • Television did not predict aggression (base don the change in the AIC). Based on the standardized beta values, the most substantive predictor of aggression was actually parenting style, followed by computer games, diet and then sibling aggression.$ R^2 \$ is the squared correlation between the observed values of aggression and the values of aggression predicted by the model, 8.2% of the variance in aggression is explained. With all four of these predictors in the model only 8.2% of the variance in aggression can be explained.

The Q-Q plot suggests that errors may well deviate from normally distributed:

plot(agress_full_lm, which = 2)


The scatterplot of predicted values vs residuals helps us to assess both homoscedasticity and independence of errors. The plot does show a random scatter, but the trend line is not completely flat. It would be useful to run a robust model.

plot(agress_full_lm, which = c(1, 3))


Using robust standard errors, sibling aggression is no longer significant.

agress_full_lm %>%
parameters::model_parameters(robust = TRUE, vcov.type = "HC4") %>%
knitr::kable(digits = 3)

ParameterCoefficientSECI_lowCI_hightdf_errorp
(Intercept)-0.0060.012-0.0290.018-0.4946610.621
parenting_style0.0620.0160.0310.0933.9056610.000
sibling_aggression0.0860.045-0.0020.1741.9276610.054
computer_games0.1430.0480.0500.2373.0196610.003
diet-0.1120.049-0.208-0.015-2.2696610.024

The influence plot looks fine. There are no extreme Cook’s distances and the trend line is fairly flat.

plot(agress_full_lm, which = 5)


Repeat the analysis in Labcoat Leni’s Real Research 8.1 using bootstrapping for the confidence intervals. What are the confidence intervals for the regression parameters?

To recap, the models are fit as follows (see also the Labcoat Leni answers). Load the data directly from the discovr package:

ong_tib <- discovr::ong_2011


Fit the models that look at whether narcissism predicts, above and beyond the other variables, the frequency of status updates.

# Fit the models
ong_base_lm <- lm(status ~ sex + age + grade, data = ong_tib, na.action = na.exclude)
ong_ext_lm <- update(ong_base_lm, .~. + extraversion)
ong_ncs_lm <- update(ong_ext_lm, .~. + narcissism)

# Compare models
anova(ong_base_lm, ong_ext_lm, ong_ncs_lm) %>%
broom::tidy() %>%
knitr::kable(digits = 3)

2461481.515NANANANA
2451458.360123.1554.0170.046
2441406.600151.7598.9790.003

Fit the models that look at whether narcissism predicts, above and beyond the other variables, the Facebook profile picture ratings.

# Fit models
prof_base_lm <- lm(profile ~ sex + age + grade, data = ong_tib, na.action = na.exclude)
prof_ext_lm <- update(prof_base_lm, .~. + extraversion)
prof_ncs_lm <- update(prof_ext_lm, .~. + narcissism)

# Compare models

anova(prof_base_lm, prof_ext_lm, prof_ncs_lm) %>%
broom::tidy()  %>%
knitr::kable(digits = 3)

1882405.233NANANANA
1872104.9691300.26429.6130
1861885.9581219.01121.6000

Get the bootstrap CIs:

ong_ncs_lm %>%
parameters::model_parameters(bootstrap = TRUE, digits = 3) %>%
knitr::kable(digits = 3)

ParameterCoefficientCI_lowCI_highp
(Intercept)0.136-5.1525.1580.943
sexMale-0.947-1.572-0.3630.004
age-0.012-0.3410.3630.933
extraversion0.010-0.0490.0630.735
narcissism0.0670.0250.1070.004

The main benefit of the bootstrap confidence intervals and significance values is that they do not rely on assumptions of normality or homoscedasticity, so they give us an accurate estimate of the true population value of b for each predictor. The bootstrapped confidence intervals in the output do not affect the conclusions reported in Ong et al. (2011). Ong et al.’s prediction was still supported in that, after controlling for age, grade and sex, narcissism significantly predicted the frequency of Facebook status updates over and above extroversion, b = 0.066 [0.03, 0.10], p < 0.001.

Get the bootstrap CIs:

prof_ncs_lm %>%
parameters::model_parameters(bootstrap = TRUE) %>%
knitr::kable(digits = 3)

ParameterCoefficientCI_lowCI_highp
(Intercept)-3.843-18.9348.5640.563
sexMale0.568-0.5391.6500.344
age0.348-0.5411.4250.426
extraversion0.1090.0230.2030.006
narcissism0.1730.1040.2420.000

Similarly, the bootstrapped confidence intervals for the second regression are consistent with the conclusions reported in Ong et al. (2011). That is, after adjusting for age, grade and sex, narcissism significantly predicted the Facebook profile picture ratings over and above extroversion, b = 0.171 [0.10, 0.24], p < 0.001.

Coldwell, Pike and Dunn (2006) investigated whether household chaos predicted children’s problem behaviour over and above parenting. From 118 families they recorded the age and gender of the youngest child (child_age and child_gender). They measured dimensions of the child’s perceived relationship with their mum: (1) warmth/enjoyment (child_warmth), and (2) anger/hostility (child_anger). Higher scores indicate more warmth/enjoyment and anger/hostility respectively. They measured the mum’s perceived relationship with her child, resulting in dimensions of positivity (mum_pos) and negativity (mum_neg). Household chaos (chaos) was assessed. The outcome variable was the child’s adjustment (sdq): the higher the score, the more problem behaviour the child was reported to be displaying. Conduct a hierarchical linear model in three steps: (1) enter child age and gender; (2) add the variables measuring parent-child positivity, parent-child negativity, parent-child warmth, parent-child anger; (3) add chaos. Is household chaos predictive of children’s problem behaviour over and above parenting? (coldwell_2006.csv).

chaos_tib <- discovr::coldwell_2006


Write a function to round values in the output!

round_values <- function(tibble, digits = 3){
require(dplyr)
tibble <- tibble %>%
dplyr::mutate(
dplyr::across(where(is.numeric), ~round(., 3))
)
return(tibble)
}


Model 1: sdq predicted from child_age and child_gender:

chaos_base_lm <- lm(sdq ~ child_age + child_age, data = chaos_tib, na.action = na.exclude)

chaos_base_lm %>%
broom::glance() %>%
round_values() %>%
knitr::kable()

0.001-0.0090.1720.0950.759137.285-68.569-60.5513.121105107
chaos_base_lm %>%
broom::tidy() %>%
round_values() %>%
knitr::kable()

termestimatestd.errorstatisticp.value
(Intercept)1.2470.1448.6420.000
child_age0.0010.0020.3080.759

Model 2: In a new block, add child_anger, child_warmth, mum_pos and mum_neg into the model:

chaos_emo_lm <- update(chaos_base_lm, .~. + child_anger + child_warmth + mum_pos + mum_neg)
chaos_emo_lm %>%
broom::glance() %>%
round_values() %>%
knitr::kable()

0.0610.0080.1741.1590.335534.65-55.301-37.352.7319096
chaos_emo_lm %>%
broom::tidy() %>%
round_values() %>%
knitr::kable()

termestimatestd.errorstatisticp.value
(Intercept)0.9430.3322.8430.006
child_age0.0010.0020.5900.556
child_anger0.0070.0170.3810.704
child_warmth-0.0080.025-0.3400.735
mum_pos0.0030.0050.5290.598
mum_neg0.0130.0062.1580.034

Model 3: In a final block, add chaos to the model:

chaos_chaos_lm <- update(chaos_emo_lm, .~. + chaos)
chaos_chaos_lm %>%
broom::glance() %>%
round_values() %>%
knitr::kable()

0.1040.0440.1711.7250.124636.934-57.867-37.3522.6048996
chaos_chaos_lm %>%
broom::tidy() %>%
round_values() %>%
knitr::kable()

termestimatestd.errorstatisticp.value
(Intercept)0.7630.3372.2660.026
child_age0.0020.0020.6560.514
child_anger0.0030.0170.1900.850
child_warmth-0.0020.025-0.0930.926
mum_pos0.0030.0050.5070.613
mum_neg0.0110.0061.8590.066
chaos0.0740.0362.0820.040
chaos_chaos_lm %>%
parameters::model_parameters(standardize = "refit", digits = 3) %>%
knitr::kable()

ParameterCoefficientSECI_lowCI_hightdf_errorp
(Intercept)0.00000000.0998042-0.19830880.19830880.0000000891.0000000
child_age0.06759260.1030422-0.13715010.27233530.6559696890.5135360
child_anger0.01989990.1047546-0.18824530.22804510.1899668890.8497677
child_warmth-0.01013730.1088309-0.22638180.2061073-0.0931469890.9259962
mum_pos0.05236310.1032264-0.15274560.25747180.5072645890.6132239
mum_neg0.19887850.1070043-0.01373680.41149391.8586029890.0663887
chaos0.21638310.10391700.00990230.42286392.0822697890.0401883

We can’t compare the models because they’re based on different amounts of data (missing values)

From the output we can conclude that household chaos significantly predicted younger sibling’s problem behaviour over and above maternal parenting, child age and gender, t(88) = 2.08, p = 0.040. The positive standardized beta value (0.216) indicates that there is a positive relationship between household chaos and child’s problem behaviour. In other words, the higher the level of household chaos, the more problem behaviours the child displayed. The value of $$R^2$$ (0.10) tells us that household chaos accounts for 10% of the variance in child problem behaviour.