# Load the packages used in this script
library(dplyr) # for the %>% pipe and select()
library(nlme)  # for lme()
library(lme4)  # for lmer()

setwd("F:/Google Drive/Courses and lab meetings/2018 courses/Statistical methods/entom4940/Mixed models")
data <-
"exerc1.csv" %>%
read.csv %>%
select(IQ, sex, age, ses, distcat, school, BMI, Pb)
“Mixed modeling is rarely, if ever, covered in even upper-level statistics courses. Trying to learn it on your own is like prepping to perform your own appendectomy - and just about as painful…”
– https://thecraftofstatisticalanalysis.com/random-intercept-random-slope-models/
Mixed models are a form of linear modeling used for hierarchical data when the response variable has a normal distribution and the predictor variables are a mix of fixed and random effects. These models are also good when data points might not be fully independent of each other, for example students grouped into schools or plants grouped into quadrats.
When analyzing data that involves repeated measures for the same subject, mixed models can be a better choice than a repeated measures ANOVA for a few reasons, including:
A mixed model can handle missing values, but a repeated measures ANOVA must drop the subject entirely if it is missing even a single measurement.
A mixed model can handle hierarchical clustering, but a repeated measures ANOVA cannot.
Repeated measures can be spaced at irregular intervals when using a mixed model.
Hierarchical data are organized by grouping factors like populations, species, or sites. This is different from clustered data, where individual points may belong to more than one group at the same level. An example of hierarchical data might be students clustered within schools.
A mixed model is a type of linear model with multiple predictor variables.
Y ~ B0 + B1 X1 + B2 X2 + ... + E
The coefficients tell you how much the response variable changes for one unit of change in the predictor variables. B0 is the intercept and E is the error term - the amount of variability not explained by your model.
What makes something a mixed model is that the various effect terms (e.g. B1, B2) are a mix of fixed and random effects.
Fixed effects are the same no matter how much or where you sample. For example, sex is a fixed effect because you have your values male, female, and other, and no matter how much data you sample you’ll still have those same factor levels.
Random effects are factors for which the levels are samples from a larger population about which we’re trying to draw conclusions. Individual subject ID, for example, is a random effect because a) you could go out and sample more individuals who would be different from the ones you have and b) you’re inferring something about the whole population from the sub-group included in your analysis.
Another way to think about random effects is that they capture variation that exists but that isn’t relevant to your question, like variation between individual subjects for which you have repeated measures. Where you have non-independence or pseudoreplication, include a random effect for that element even if it is not significant in the model, so that reviewers know you properly accounted for it.
An interaction occurs when one variable changes how another variable affects the response variable. When including an interaction term, be sure to leave in lower order terms for each interaction.
An example from https://www.theanalysisfactor.com/the-difference-between-crossed-and-nested-factors/:
Imagine you have 18 people assigned to two training groups, and their BMI is measured at three time points. There are both crossed and nested factors in this experiment.
Crossed: Subject is crossed with time because every combination of subject and time point is represented by a value of BMI.
Nested: Each subject is assigned to only one training group, and so not every possible combination of subject and group is represented by a value of BMI. By knowing the subject ID, you know exactly which training group they belong to. Thus subject is nested within training group.
For crossed factors, you can look at an interaction term. For nested factors, you can’t.
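This distinction shows up in the model formula. As a rough sketch of the BMI example above (the data frame `bmi.data` and the variables `group`, `subject`, and `time` are hypothetical, since that example is not part of our dataset), nesting is written with a `/` in both packages:

```r
# Hypothetical BMI example: subject nested within training group.
# nlme syntax - nesting written as group/subject:
library(nlme)
m.nested <- lme(BMI ~ time, random = ~1 | group/subject,
                na.action = na.omit, data = bmi.data)

# lme4 syntax for the same nesting structure:
library(lme4)
m.nested2 <- lmer(BMI ~ time + (1 | group/subject), data = bmi.data)
```

A crossed pair of factors (like group and time, both fixed here) can be given an interaction term as usual; a subject-by-group interaction cannot be fit, because each subject appears in only one group.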
Avoid overfitting, which occurs when your model has too many predictor variables for its sample size. An overfitted mixed model will often fail to converge, and you’ll see a warning to that effect.
Categorical variables need to be recoded before analysis, either as numeric dummy codes or as factors (as.factor is one way to do the latter). For example, in this data, sex is already recoded from M/F to 0/1, and socioeconomic status (ses) from low, medium, and high to 0, 1, and 2.
Standardizing your explanatory variables to a mean of zero and standard deviation of one (using the function scale) is also useful. It makes it easier to compare effect sizes because all your estimated coefficients are on the same scale. Centering in this way also helps with interpreting the intercept of your model.
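As a small sketch using variables from this dataset, standardizing with scale might look like:

```r
# Standardize continuous predictors to mean 0, sd 1
data$age.z <- scale(data$age)
data$Pb.z  <- scale(data$Pb)
data$BMI.z <- scale(data$BMI)

# scale() both centers (subtracts the mean) and scales (divides by
# the sd) by default; use scale(x, scale = FALSE) to only center.
```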
Let’s take a look at the data:
str(data)
## 'data.frame': 457 obs. of 8 variables:
## $ IQ : int 99 103 134 71 97 111 100 111 99 110 ...
## $ sex : int 1 1 0 0 0 0 0 1 1 1 ...
## $ age : num 6.67 7 7.17 6.92 8.5 ...
## $ ses : int 1 1 0 1 1 0 1 1 1 1 ...
## $ distcat: int 1 1 1 1 1 1 1 1 1 1 ...
## $ school : int 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : num 17 20 13 16.3 16.5 ...
## $ Pb : num 6.47 5.93 12.96 10.39 9.15 ...
summary(data)
## IQ sex age ses
## Min. : 55 Min. :0.000 Min. :6.167 Min. :0.000
## 1st Qu.: 93 1st Qu.:0.000 1st Qu.:6.667 1st Qu.:1.000
## Median :103 Median :0.000 Median :6.917 Median :1.000
## Mean :102 Mean :0.442 Mean :6.945 Mean :0.831
## 3rd Qu.:112 3rd Qu.:1.000 3rd Qu.:7.167 3rd Qu.:1.000
## Max. :145 Max. :1.000 Max. :8.500 Max. :2.000
## NA's :7 NA's :19
## distcat school BMI Pb
## Min. :0.0000 Min. : 1.000 Min. :12.74 Min. : 1.99
## 1st Qu.:0.0000 1st Qu.: 3.000 1st Qu.:15.26 1st Qu.: 7.60
## Median :1.0000 Median : 7.000 Median :16.28 Median :10.57
## Mean :0.5733 Mean : 7.112 Mean :16.94 Mean :11.98
## 3rd Qu.:1.0000 3rd Qu.:11.000 3rd Qu.:17.83 3rd Qu.:14.30
## Max. :1.0000 Max. :14.000 Max. :29.21 Max. :43.82
## NA's :3 NA's :4
Variables with a mean near 0 were already standardized using the scale() function.
Q: Does the distance category of a school from a foundry, which could potentially be causing lead pollution, reduce the IQ of schoolchildren going to that school?
Just looking at a linear model of data from 14 schools near the foundry, you would think so, but there are some problems with this analysis.
m.linear <- lm(IQ ~ sex + age + ses + Pb + distcat, data = data)
summary(m.linear)
##
## Call:
## lm(formula = IQ ~ sex + age + ses + Pb + distcat, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.368 -9.048 0.653 9.698 37.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.6055 13.7790 9.333 < 2e-16 ***
## sex -2.8208 1.4476 -1.949 0.05200 .
## age -3.6602 1.9464 -1.880 0.06073 .
## ses 2.0456 1.3141 1.557 0.12029
## Pb -0.3352 0.1278 -2.623 0.00903 **
## distcat 4.3880 1.6487 2.662 0.00807 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.89 on 429 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.08475, Adjusted R-squared: 0.07408
## F-statistic: 7.945 on 5 and 429 DF, p-value: 3.566e-07
anova(m.linear)
## Analysis of Variance Table
##
## Response: IQ
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 921 921.3 4.1559 0.042103 *
## age 1 1107 1106.5 4.9915 0.025986 *
## ses 1 946 945.8 4.2667 0.039465 *
## Pb 1 4262 4261.8 19.2251 1.463e-05 ***
## distcat 1 1570 1570.3 7.0837 0.008071 **
## Residuals 429 95100 221.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This analysis yields incorrect standard errors because it violates the assumption that the residuals are independent. The subjects are grouped into schools in this dataset, and that grouping is not accounted for by a linear model. Note the inflated degrees of freedom for distcat that result from its pseudoreplication.
Because the data are hierarchical, with multiple measurements at multiple levels of organization (e.g. school and student), a mixed model would be better here. We’ll use the nlme package for our models, but the lme4 package has a very similar format.
To estimate the variability in IQ within vs. between schools, we can run a model with only the random effect school. This lets the intercept vary by school.
# using nlme package
m1 <- lme(IQ ~ 1, random = ~1|school, method = "REML", na.action = na.omit, data = data)
summary(m1)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3713.927 3726.249 -1853.964
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 5.290319 14.58551
##
## Fixed effects: IQ ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 102.5002 1.586903 436 64.59134 0
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.48155658 -0.57505992 0.03681583 0.66896768 2.54776677
##
## Number of Observations: 450
## Number of Groups: 14
# using lme4 package
m2 <- lmer(IQ ~ 1 + (1|school), REML = TRUE, na.action = na.omit, data = data)
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: IQ ~ 1 + (1 | school)
## Data: data
##
## REML criterion at convergence: 3707.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.4816 -0.5751 0.0368 0.6690 2.5478
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 27.99 5.29
## Residual 212.74 14.59
## Number of obs: 450, groups: school, 14
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 102.500 1.587 64.59
The 1s used in these models are constants that act as placeholders when you aren’t using covariates in that position. The random covariate to the right of the bar (here, school) allows the intercept to vary with that covariate.
Parameters (like p-value, standard error, etc.) are only going to be estimated for the fixed effects, not the random effects.
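You can still inspect the estimated school-level deviations themselves; a quick sketch using the m1 fit above:

```r
# Estimated random intercepts (one row per school): each school's
# deviation from the overall mean IQ.
ranef(m1)

# Per-school fitted intercepts = overall intercept + deviation:
coef(m1)
```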
Let’s add covariates to our model.
m3 <- lme(IQ ~ sex + age + ses + Pb + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m3)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3575.062 3607.554 -1779.531
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 3.847656 14.49946
##
## Fixed effects: IQ ~ sex + age + ses + Pb + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 128.67968 13.655491 417 9.423292 0.0000
## sex -2.71006 1.421269 417 -1.906792 0.0572
## age -3.63146 1.912447 417 -1.898855 0.0583
## ses 1.94821 1.297026 417 1.502057 0.1338
## Pb -0.30564 0.126659 417 -2.413125 0.0162
## distcat 4.18923 2.646270 12 1.583070 0.1394
## Correlation:
## (Intr) sex age ses Pb
## sex -0.070
## age -0.975 0.011
## ses -0.129 -0.020 0.052
## Pb -0.122 0.080 -0.029 0.041
## distcat -0.151 0.056 0.011 -0.039 0.276
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.38994659 -0.63216092 0.06822743 0.66448634 2.32685110
##
## Number of Observations: 435
## Number of Groups: 14
The estimate for each fixed effect represents the change in the response variable when that fixed effect changes by one unit, with all other fixed effects held constant. The estimate for the intercept gives the value of the response variable when all the other fixed effects are at zero - which is why it’s helpful to have centered your predictors.
Check the estimated variance for the random effect. If it’s indistinguishable from zero, then that random effect doesn’t matter and you can do a regular linear model instead (which has only fixed effects).
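One way to inspect the variance components, sketched here with the m3 fit above (the ICC calculation plugs in the StdDev values reported in summary(m3)):

```r
# Variance components: school-level (Intercept) variance vs. the
# residual (within-school) variance.
VarCorr(m3)

# Intraclass correlation (ICC): the share of total variance that is
# between schools, from the StdDev values in summary(m3).
icc <- 3.847656^2 / (3.847656^2 + 14.49946^2)
icc  # ~0.066, i.e. roughly 7% of the variance is between schools
```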
Also check the correlation between fixed effects (not with the intercept) - if fixed effects are highly correlated, you have a situation called “multicollinearity.” These highly correlated effects should not be included in one model together. Instead, you can make separate models that include either one or the other, and compare the models using model comparison to decide which covariate to keep.
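A quick way to screen for multicollinearity before fitting, sketched with this dataset’s predictors:

```r
# Pairwise correlations among the predictors; values near +/-1
# signal multicollinearity. pairwise.complete.obs handles the NAs.
cor(data[, c("sex", "age", "ses", "Pb", "distcat")],
    use = "pairwise.complete.obs")
```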
What about situations when there is a different relationship between the predictor and response variables across the different levels of the random factor? You can allow each random factor to vary not only in intercept but also in slope.
Here we allow each school to have its own slope of age. This will often improve the model.
m4 <- lme(IQ ~ sex + age + ses + Pb + distcat, random = ~age|school, na.action = na.omit, data = data)
summary(m4)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3577.953 3618.568 -1778.976
##
## Random effects:
## Formula: ~age | school
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 30.820707 (Intr)
## age 4.132281 -0.994
## Residual 14.429234
##
## Fixed effects: IQ ~ sex + age + ses + Pb + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 133.45693 16.194295 417 8.240984 0.0000
## sex -2.83424 1.420780 417 -1.994849 0.0467
## age -4.26620 2.247452 417 -1.898236 0.0584
## ses 1.82189 1.296208 417 1.405554 0.1606
## Pb -0.30010 0.126330 417 -2.375505 0.0180
## distcat 3.69322 2.575211 12 1.434141 0.1771
## Correlation:
## (Intr) sex age ses Pb
## sex -0.065
## age -0.983 0.016
## ses -0.113 -0.020 0.047
## Pb -0.101 0.075 -0.027 0.045
## distcat -0.131 0.058 0.014 -0.029 0.284
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.43017187 -0.62288526 0.07439453 0.66379549 2.27900486
##
## Number of Observations: 435
## Number of Groups: 14
Check that the model fit is improved:
anova(m3,m4)
## Model df AIC BIC logLik Test L.Ratio p-value
## m3 1 8 3575.062 3607.554 -1779.531
## m4 2 10 3577.953 3618.568 -1778.976 1 vs 2 1.108999 0.5744
No improvement is observed based on a likelihood ratio test. According to the principle of parsimony, the p-value would need to be below 0.05 in order to justify keeping our more complex model.
You can also include interaction terms as fixed effects. When using an interaction term, remember to still include the components of the interaction as individual fixed effects.
m5 <- lme(IQ ~ sex + age + sex * age, random = ~1|school, na.action = na.omit, data = data)
summary(m5)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3701.684 3726.286 -1844.842
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 5.102384 14.51071
##
## Fixed effects: IQ ~ sex + age + sex * age
## Value Std.Error DF t-value p-value
## (Intercept) 132.86985 17.377401 433 7.646129 0.0000
## sex -5.68098 25.657878 433 -0.221413 0.8249
## age -4.20830 2.490453 433 -1.689772 0.0918
## sex:age 0.44793 3.685666 433 0.121533 0.9033
## Correlation:
## (Intr) sex age
## sex -0.676
## age -0.995 0.677
## sex:age 0.676 -0.999 -0.679
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.42321384 -0.60395169 0.06851778 0.67403992 2.41075782
##
## Number of Observations: 450
## Number of Groups: 14
This interaction term is not significant, so it should be removed from the model.
Check whether specific fixed and random effects are needed by comparing models. Start with your most complex model containing all your effects of interest, then drop terms one by one, making a new model each time. Drop the highest-order interaction terms with the highest p-values first.
m6 <- lme(IQ ~ sex + age + ses + distcat + sex * age, random = ~1|school, na.action = na.omit, data = data)
summary(m6)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3599.465 3632.013 -1791.733
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.061579 14.59614
##
## Fixed effects: IQ ~ sex + age + ses + distcat + sex * age
## Value Std.Error DF t-value p-value
## (Intercept) 126.83108 17.718289 420 7.158202 0.0000
## sex -5.93605 26.457664 420 -0.224360 0.8226
## age -4.06239 2.517327 420 -1.613773 0.1073
## ses 1.92242 1.299314 420 1.479569 0.1397
## distcat 6.00745 2.642216 12 2.273642 0.0422
## sex:age 0.49846 3.801605 420 0.131118 0.8957
## Correlation:
## (Intr) sex age ses distct
## sex -0.655
## age -0.991 0.661
## ses -0.110 0.005 0.054
## distcat -0.096 0.004 0.013 -0.049
## sex:age 0.654 -0.999 -0.662 -0.006 -0.002
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.31719015 -0.59891749 0.05048137 0.69607666 2.37667763
##
## Number of Observations: 438
## Number of Groups: 14
m7 <- lme(IQ ~ sex + age + ses + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m7)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3601.99 3630.485 -1793.995
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.054413 14.57974
##
## Fixed effects: IQ ~ sex + age + ses + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 125.31154 13.388306 421 9.359775 0.0000
## sex -2.47208 1.417782 421 -1.743624 0.0820
## age -3.84392 1.884645 421 -2.039599 0.0420
## ses 1.92354 1.297822 421 1.482133 0.1391
## distcat 6.00839 2.638062 12 2.277578 0.0419
## Correlation:
## (Intr) sex age ses
## sex -0.047
## age -0.985 -0.002
## ses -0.140 -0.018 0.067
## distcat -0.124 0.035 0.015 -0.049
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.32311635 -0.60335879 0.04583759 0.69979123 2.37919401
##
## Number of Observations: 438
## Number of Groups: 14
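One caveat when comparing models that differ in their fixed effects (as m6 and m7 do) with a likelihood ratio test: the models should be refit with maximum likelihood (ML), because REML likelihoods are not comparable across different fixed-effect structures. A sketch:

```r
# Refit both models with ML before testing the dropped term
m6.ml <- update(m6, method = "ML")
m7.ml <- update(m7, method = "ML")
anova(m6.ml, m7.ml)  # likelihood ratio test on the sex:age term
```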
m8 <- lme(IQ ~ sex + age + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m8)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3697.751 3722.353 -1842.876
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.187035 14.49864
##
## Fixed effects: IQ ~ sex + age + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 127.57274 12.888083 434 9.898504 0.0000
## sex -2.50206 1.389976 434 -1.800077 0.0725
## age -3.95704 1.826139 434 -2.166889 0.0308
## distcat 6.11356 2.679116 12 2.281933 0.0415
## Correlation:
## (Intr) sex age
## sex -0.042
## age -0.986 -0.011
## distcat -0.139 0.041 0.019
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.33360320 -0.61168680 0.05185314 0.65908991 2.43283910
##
## Number of Observations: 450
## Number of Groups: 14
m9 <- lme(IQ ~ age + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m9)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3701.48 3721.993 -1845.74
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.261853 14.53057
##
## Fixed effects: IQ ~ age + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 126.58946 12.909473 435 9.805935 0.0000
## age -3.99042 1.830186 435 -2.180338 0.0298
## distcat 6.30695 2.713016 12 2.324701 0.0384
## Correlation:
## (Intr) age
## age -0.987
## distcat -0.138 0.020
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.4189986 -0.5865094 0.0714204 0.6599321 2.5134284
##
## Number of Observations: 450
## Number of Groups: 14
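Before settling on a final model, it is also worth checking the residuals; a quick sketch using the m9 fit above:

```r
# Residuals vs. fitted values - look for a patternless cloud
plot(m9)

# Normal Q-Q plot of the within-group residuals
qqnorm(resid(m9))
qqline(resid(m9))
```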
Further reading:
https://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
https://gkhajduk.github.io/2017-03-09-mixed-models/
https://www.cscu.cornell.edu/workshops/multilevel.php
The CSCU website is the source for the dataset used in this script.