# Load the packages used in this script
library(dplyr) # for the %>% pipe and select()
library(nlme)  # for lme()
library(lme4)  # for lmer()

setwd("F:/Google Drive/Courses and lab meetings/2018 courses/Statistical methods/entom4940/Mixed models")
data <-
"exerc1.csv" %>%
read.csv %>%
select(IQ, sex, age, ses, distcat, school, BMI, Pb)
“Mixed modeling is rarely, if ever, covered in even upper-level statistics courses. Trying to learn it on your own is like prepping to perform your own appendectomy - and just about as painful…”
– https://thecraftofstatisticalanalysis.com/random-intercept-random-slope-models/
Mixed models are a form of linear modeling used for hierarchical data when the response variable has a normal distribution and the predictor variables are a mix of fixed and random effects. These models are also good when data points might not be fully independent of each other, for example students grouped into schools or plants grouped into quadrats.
When analyzing data that involves repeated measures for the same subject, mixed models can be a better choice than a repeated measures ANOVA for a few reasons, including:
A mixed model can handle missing values, but a repeated measures ANOVA must drop the subject entirely if it is missing even a single measurement.
A mixed model can handle hierarchical clustering, but a repeated measures ANOVA cannot.
Repeated measures can be spaced at irregular intervals when using a mixed model.
Hierarchical data are organized by grouping factors like populations, species, or sites. This is different from clustered data, where individual points may belong to more than one group at the same level. An example of hierarchical data might be students clustered within schools.
A mixed model is a type of linear model with multiple predictor variables.
Y ~ B0 + B1 X1 + B2 X2 + ... + E
The coefficients tell you how much the response variable changes for one unit of change in the predictor variables. B0 is the intercept and E is the error term - the amount of variability not explained by your model.
What makes something a mixed model is that the various effect terms (e.g. B1, B2) are a mix of fixed and random effects.
Fixed effects are the same no matter how much or where you sample. For example, sex is a fixed effect because you have your values male, female, and other, and no matter how much data you sample you’ll still have those same factor levels.
Random effects are factors for which the levels are samples from a larger population about which we’re trying to draw conclusions. Individual subject ID, for example, is a random effect because a) you could go out and sample more individuals who would be different from the ones you have and b) you’re inferring something about the whole population from the sub-group included in your analysis.
Another way to think about random effects is that they capture variation that exists but that isn’t relevant to your question, like variation between individual subjects for which you have repeated measures. Where you have non-independence or pseudoreplication, include a random effect for that element even if it is not significant in the model, so that reviewers know you properly accounted for it.
An interaction occurs when one variable changes how another variable affects the response variable. When including an interaction term, be sure to leave in lower order terms for each interaction.
An example from https://www.theanalysisfactor.com/the-difference-between-crossed-and-nested-factors/:
Imagine you have 18 people assigned to two training groups, and their BMI is measured at three time points. There are both crossed and nested factors in this experiment.
Crossed: Subject is crossed with time because every combination of subject and time point is represented by a value of BMI.
Nested: Each subject is assigned to only one training group, and so not every possible combination of subject and group is represented by a value of BMI. By knowing the subject ID, you know exactly which training group they belong to. Thus subject is nested within training group.
For crossed factors, you can look at an interaction term. For nested factors, you can’t.
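This distinction shows up in the model formula. As a rough sketch of the BMI example above (the data frame `bmi.data` and the variables `group`, `subject`, and `time` are hypothetical, since that example is not part of our dataset), nesting is written with a `/` in both packages:

```r
# Hypothetical BMI example: subject nested within training group.
# nlme syntax - nesting written as group/subject:
library(nlme)
m.nested <- lme(BMI ~ time, random = ~1 | group/subject,
                na.action = na.omit, data = bmi.data)

# lme4 syntax for the same nesting structure:
library(lme4)
m.nested2 <- lmer(BMI ~ time + (1 | group/subject), data = bmi.data)
```

A crossed pair of factors (like group and time, both fixed here) can be given an interaction term as usual; a subject-by-group interaction cannot be fit, because each subject appears in only one group.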
Avoid overfitting, which occurs when your model has too many predictor variables for its sample size. An overfitted mixed model will often fail to converge, and you’ll see a warning to that effect.
Categorical variables need to be recoded before analysis, either as numeric dummy codes or as factors (as.factor is one way to do the latter). For example, in this data, sex is already recoded from M/F to 0/1, and socioeconomic status (ses) from low, medium, and high to 0, 1, and 2.
Standardizing your explanatory variables to a mean of zero and standard deviation of one (using the function scale) is also useful. It makes it easier to compare effect sizes because all your estimated coefficients are on the same scale. Centering in this way also helps with interpreting the intercept of your model.
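As a small sketch using variables from this dataset, standardizing with scale might look like:

```r
# Standardize continuous predictors to mean 0, sd 1
data$age.z <- scale(data$age)
data$Pb.z  <- scale(data$Pb)
data$BMI.z <- scale(data$BMI)

# scale() both centers (subtracts the mean) and scales (divides by
# the sd) by default; use scale(x, scale = FALSE) to only center.
```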
Let’s take a look at the data:
str(data)
## 'data.frame': 457 obs. of 8 variables:
## $ IQ : int 99 103 134 71 97 111 100 111 99 110 ...
## $ sex : int 1 1 0 0 0 0 0 1 1 1 ...
## $ age : num 6.67 7 7.17 6.92 8.5 ...
## $ ses : int 1 1 0 1 1 0 1 1 1 1 ...
## $ distcat: int 1 1 1 1 1 1 1 1 1 1 ...
## $ school : int 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : num 17 20 13 16.3 16.5 ...
## $ Pb : num 6.47 5.93 12.96 10.39 9.15 ...
summary(data)
## IQ sex age ses
## Min. : 55 Min. :0.000 Min. :6.167 Min. :0.000
## 1st Qu.: 93 1st Qu.:0.000 1st Qu.:6.667 1st Qu.:1.000
## Median :103 Median :0.000 Median :6.917 Median :1.000
## Mean :102 Mean :0.442 Mean :6.945 Mean :0.831
## 3rd Qu.:112 3rd Qu.:1.000 3rd Qu.:7.167 3rd Qu.:1.000
## Max. :145 Max. :1.000 Max. :8.500 Max. :2.000
## NA's :7 NA's :19
## distcat school BMI Pb
## Min. :0.0000 Min. : 1.000 Min. :12.74 Min. : 1.99
## 1st Qu.:0.0000 1st Qu.: 3.000 1st Qu.:15.26 1st Qu.: 7.60
## Median :1.0000 Median : 7.000 Median :16.28 Median :10.57
## Mean :0.5733 Mean : 7.112 Mean :16.94 Mean :11.98
## 3rd Qu.:1.0000 3rd Qu.:11.000 3rd Qu.:17.83 3rd Qu.:14.30
## Max. :1.0000 Max. :14.000 Max. :29.21 Max. :43.82
## NA's :3 NA's :4
Variables with a mean near 0 were already standardized using the scale() function.
Q: Does the distance category of a school from a foundry, which could potentially be causing lead pollution, reduce the IQ of schoolchildren going to that school?
Just looking at a linear model of data from 14 schools near the foundry, you would think so, but there are some problems with this analysis.
m.linear <- lm(IQ ~ sex + age + ses + Pb + distcat, data = data)
summary(m.linear)
##
## Call:
## lm(formula = IQ ~ sex + age + ses + Pb + distcat, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.368 -9.048 0.653 9.698 37.630
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.6055 13.7790 9.333 < 2e-16 ***
## sex -2.8208 1.4476 -1.949 0.05200 .
## age -3.6602 1.9464 -1.880 0.06073 .
## ses 2.0456 1.3141 1.557 0.12029
## Pb -0.3352 0.1278 -2.623 0.00903 **
## distcat 4.3880 1.6487 2.662 0.00807 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.89 on 429 degrees of freedom
## (22 observations deleted due to missingness)
## Multiple R-squared: 0.08475, Adjusted R-squared: 0.07408
## F-statistic: 7.945 on 5 and 429 DF, p-value: 3.566e-07
anova(m.linear)
## Analysis of Variance Table
##
## Response: IQ
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 921 921.3 4.1559 0.042103 *
## age 1 1107 1106.5 4.9915 0.025986 *
## ses 1 946 945.8 4.2667 0.039465 *
## Pb 1 4262 4261.8 19.2251 1.463e-05 ***
## distcat 1 1570 1570.3 7.0837 0.008071 **
## Residuals 429 95100 221.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This analysis yields incorrect standard errors because it violates the assumption that the residuals are independent. The subjects are grouped into schools in this dataset, and that grouping is not accounted for by a linear model. Note the inflated degrees of freedom for distcat that result from its pseudoreplication.
Because the data are hierarchical, with multiple measurements at multiple levels of organization (e.g. school and student), a mixed model would be better here. We’ll use the nlme package for our models, but the lme4 package has a very similar format.
To estimate the variability in IQ within vs. between schools, we can run a model with only the random effect school. This lets the intercept vary by school.
# using nlme package
m1 <- lme(IQ ~ 1, random = ~1|school, method = "REML", na.action = na.omit, data = data)
summary(m1)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3713.927 3726.249 -1853.964
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 5.290319 14.58551
##
## Fixed effects: IQ ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 102.5002 1.586903 436 64.59134 0
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.48155658 -0.57505992 0.03681583 0.66896768 2.54776677
##
## Number of Observations: 450
## Number of Groups: 14
# using lme4 package
m2 <- lmer(IQ ~ 1 + (1|school), REML = TRUE, na.action = na.omit, data = data)
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: IQ ~ 1 + (1 | school)
## Data: data
##
## REML criterion at convergence: 3707.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.4816 -0.5751 0.0368 0.6690 2.5478
##
## Random effects:
## Groups Name Variance Std.Dev.
## school (Intercept) 27.99 5.29
## Residual 212.74 14.59
## Number of obs: 450, groups: school, 14
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 102.500 1.587 64.59
The 1s used in these models are constants that act as placeholders when you aren’t using covariates in that position. The random covariate to the right of the bar (here, school) allows the intercept to vary with that covariate.
Parameters (like p-value, standard error, etc.) are only going to be estimated for the fixed effects, not the random effects.
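You can still inspect the estimated school-level deviations themselves; a quick sketch using the m1 fit above:

```r
# Estimated random intercepts (one row per school): each school's
# deviation from the overall mean IQ.
ranef(m1)

# Per-school fitted intercepts = overall intercept + deviation:
coef(m1)
```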
Let’s add covariates to our model.
m3 <- lme(IQ ~ sex + age + ses + Pb + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m3)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3575.062 3607.554 -1779.531
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 3.847656 14.49946
##
## Fixed effects: IQ ~ sex + age + ses + Pb + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 128.67968 13.655491 417 9.423292 0.0000
## sex -2.71006 1.421269 417 -1.906792 0.0572
## age -3.63146 1.912447 417 -1.898855 0.0583
## ses 1.94821 1.297026 417 1.502057 0.1338
## Pb -0.30564 0.126659 417 -2.413125 0.0162
## distcat 4.18923 2.646270 12 1.583070 0.1394
## Correlation:
## (Intr) sex age ses Pb
## sex -0.070
## age -0.975 0.011
## ses -0.129 -0.020 0.052
## Pb -0.122 0.080 -0.029 0.041
## distcat -0.151 0.056 0.011 -0.039 0.276
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.38994659 -0.63216092 0.06822743 0.66448634 2.32685110
##
## Number of Observations: 435
## Number of Groups: 14
The estimate for each fixed effect represents the change in the response variable when that fixed effect changes by one unit, with all other fixed effects held constant. The estimate for the intercept gives the value of the response variable when all the other fixed effects are at zero - which is why it’s helpful to have centered your predictors.
Check the estimated variance for the random effect. If it’s indistinguishable from zero, then that random effect doesn’t matter and you can do a regular linear model instead (which has only fixed effects).
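One way to inspect the variance components, sketched here with the m3 fit above (the ICC calculation plugs in the StdDev values reported in summary(m3)):

```r
# Variance components: school-level (Intercept) variance vs. the
# residual (within-school) variance.
VarCorr(m3)

# Intraclass correlation (ICC): the share of total variance that is
# between schools, from the StdDev values in summary(m3).
icc <- 3.847656^2 / (3.847656^2 + 14.49946^2)
icc  # ~0.066, i.e. roughly 7% of the variance is between schools
```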
Also check the correlation between fixed effects (not with the intercept) - if fixed effects are highly correlated, you have a situation called “multicollinearity.” These highly correlated effects should not be included in one model together. Instead, you can make separate models that include either one or the other, and compare the models using model comparison to decide which covariate to keep.
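A quick way to screen for multicollinearity before fitting, sketched with this dataset’s predictors:

```r
# Pairwise correlations among the predictors; values near +/-1
# signal multicollinearity. pairwise.complete.obs handles the NAs.
cor(data[, c("sex", "age", "ses", "Pb", "distcat")],
    use = "pairwise.complete.obs")
```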
What about situations when there is a different relationship between the predictor and response variables across the different levels of the random factor? You can allow each random factor to vary not only in intercept but also in slope.
Here we allow each school to have its own slope of age. This will often improve the model.
m4 <- lme(IQ ~ sex + age + ses + Pb + distcat, random = ~age|school, na.action = na.omit, data = data)
summary(m4)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3577.953 3618.568 -1778.976
##
## Random effects:
## Formula: ~age | school
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## (Intercept) 30.820707 (Intr)
## age 4.132281 -0.994
## Residual 14.429234
##
## Fixed effects: IQ ~ sex + age + ses + Pb + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 133.45693 16.194295 417 8.240984 0.0000
## sex -2.83424 1.420780 417 -1.994849 0.0467
## age -4.26620 2.247452 417 -1.898236 0.0584
## ses 1.82189 1.296208 417 1.405554 0.1606
## Pb -0.30010 0.126330 417 -2.375505 0.0180
## distcat 3.69322 2.575211 12 1.434141 0.1771
## Correlation:
## (Intr) sex age ses Pb
## sex -0.065
## age -0.983 0.016
## ses -0.113 -0.020 0.047
## Pb -0.101 0.075 -0.027 0.045
## distcat -0.131 0.058 0.014 -0.029 0.284
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.43017187 -0.62288526 0.07439453 0.66379549 2.27900486
##
## Number of Observations: 435
## Number of Groups: 14
Check that the model fit is improved:
anova(m3,m4)
## Model df AIC BIC logLik Test L.Ratio p-value
## m3 1 8 3575.062 3607.554 -1779.531
## m4 2 10 3577.953 3618.568 -1778.976 1 vs 2 1.108999 0.5744
No improvement is observed based on a likelihood ratio test. According to the principle of parsimony, the p-value would need to be below 0.05 in order to justify keeping our more complex model.
You can also include interaction terms as fixed effects. When using an interaction term, remember to still include the components of the interaction as individual fixed effects.
m5 <- lme(IQ ~ sex + age + sex * age, random = ~1|school, na.action = na.omit, data = data)
summary(m5)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3701.684 3726.286 -1844.842
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 5.102384 14.51071
##
## Fixed effects: IQ ~ sex + age + sex * age
## Value Std.Error DF t-value p-value
## (Intercept) 132.86985 17.377401 433 7.646129 0.0000
## sex -5.68098 25.657878 433 -0.221413 0.8249
## age -4.20830 2.490453 433 -1.689772 0.0918
## sex:age 0.44793 3.685666 433 0.121533 0.9033
## Correlation:
## (Intr) sex age
## sex -0.676
## age -0.995 0.677
## sex:age 0.676 -0.999 -0.679
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.42321384 -0.60395169 0.06851778 0.67403992 2.41075782
##
## Number of Observations: 450
## Number of Groups: 14
This interaction term is not significant, so it should be removed from the model.
Check whether specific fixed and random effects are needed by comparing models. Start with your most complex model containing all your effects of interest, then drop terms one by one, making a new model each time. Drop the highest-order interaction terms with the highest p-values first.
m6 <- lme(IQ ~ sex + age + ses + distcat + sex * age, random = ~1|school, na.action = na.omit, data = data)
summary(m6)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3599.465 3632.013 -1791.733
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.061579 14.59614
##
## Fixed effects: IQ ~ sex + age + ses + distcat + sex * age
## Value Std.Error DF t-value p-value
## (Intercept) 126.83108 17.718289 420 7.158202 0.0000
## sex -5.93605 26.457664 420 -0.224360 0.8226
## age -4.06239 2.517327 420 -1.613773 0.1073
## ses 1.92242 1.299314 420 1.479569 0.1397
## distcat 6.00745 2.642216 12 2.273642 0.0422
## sex:age 0.49846 3.801605 420 0.131118 0.8957
## Correlation:
## (Intr) sex age ses distct
## sex -0.655
## age -0.991 0.661
## ses -0.110 0.005 0.054
## distcat -0.096 0.004 0.013 -0.049
## sex:age 0.654 -0.999 -0.662 -0.006 -0.002
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.31719015 -0.59891749 0.05048137 0.69607666 2.37667763
##
## Number of Observations: 438
## Number of Groups: 14
m7 <- lme(IQ ~ sex + age + ses + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m7)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3601.99 3630.485 -1793.995
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.054413 14.57974
##
## Fixed effects: IQ ~ sex + age + ses + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 125.31154 13.388306 421 9.359775 0.0000
## sex -2.47208 1.417782 421 -1.743624 0.0820
## age -3.84392 1.884645 421 -2.039599 0.0420
## ses 1.92354 1.297822 421 1.482133 0.1391
## distcat 6.00839 2.638062 12 2.277578 0.0419
## Correlation:
## (Intr) sex age ses
## sex -0.047
## age -0.985 -0.002
## ses -0.140 -0.018 0.067
## distcat -0.124 0.035 0.015 -0.049
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.32311635 -0.60335879 0.04583759 0.69979123 2.37919401
##
## Number of Observations: 438
## Number of Groups: 14
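One caveat when comparing models that differ in their fixed effects (as m6 and m7 do) with a likelihood ratio test: the models should be refit with maximum likelihood (ML), because REML likelihoods are not comparable across different fixed-effect structures. A sketch:

```r
# Refit both models with ML before testing the dropped term
m6.ml <- update(m6, method = "ML")
m7.ml <- update(m7, method = "ML")
anova(m6.ml, m7.ml)  # likelihood ratio test on the sex:age term
```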
m8 <- lme(IQ ~ sex + age + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m8)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3697.751 3722.353 -1842.876
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.187035 14.49864
##
## Fixed effects: IQ ~ sex + age + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 127.57274 12.888083 434 9.898504 0.0000
## sex -2.50206 1.389976 434 -1.800077 0.0725
## age -3.95704 1.826139 434 -2.166889 0.0308
## distcat 6.11356 2.679116 12 2.281933 0.0415
## Correlation:
## (Intr) sex age
## sex -0.042
## age -0.986 -0.011
## distcat -0.139 0.041 0.019
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.33360320 -0.61168680 0.05185314 0.65908991 2.43283910
##
## Number of Observations: 450
## Number of Groups: 14
m9 <- lme(IQ ~ age + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m9)
## Linear mixed-effects model fit by REML
## Data: data
## AIC BIC logLik
## 3701.48 3721.993 -1845.74
##
## Random effects:
## Formula: ~1 | school
## (Intercept) Residual
## StdDev: 4.261853 14.53057
##
## Fixed effects: IQ ~ age + distcat
## Value Std.Error DF t-value p-value
## (Intercept) 126.58946 12.909473 435 9.805935 0.0000
## age -3.99042 1.830186 435 -2.180338 0.0298
## distcat 6.30695 2.713016 12 2.324701 0.0384
## Correlation:
## (Intr) age
## age -0.987
## distcat -0.138 0.020
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -3.4189986 -0.5865094 0.0714204 0.6599321 2.5134284
##
## Number of Observations: 450
## Number of Groups: 14
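Before settling on a final model, it is also worth checking the residuals; a quick sketch using the m9 fit above:

```r
# Residuals vs. fitted values - look for a patternless cloud
plot(m9)

# Normal Q-Q plot of the within-group residuals
qqnorm(resid(m9))
qqline(resid(m9))
```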
Further reading:
https://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
https://gkhajduk.github.io/2017-03-09-mixed-models/
https://www.cscu.cornell.edu/workshops/multilevel.php
The CSCU website is the source for the dataset used in this script.