library(tidyr) #Allows for us to manipulate the data structure
library(data.table) #Allows for us to manipulate the data structure
library(lme4) #Allows us to fit mixed effects models
library(lattice) #for plotting random effects
library(nlme)
7 Model Selection
7.1 Comparing Mean Models
In this section, we will apply the backward selection method to
Mod_base: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{T^2}T^2 + \beta_{trt:T}trt*T+\beta_{trt:T^2}trt*T^2\).
# start off with baseline model including all terms
<- gls(measurements ~ factor(group) + time + I(time^2) + factor(group):(time +
mod_base_unstruc I(time^2)), data = long_TLC, corr = corSymm(form = ~1 | id), weights = varIdent(form = ~1 |
method = "REML")
time),
summary(mod_base_unstruc)
Generalized least squares fit by REML
Model: measurements ~ factor(group) + time + I(time^2) + factor(group):(time + I(time^2))
Data: long_TLC
AIC BIC logLik
2551.458 2615.08 -1259.729
Correlation Structure: General
Formula: ~1 | id
Parameter estimate(s):
Correlation:
1 2 3
2 0.383
3 0.554 0.674
4 0.479 0.645 0.561
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | time
Parameter estimates:
0 1 4 6
1.000000 1.559260 1.353064 1.599957
Coefficients:
Value Std.Error t-value p-value
(Intercept) 26.131925 0.7026920 37.18830 0.0000
factor(group)Treatment -0.780544 0.9937566 -0.78545 0.4327
time -0.833454 0.5657092 -1.47329 0.1415
I(time^2) 0.082735 0.0983326 0.84138 0.4006
factor(group)Treatment:time -5.996162 0.8000336 -7.49489 0.0000
factor(group)Treatment:I(time^2) 1.037344 0.1390633 7.45951 0.0000
Correlation:
(Intr) fct()T time I(t^2) fc()T:
factor(group)Treatment -0.707
time -0.169 0.119
I(time^2) 0.156 -0.110 -0.976
factor(group)Treatment:time 0.119 -0.169 -0.707 0.690
factor(group)Treatment:I(time^2) -0.110 0.156 0.690 -0.707 -0.976
Standardized residuals:
Min Q1 Med Q3 Max
-2.5279798 -0.7879764 -0.2643423 0.4368118 4.8117601
Residual standard error: 5.092279
Degrees of freedom: 400 total; 394 residual
# If there were higher order terms that were significant we would remove them,
# but in this case both interaction terms are significant so we want to keep
# them in the model
# However, suppose the quadratic terms were not significant, lets remove them
# here in this example
<- gls(measurements ~ factor(group) + time + factor(group):(time),
mod_second_unstruc data = long_TLC, corr = corSymm(form = ~1 | id), weights = varIdent(form = ~1 |
method = "REML")
time),
summary(mod_second_unstruc)
Generalized least squares fit by REML
Model: measurements ~ factor(group) + time + factor(group):(time)
Data: long_TLC
AIC BIC logLik
2586.646 2642.386 -1279.323
Correlation Structure: General
Formula: ~1 | id
Parameter estimate(s):
Correlation:
1 2 3
2 0.132
3 0.271 0.846
4 0.543 0.406 0.495
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | time
Parameter estimates:
0 1 4 6
1.000000 1.833984 1.551842 1.439514
Coefficients:
Value Std.Error t-value p-value
(Intercept) 26.039582 0.6940593 37.51780 0.0000
factor(group)Treatment -1.938410 0.9815481 -1.97485 0.0490
time -0.368743 0.1223416 -3.01404 0.0027
factor(group)Treatment:time -0.169559 0.1730172 -0.98001 0.3277
Correlation:
(Intr) fct()T time
factor(group)Treatment -0.707
time -0.077 0.054
factor(group)Treatment:time 0.054 -0.077 -0.707
Standardized residuals:
Min Q1 Med Q3 Max
-2.2991338 -0.7973523 -0.2854752 0.3570825 5.6284768
Residual standard error: 5.310688
Degrees of freedom: 400 total; 396 residual
Note that all terms are now significant, including the interaction term. We used the backwards selection method to obtain the “best” model. Suppose we wanted to compare the mean models of mod_base_unstruc and mod_second_unstruc. We can compare model means using anova()
anova(mod_base_unstruc, mod_second_unstruc)
Model df AIC BIC logLik Test L.Ratio p-value
mod_base_unstruc 1 16 2551.458 2615.080 -1259.729
mod_second_unstruc 2 14 2586.646 2642.386 -1279.323 1 vs 2 39.18782 <.0001
Based on these results, which is the best model?
Answer: The model called “mod_base_unstruc” is the best model which is consistent with our previous results
7.2 Comparing Covariance Models
In previous section, we fit three models on our data using various covariance structure below:
# unstructured covariance
<- gls(measurements ~ factor(group) + factor(level) + factor(group):factor(level),
mod_1_unstruc data = long_TLC, corr = corSymm(form = ~1 | id), weights = varIdent(form = ~1 |
as.factor(level)), method = "REML")
# exchangeable (constant variances)
<- gls(measurements ~ factor(group) + factor(level) + factor(group):factor(level),
mod_1_exch_const data = long_TLC, corr = corCompSymm(form = ~factor(level) | id), weights = varIdent(form = ~1),
method = "REML")
# exchangeable (heterogeneous variances)
<- gls(measurements ~ factor(group) + factor(level) + factor(group):factor(level),
mod_1_exch_heter data = long_TLC, corr = corCompSymm(form = ~factor(level) | id), weights = varIdent(form = ~1 |
factor(level)), method = "REML")
We can compare these models using anova
.
# compare three models together
anova(mod_1_unstruc, mod_1_exch_const, mod_1_exch_heter)
Model df AIC BIC logLik Test L.Ratio p-value
mod_1_unstruc 1 18 2452.076 2523.559 -1208.038
mod_1_exch_const 2 10 2480.621 2520.334 -1230.311 1 vs 2 44.54508 <.0001
mod_1_exch_heter 3 13 2459.960 2511.587 -1216.980 2 vs 3 26.66058 <.0001
# or compare two models at a time
anova(mod_1_unstruc, mod_1_exch_const)
Model df AIC BIC logLik Test L.Ratio p-value
mod_1_unstruc 1 18 2452.076 2523.559 -1208.038
mod_1_exch_const 2 10 2480.621 2520.334 -1230.311 1 vs 2 44.54508 <.0001
anova(mod_1_exch_const, mod_1_exch_heter)
Model df AIC BIC logLik Test L.Ratio p-value
mod_1_exch_const 1 10 2480.621 2520.334 -1230.31
mod_1_exch_heter 2 13 2459.960 2511.587 -1216.98 1 vs 2 26.66058 <.0001
For model selection, we want to compare AIC and BIC. The model with the unstructured covariance (e.g., mod_1_unstruc) has the lowest AIC. The model with the exchangeable (heterogeneous variances) (e.g., mod_1_exch_heter) has the lowest BIC. These results suggest that the exchangeable (heterogeneous variances) fits the data best since it is more parsimonious and has low information. Based on the likelihood ratio test, the mod_1_exch_heter model is a significantly better fit than the other two models.