7  Model Selection

library(tidyr)  #Allows for us to manipulate the data structure
library(data.table)  #Allows for us to manipulate the data structure
library(lme4)  #Allows us to fit mixed effects models
library(lattice)  #for plotting random effects
library(nlme)

7.1 Comparing Mean Models

In this section, we will apply the backward selection method to

Mod_base: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{T^2}T^2 + \beta_{trt:T}trt*T+\beta_{trt:T^2}trt*T^2\).

# start off with baseline model including all terms
mod_base_unstruc <- gls(measurements ~ factor(group) + time + I(time^2) + factor(group):(time +
    I(time^2)), data = long_TLC, corr = corSymm(form = ~1 | id), weights = varIdent(form = ~1 |
    time), method = "REML")

summary(mod_base_unstruc)
Generalized least squares fit by REML
  Model: measurements ~ factor(group) + time + I(time^2) + factor(group):(time +      I(time^2)) 
  Data: long_TLC 
       AIC     BIC    logLik
  2551.458 2615.08 -1259.729

Correlation Structure: General
 Formula: ~1 | id 
 Parameter estimate(s):
 Correlation: 
  1     2     3    
2 0.383            
3 0.554 0.674      
4 0.479 0.645 0.561
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | time 
 Parameter estimates:
       0        1        4        6 
1.000000 1.559260 1.353064 1.599957 

Coefficients:
                                     Value Std.Error  t-value p-value
(Intercept)                      26.131925 0.7026920 37.18830  0.0000
factor(group)Treatment           -0.780544 0.9937566 -0.78545  0.4327
time                             -0.833454 0.5657092 -1.47329  0.1415
I(time^2)                         0.082735 0.0983326  0.84138  0.4006
factor(group)Treatment:time      -5.996162 0.8000336 -7.49489  0.0000
factor(group)Treatment:I(time^2)  1.037344 0.1390633  7.45951  0.0000

 Correlation: 
                                 (Intr) fct()T time   I(t^2) fc()T:
factor(group)Treatment           -0.707                            
time                             -0.169  0.119                     
I(time^2)                         0.156 -0.110 -0.976              
factor(group)Treatment:time       0.119 -0.169 -0.707  0.690       
factor(group)Treatment:I(time^2) -0.110  0.156  0.690 -0.707 -0.976

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.5279798 -0.7879764 -0.2643423  0.4368118  4.8117601 

Residual standard error: 5.092279 
Degrees of freedom: 400 total; 394 residual
# If there were higher order terms that were significant we would remove them,
# but in this case both interaction terms are significant so we want to keep
# them in the model

# However, suppose the quadratic terms were not significant, lets remove them
# here in this example
mod_second_unstruc <- gls(measurements ~ factor(group) + time + factor(group):(time),
    data = long_TLC, corr = corSymm(form = ~1 | id), weights = varIdent(form = ~1 |
        time), method = "REML")

summary(mod_second_unstruc)
Generalized least squares fit by REML
  Model: measurements ~ factor(group) + time + factor(group):(time) 
  Data: long_TLC 
       AIC      BIC    logLik
  2586.646 2642.386 -1279.323

Correlation Structure: General
 Formula: ~1 | id 
 Parameter estimate(s):
 Correlation: 
  1     2     3    
2 0.132            
3 0.271 0.846      
4 0.543 0.406 0.495
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | time 
 Parameter estimates:
       0        1        4        6 
1.000000 1.833984 1.551842 1.439514 

Coefficients:
                                Value Std.Error  t-value p-value
(Intercept)                 26.039582 0.6940593 37.51780  0.0000
factor(group)Treatment      -1.938410 0.9815481 -1.97485  0.0490
time                        -0.368743 0.1223416 -3.01404  0.0027
factor(group)Treatment:time -0.169559 0.1730172 -0.98001  0.3277

 Correlation: 
                            (Intr) fct()T time  
factor(group)Treatment      -0.707              
time                        -0.077  0.054       
factor(group)Treatment:time  0.054 -0.077 -0.707

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.2991338 -0.7973523 -0.2854752  0.3570825  5.6284768 

Residual standard error: 5.310688 
Degrees of freedom: 400 total; 396 residual

Note that all terms are now significant, including the interaction term. We used the backwards selection method to obtain the “best” model. Suppose we wanted to compare the mean models of mod_base_unstruc and mod_second_unstruc. We can compare model means using anova()

anova(mod_base_unstruc, mod_second_unstruc)
                   Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod_base_unstruc       1 16 2551.458 2615.080 -1259.729                        
mod_second_unstruc     2 14 2586.646 2642.386 -1279.323 1 vs 2 39.18782  <.0001

Based on these results, which is the best model?

Answer: The model called “mod_base_unstruc” is the best model which is consistent with our previous results

7.2 Comparing Covariance Models

In previous section, we fit three models on our data using various covariance structure below:

# unstructured covariance
mod_1_unstruc <- gls(measurements ~ factor(group) + factor(level) + factor(group):factor(level),
    data = long_TLC, corr = corSymm(form = ~1 | id), weights = varIdent(form = ~1 |
        as.factor(level)), method = "REML")

# exchangeable (constant variances)
mod_1_exch_const <- gls(measurements ~ factor(group) + factor(level) + factor(group):factor(level),
    data = long_TLC, corr = corCompSymm(form = ~factor(level) | id), weights = varIdent(form = ~1),
    method = "REML")

# exchangeable (heterogeneous variances)
mod_1_exch_heter <- gls(measurements ~ factor(group) + factor(level) + factor(group):factor(level),
    data = long_TLC, corr = corCompSymm(form = ~factor(level) | id), weights = varIdent(form = ~1 |
        factor(level)), method = "REML")

We can compare these models using anova.

# compare three models together
anova(mod_1_unstruc, mod_1_exch_const, mod_1_exch_heter)
                 Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod_1_unstruc        1 18 2452.076 2523.559 -1208.038                        
mod_1_exch_const     2 10 2480.621 2520.334 -1230.311 1 vs 2 44.54508  <.0001
mod_1_exch_heter     3 13 2459.960 2511.587 -1216.980 2 vs 3 26.66058  <.0001
# or compare two models at a time
anova(mod_1_unstruc, mod_1_exch_const)
                 Model df      AIC      BIC    logLik   Test  L.Ratio p-value
mod_1_unstruc        1 18 2452.076 2523.559 -1208.038                        
mod_1_exch_const     2 10 2480.621 2520.334 -1230.311 1 vs 2 44.54508  <.0001
anova(mod_1_exch_const, mod_1_exch_heter)
                 Model df      AIC      BIC   logLik   Test  L.Ratio p-value
mod_1_exch_const     1 10 2480.621 2520.334 -1230.31                        
mod_1_exch_heter     2 13 2459.960 2511.587 -1216.98 1 vs 2 26.66058  <.0001

For model selection, we want to compare AIC and BIC. The model with the unstructured covariance (e.g., mod_1_unstruc) has the lowest AIC. The model with the exchangeable (heterogeneous variances) (e.g., mod_1_exch_heter) has the lowest BIC. These results suggest that the exchangeable (heterogeneous variances) fits the data best since it is more parsimonious and has low information. Based on the likelihood ratio test, the mod_1_exch_heter model is a significantly better fit than the other two models.