2  Marginal Model (GEE)

2.1 Introduction

For this section we will return to the TLC dataset with N = 100 participants in order to better understand marginal models for longitudinal data when the outcome is continuous.

2.2 Required Packages

We will use the following packages:

  • nlme: This package is for fitting and comparing Gaussian linear and nonlinear mixed-effects models. The variance-covariance structures for the residuals can be specified, and it is useful for data with repeated measures or longitudinal designs. We will use the gls() function to fit a linear model using generalized least squares.

  • lme4: This package is for fitting mixed-effects models. We will use the lmer() function to fit a linear model.

These packages are for applying the generalized estimating equations (GEE) approach for fitting marginal generalized linear models to data with repeated measures or longitudinal designs:

  • gee: This is the “Generalized Estimation Equation Solver” package.

  • gpack: This is the “Generalized Estimating Equation Package” package.

  • geeM: This is the “Solve Generalized Estimating Equations” package.

library(tidyr)  #Allows for us to manipulate the data structure
library(data.table)  #Allows for us to manipulate the data structure
library(ggplot2)  #this makes better looking plots in R

# new packages for models
library(nlme)  #used for the gls() function
library(gee)  # install.packages('gee'), used for the gee() function
library(geepack)  # install.packages('geepack'), used for the geeglm() function
library(geeM)  # install.packages('geeM'), used for the geeglm() function
theme_set(theme_minimal())

2.3 Constructing Marginal Models

Consider a continuous outcome \(Y\) which indicates a patient’s blood lead level (in micrograms per deciLiter \(\mu\)g/dL). The correct way to write Y when doing a model like this would be as \(Y_{ij}\). \(Y_{ij}\) is the lead blood level of patient i at time point j. Likewise the covariates \(X = (1,X^1, \dots,X^p)\) would be as \(X_{ij} = (1,X^1_{ij},\dots X^p_{ij})\), the p covariates for individual i measured at time point j. 1 is simply the intercept. The superscript here just represents the \(k-th\) covariate in the analysis (i.e. in other words we have p covariates).

In these different modelling strategies we always specify a link between the mean of our distribution and the data, \(E(y_{ij} \mid X_{ij}) = \mu_{ij} = g^{-1}(X^T_{ij}\beta)\). Here \(g^{-1}\) is what is commonly referred to as a link function. It is exactly what it sounds like. A function which links the mean of your distribution to your data. There are many common link functions that you’ve probably already dealt with in other regression courses. For example there is the logit link function \(g(\mu) =log(\frac{\mu}{1-\mu}) = X^T\beta\) that one uses when doing logistic function.

When one is doing linear regression a common link function is just the identity link \(g(\mu) =\mu = X^T\beta\).

Here we introduce a few functions and corresponding packages to fit these models:

1. Using the gls() function to fit a linear model using generalized least squares

The syntax for the function gls() in nlme package is

gls(model, data, correlation, weights, subset, method, na.action, control, verbose)

  • Description
    • model: A two-sided linear formula object describing the model
    • data: Optional dataframe
    • correlation: See description for details.
      • The corStruct object describes the within-group correlation structure.
      • Note we use corCompSymm which has the following syntax: corCompSymm(value, form, fixed) compound symmetry structure corresponding to a constant correlation.
    • weights: See description for details.
      • Describes the variance function.
      • Note we use varClasses which defines standard classes of variance function structures and use varIdent which describes constant variance(s), that is generally used to allow different variances.

We consider the first model discussed in lecture. This model includes time as a categorical covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.

Model 1: \[E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T_1}T_1 + \beta_{T_4}T_4 + \beta_{T_6}T_6 + \] \[\beta_{trt:T_1}trt*T_1+ \beta_{trt:T_4}trt*T_4 + \beta_{trt:T_6}trt*T_6\]

# Model fit with compound symmetry heterogeneous variances

# Model 1: Response Profiles (time T categorical)
fit.compsym <- gls(measurements ~ factor(group) * factor(level), data = long_TLC,
    corr = corCompSymm(, form = ~factor(level) | id), weights = varIdent(form = ~1 |
        factor(level)))
summary(fit.compsym)
Generalized least squares fit by REML
  Model: measurements ~ factor(group) * factor(level) 
  Data: long_TLC 
      AIC      BIC   logLik
  2459.96 2511.587 -1216.98

Correlation Structure: Compound symmetry
 Formula: ~factor(level) | id 
 Parameter estimate(s):
      Rho 
0.6102708 
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | factor(level) 
 Parameter estimates:
   lead0    lead1    lead4    lead6 
1.000000 1.279672 1.323197 1.519220 

Coefficients:
                                            Value Std.Error   t-value p-value
(Intercept)                                26.272 0.7237951  36.29757  0.0000
factor(group)Treatment                      0.268 1.0236008   0.26182  0.7936
factor(level)lead1                         -1.612 0.7506796  -2.14739  0.0324
factor(level)lead4                         -2.202 0.7713883  -2.85459  0.0045
factor(level)lead6                         -2.626 0.8726932  -3.00908  0.0028
factor(group)Treatment:factor(level)lead1 -11.406 1.0616213 -10.74394  0.0000
factor(group)Treatment:factor(level)lead4  -8.824 1.0909078  -8.08868  0.0000
factor(group)Treatment:factor(level)lead6  -3.152 1.2341746  -2.55393  0.0110

 Correlation: 
                                          (Intr) fct()T fct()1 fct()4 fct()6
factor(group)Treatment                    -0.707                            
factor(level)lead1                        -0.211  0.149                     
factor(level)lead4                        -0.181  0.128  0.402              
factor(level)lead6                        -0.060  0.043  0.383  0.383       
factor(group)Treatment:factor(level)lead1  0.149 -0.211 -0.707 -0.285 -0.270
factor(group)Treatment:factor(level)lead4  0.128 -0.181 -0.285 -0.707 -0.271
factor(group)Treatment:factor(level)lead6  0.043 -0.060 -0.270 -0.271 -0.707
                                          f()T:()1 f()T:()4
factor(group)Treatment                                     
factor(level)lead1                                         
factor(level)lead4                                         
factor(level)lead6                                         
factor(group)Treatment:factor(level)lead1                  
factor(group)Treatment:factor(level)lead4  0.402           
factor(group)Treatment:factor(level)lead6  0.383    0.383  

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-2.1429199 -0.6927789 -0.1528885  0.5263121  5.5480303 

Residual standard error: 5.118004 
Degrees of freedom: 400 total; 392 residual
anova(fit.compsym)
Denom. DF: 392 
                            numDF   F-value p-value
(Intercept)                     1 2438.1366  <.0001
factor(group)                   1    7.1161   0.008
factor(level)                   3   80.5490  <.0001
factor(group):factor(level)     3   46.9288  <.0001

2. Next, we discuss Marginal Models (GEE) and three R Packages

1. Using the gee package

The syntax for the function gee() in gee package is

gee(formula, id, data, family = gaussian, constr = ‘independence’,Mv)

  • Using the GEE Package
    • formula: Symbolic description of the model to be fitted
    • family: Description of the error distribution and link function
    • data: Optional dataframe
    • id: Vector that identifies the clusters
    • constr: Working correlation structure: “independence”, “exchangeable”, “AR-M”, “unstructured”
    • Mv: order of AR correlation (AR1: Mv = 1)

Consider the same model 1 defined earlier:

# Model 1: Response Profiles (time T categorical)
mod_tlc_gee <- gee(measurements ~ level + group + level * group, id = id, data = long_TLC,
    family = gaussian(link = "identity"), corstr = "unstructured")
              (Intercept)                levellead1                levellead4 
                   26.272                    -1.612                    -2.202 
               levellead6            groupTreatment levellead1:groupTreatment 
                   -2.626                     0.268                   -11.406 
levellead4:groupTreatment levellead6:groupTreatment 
                   -8.824                    -3.152 
summary(mod_tlc_gee)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Identity 
 Variance to Mean Relation: Gaussian 
 Correlation Structure:     Unstructured 

Call:
gee(formula = measurements ~ level + group + level * group, id = id, 
    data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")

Summary of Residuals:
        Min          1Q      Median          3Q         Max 
-16.6620002  -4.6205000  -0.9930005   3.6724993  43.1380019 


Coefficients:
                          Estimate Naive S.E.    Naive z Robust S.E.
(Intercept)                 26.272  0.9370175 28.0378980   0.7033749
levellead1                  -1.612  0.9958441 -1.6187272   0.4330324
levellead4                  -2.202  0.9838821 -2.2380730   0.4386752
levellead6                  -2.626  0.9316319 -2.8187098   0.5278091
groupTreatment               0.268  1.3251428  0.2022423   0.9944085
levellead1:groupTreatment  -11.406  1.4083363 -8.0989180   1.1086833
levellead4:groupTreatment   -8.824  1.3914194 -6.3417258   1.1408849
levellead6:groupTreatment   -3.152  1.3175265 -2.3923617   1.2439296
                             Robust z
(Intercept)                37.3513450
levellead1                 -3.7225849
levellead4                 -5.0196593
levellead6                 -4.9752836
groupTreatment              0.2695069
levellead1:groupTreatment -10.2878795
levellead4:groupTreatment  -7.7343471
levellead6:groupTreatment  -2.5339053

Estimated Scale Parameter:  43.90009
Number of Iterations:  1

Working Correlation
          [,1]      [,2]      [,3]      [,4]
[1,] 1.0000000 0.4352485 0.4487346 0.5057310
[2,] 0.4352485 1.0000000 0.8094551 0.6759677
[3,] 0.4487346 0.8094551 1.0000000 0.6975035
[4,] 0.5057310 0.6759677 0.6975035 1.0000000

2. Using the geepack package

The syntax for the function geeglm() in geepack package is

geeglm(formula, family = gaussian, data, id, zcor = NULL, constr, std.err = ‘san.se’)

  • Using the geepack package
    • formula: Symbolic description of the model to be fitted
    • family: Description of the error distribution and link function
    • data: Optional dataframe
    • id: Vector that identifies the clusters
    • contr: A character string specifying the correlation structure. The following are permitted: “independence”, “exchangeable”, “ar1”, “unstructured” and “userdefined”
    • zcor: Enter a user defined correlation structure

Consider the same model 1 defined earlier

# Model 1: Response Profiles (time T categorical)
mod_tlc_geep <- geeglm(measurements ~ level + group + level * group, id = id, data = long_TLC,
    family = gaussian(link = "identity"), corstr = "unstructured")

summary(mod_tlc_geep)

Call:
geeglm(formula = measurements ~ level + group + level * group, 
    family = gaussian(link = "identity"), data = long_TLC, id = id, 
    corstr = "unstructured")

 Coefficients:
                          Estimate  Std.err     Wald Pr(>|W|)    
(Intercept)                26.2720   0.7034 1395.123  < 2e-16 ***
levellead1                 -1.6120   0.4330   13.858 0.000197 ***
levellead4                 -2.2020   0.4387   25.197 5.18e-07 ***
levellead6                 -2.6260   0.5278   24.753 6.52e-07 ***
groupTreatment              0.2680   0.9944    0.073 0.787540    
levellead1:groupTreatment -11.4060   1.1087  105.840  < 2e-16 ***
levellead4:groupTreatment  -8.8240   1.1409   59.820 1.04e-14 ***
levellead6:groupTreatment  -3.1520   1.2439    6.421 0.011280 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)    43.02   6.583
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2   0.4352 0.07616
alpha.1:3   0.4487 0.08482
alpha.1:4   0.5057 0.05768
alpha.2:3   0.8095 0.11232
alpha.2:4   0.6760 0.10359
alpha.3:4   0.6975 0.13711
Number of clusters:   100  Maximum cluster size: 4 

We can use the QIC() function from the geepack package.

QIC(mod_tlc_geep)
      QIC      QICu Quasi Lik       CIC    params      QICC 
    17225     17225     -8604         8         8     17230 

3. Using the geeM package

The syntax for the function geem() in geeM package is

geem(formula, id, waves = NULL, data, family = gaussian, constr = “independence”, Mv = 1)

  • Using the geeM package
    • formula: Symbolic description of the model to be fitted
    • id: Vector that identifies the clusters
    • data: Optional dataframe
    • family: Description of the error distribution and link function
    • constr: A character string specifying the correlation structure. Allowed structures are: “independence”, “exchangeable” (equal correlation), “ar1” (exponential decay), “m-dependent” (m-diagonal), “unstructured”, “fixed”, and “userdefined”.
    • Mv: for “m-dependent”, the value for m.

Consider the same model 1 defined earlier

# Model 1: Response Profiles (time T categorical)
mod_tlc_geeM <- geem(measurements ~ level + group + level * group, id = id, data = long_TLC,
    family = gaussian(link = "identity"), corstr = "unstructured")

summary(mod_tlc_geeM)
                          Estimates Model SE Robust SE     wald         p
(Intercept)                  26.270    1.143    0.7034  37.3500 0.000e+00
levellead1                   -1.612    1.543    0.4330  -3.7230 1.972e-04
levellead4                   -2.202    1.487    0.4387  -5.0200 5.200e-07
levellead6                   -2.626    1.270    0.5278  -4.9750 6.500e-07
groupTreatment                0.268    1.617    0.9944   0.2695 7.875e-01
levellead1:groupTreatment   -11.410    2.182    1.1090 -10.2900 0.000e+00
levellead4:groupTreatment    -8.824    2.103    1.1410  -7.7340 0.000e+00
levellead6:groupTreatment    -3.152    1.795    1.2440  -2.5340 1.128e-02

 Working Correlation: 
       [,1]   [,2]   [,3]   [,4]
[1,] 1.0000 0.0893 0.1546 0.3835
[2,] 0.0893 1.0000 1.0716 0.6018
[3,] 0.1546 1.0716 1.0000 0.5901
[4,] 0.3835 0.6018 0.5901 1.0000
 Correlation Structure:  unstructured 
 Est. Scale Parameter:  65.36 

 Number of GEE iterations: 1 
 Number of Clusters:  100    Maximum Cluster Size:  4 
 Number of observations with nonzero weight:  400 

2.4 Response Profile Analysis

Model 1: \[E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T_1}T_1 + \beta_{T_4}T_4 + \beta_{T_6}T_6 + \] \[\beta_{trt:T_1}trt*T_1+ \beta_{trt:T_4}trt*T_4 + \beta_{trt:T_6}trt*T_6\]

This model includes time as a categorical covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.

Using the GLS approach, we can fit a model to include time, treatment, and their interaction as categorical variables (use dummy variables). What is your overall conclusion?

# Recall the example above from the nlme package

# Model 1: Response Profiles (time T categorical)
resprof_gls_mod_1 <- gls(measurements ~ factor(group) * factor(level), data = long_TLC,
    corr = corCompSymm(, form = ~factor(level) | id), weights = varIdent(form = ~1 |
        factor(level)))

# Print summary
summary(resprof_gls_mod_1)
Generalized least squares fit by REML
  Model: measurements ~ factor(group) * factor(level) 
  Data: long_TLC 
   AIC  BIC logLik
  2460 2512  -1217

Correlation Structure: Compound symmetry
 Formula: ~factor(level) | id 
 Parameter estimate(s):
   Rho 
0.6103 
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | factor(level) 
 Parameter estimates:
lead0 lead1 lead4 lead6 
1.000 1.280 1.323 1.519 

Coefficients:
                                            Value Std.Error t-value p-value
(Intercept)                                26.272    0.7238   36.30  0.0000
factor(group)Treatment                      0.268    1.0236    0.26  0.7936
factor(level)lead1                         -1.612    0.7507   -2.15  0.0324
factor(level)lead4                         -2.202    0.7714   -2.85  0.0045
factor(level)lead6                         -2.626    0.8727   -3.01  0.0028
factor(group)Treatment:factor(level)lead1 -11.406    1.0616  -10.74  0.0000
factor(group)Treatment:factor(level)lead4  -8.824    1.0909   -8.09  0.0000
factor(group)Treatment:factor(level)lead6  -3.152    1.2342   -2.55  0.0110

 Correlation: 
                                          (Intr) fct()T fct()1 fct()4 fct()6
factor(group)Treatment                    -0.707                            
factor(level)lead1                        -0.211  0.149                     
factor(level)lead4                        -0.181  0.128  0.402              
factor(level)lead6                        -0.060  0.043  0.383  0.383       
factor(group)Treatment:factor(level)lead1  0.149 -0.211 -0.707 -0.285 -0.270
factor(group)Treatment:factor(level)lead4  0.128 -0.181 -0.285 -0.707 -0.271
factor(group)Treatment:factor(level)lead6  0.043 -0.060 -0.270 -0.271 -0.707
                                          f()T:()1 f()T:()4
factor(group)Treatment                                     
factor(level)lead1                                         
factor(level)lead4                                         
factor(level)lead6                                         
factor(group)Treatment:factor(level)lead1                  
factor(group)Treatment:factor(level)lead4  0.402           
factor(group)Treatment:factor(level)lead6  0.383    0.383  

Standardized residuals:
    Min      Q1     Med      Q3     Max 
-2.1429 -0.6928 -0.1529  0.5263  5.5480 

Residual standard error: 5.118 
Degrees of freedom: 400 total; 392 residual

Using the GEE approach, we can fit a model to include time, treatment, and their interaction as categorical variables (use dummy variables). Assume unstructured covariance model. Compare model estimates, standard errors, and covariance matrices with those from the methods you used in the previous question.

model_1 <- geeglm(measurements ~ group + as.factor(time) + as.factor(time) * group,
    id = id, data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
summary(model_1)

Call:
geeglm(formula = measurements ~ group + as.factor(time) + as.factor(time) * 
    group, family = gaussian(link = "identity"), data = long_TLC, 
    id = id, corstr = "unstructured")

 Coefficients:
                                Estimate Std.err    Wald Pr(>|W|)    
(Intercept)                       26.272   0.703 1395.12  < 2e-16 ***
groupTreatment                     0.268   0.994    0.07   0.7875    
as.factor(time)1                  -1.612   0.433   13.86   0.0002 ***
as.factor(time)4                  -2.202   0.439   25.20  5.2e-07 ***
as.factor(time)6                  -2.626   0.528   24.75  6.5e-07 ***
groupTreatment:as.factor(time)1  -11.406   1.109  105.84  < 2e-16 ***
groupTreatment:as.factor(time)4   -8.824   1.141   59.82  1.0e-14 ***
groupTreatment:as.factor(time)6   -3.152   1.244    6.42   0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)       43    6.58
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2    0.435  0.0762
alpha.1:3    0.449  0.0848
alpha.1:4    0.506  0.0577
alpha.2:3    0.809  0.1123
alpha.2:4    0.676  0.1036
alpha.3:4    0.698  0.1371
Number of clusters:   100  Maximum cluster size: 4 
# Recall the example above from the geepack package

# Model 1: Response Profiles (time T categorical)
resprof_gee_mod_1 <- geeglm(measurements ~ level + group + level * group, id = id,
    data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")

# Print summary
summary(resprof_gee_mod_1)

Call:
geeglm(formula = measurements ~ level + group + level * group, 
    family = gaussian(link = "identity"), data = long_TLC, id = id, 
    corstr = "unstructured")

 Coefficients:
                          Estimate Std.err    Wald Pr(>|W|)    
(Intercept)                 26.272   0.703 1395.12  < 2e-16 ***
levellead1                  -1.612   0.433   13.86   0.0002 ***
levellead4                  -2.202   0.439   25.20  5.2e-07 ***
levellead6                  -2.626   0.528   24.75  6.5e-07 ***
groupTreatment               0.268   0.994    0.07   0.7875    
levellead1:groupTreatment  -11.406   1.109  105.84  < 2e-16 ***
levellead4:groupTreatment   -8.824   1.141   59.82  1.0e-14 ***
levellead6:groupTreatment   -3.152   1.244    6.42   0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)       43    6.58
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2    0.435  0.0762
alpha.1:3    0.449  0.0848
alpha.1:4    0.506  0.0577
alpha.2:3    0.809  0.1123
alpha.2:4    0.676  0.1036
alpha.3:4    0.698  0.1371
Number of clusters:   100  Maximum cluster size: 4 
# Print QIC
QIC(resprof_gee_mod_1)
      QIC      QICu Quasi Lik       CIC    params      QICC 
    17225     17225     -8604         8         8     17230 

2.5 Parametric Curves

# create new variable to make time continuous
long_TLC$week = 0  #baseline, week 0 
long_TLC$week[long_TLC$level == "lead1"] = 1  #week 1
long_TLC$week[long_TLC$level == "lead4"] = 4  #week 4
long_TLC$week[long_TLC$level == "lead6"] = 6  #week 6
parcurv_mod_2 <- gls(measurements ~ factor(group) + week, data = long_TLC, corr = corCompSymm(,
    form = ~week | id), weights = varIdent(form = ~1 | week), method = "REML")

# Print summary
summary(parcurv_mod_2)
Generalized least squares fit by REML
  Model: measurements ~ factor(group) + week 
  Data: long_TLC 
   AIC  BIC logLik
  2663 2695  -1323

Correlation Structure: Compound symmetry
 Formula: ~week | id 
 Parameter estimate(s):
  Rho 
0.659 
Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | week 
 Parameter estimates:
   0    1    4    6 
1.00 2.22 1.83 1.54 

Coefficients:
                       Value Std.Error t-value p-value
(Intercept)            26.53     0.742    35.8  0.0000
factor(group)Treatment  2.30     1.049     2.2  0.0291
week                   -0.58     0.098    -5.9  0.0000

 Correlation: 
                       (Intr) fct()T
factor(group)Treatment -0.707       
week                   -0.035  0.000

Standardized residuals:
    Min      Q1     Med      Q3     Max 
-2.5179 -1.0636 -0.3993  0.0715  4.5654 

Residual standard error: 5.48 
Degrees of freedom: 400 total; 397 residual

2.6 More Marginal Models

Model 1: time is categorical, main effects, interaction terms

Model 1: \[E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T_1}T_1 + \beta_{T_4}T_4 + \beta_{T_6}T_6 + \beta_{trt:T_1}trt*T_1+ \] \[\beta_{trt:T_4}trt*T_4 + \beta_{trt:T_6}trt*T_6\]

This model includes time as a categorical covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.

model_1 <- geeglm(measurements ~ group + as.factor(time) + as.factor(time) * group,
    id = id, data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")

# Print summary
summary(model_1)

Call:
geeglm(formula = measurements ~ group + as.factor(time) + as.factor(time) * 
    group, family = gaussian(link = "identity"), data = long_TLC, 
    id = id, corstr = "unstructured")

 Coefficients:
                                Estimate Std.err    Wald Pr(>|W|)    
(Intercept)                       26.272   0.703 1395.12  < 2e-16 ***
groupTreatment                     0.268   0.994    0.07   0.7875    
as.factor(time)1                  -1.612   0.433   13.86   0.0002 ***
as.factor(time)4                  -2.202   0.439   25.20  5.2e-07 ***
as.factor(time)6                  -2.626   0.528   24.75  6.5e-07 ***
groupTreatment:as.factor(time)1  -11.406   1.109  105.84  < 2e-16 ***
groupTreatment:as.factor(time)4   -8.824   1.141   59.82  1.0e-14 ***
groupTreatment:as.factor(time)6   -3.152   1.244    6.42   0.0113 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)       43    6.58
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2    0.435  0.0762
alpha.1:3    0.449  0.0848
alpha.1:4    0.506  0.0577
alpha.2:3    0.809  0.1123
alpha.2:4    0.676  0.1036
alpha.3:4    0.698  0.1371
Number of clusters:   100  Maximum cluster size: 4 

Model 2: time is continuous, main effects

Model 2: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T\)

This model includes time as a continuous covariate and doesn’t specify any interaction between treatment and control groups.

model_2 <- geeglm(measurements ~ group + time, id = id, data = long_TLC, family = gaussian(link = "identity"),
    corstr = "unstructured")

# Print summary
summary(model_2)

Call:
geeglm(formula = measurements ~ group + time, family = gaussian(link = "identity"), 
    data = long_TLC, id = id, corstr = "unstructured")

 Coefficients:
               Estimate Std.err    Wald Pr(>|W|)    
(Intercept)     25.5006  0.7040 1312.05  < 2e-16 ***
groupTreatment  -5.3442  1.0307   26.89  2.2e-07 ***
time            -0.1510  0.0913    2.74    0.098 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)       56    7.57
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2  -0.0399  0.1068
alpha.1:3   0.1088  0.1099
alpha.1:4   0.4679  0.0820
alpha.2:3   0.8603  0.0929
alpha.2:4   0.4375  0.1145
alpha.3:4   0.4881  0.1208
Number of clusters:   100  Maximum cluster size: 4 

Model 3: time is continuous, main effects, interaction term

Model 3: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{trt:T}trt*T\)

This model includes time as a continuous covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.

# Define Model 3 and label it as: model_3
model_3 <- geeglm(measurements ~ group + time + time * group, id = id, data = long_TLC,
    family = gaussian(link = "identity"), corstr = "unstructured")

# Print summary
summary(model_3)

Call:
geeglm(formula = measurements ~ group + time + time * group, 
    family = gaussian(link = "identity"), data = long_TLC, id = id, 
    corstr = "unstructured")

 Coefficients:
                    Estimate Std.err    Wald Pr(>|W|)    
(Intercept)           25.589   0.707 1309.53  < 2e-16 ***
groupTreatment        -5.694   1.054   29.19  6.5e-08 ***
time                  -0.265   0.102    6.74   0.0094 ** 
groupTreatment:time    0.600   0.203    8.76   0.0031 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     58.6    7.61
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2 -0.06469  0.1038
alpha.1:3  0.00237  0.1233
alpha.1:4  0.30449  0.1035
alpha.2:3  0.92472  0.0824
alpha.2:4  0.56396  0.1051
alpha.3:4  0.56953  0.1160
Number of clusters:   100  Maximum cluster size: 4 

Model 4: time is continuous, time^2, main effects

Model 4: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{T^2}T^2\)

This model treats time as a continuous covariate and also introduces a quadratic term. As a general note when adding quadratic, or higher power terms, into your regression you can either create a new covariate in your dataframe and put that into your formula, or you can take the existing one and specify the term as “I(x^2)”.

# Define Model 4 and label it as: model_4
model_4 <- geeglm(measurements ~ group + time + I(time^2), id = id, data = long_TLC,
    family = gaussian(link = "identity"), corstr = "unstructured")

# Print summary
summary(model_4)

Call:
geeglm(formula = measurements ~ group + time + I(time^2), family = gaussian(link = "identity"), 
    data = long_TLC, id = id, corstr = "unstructured")

 Coefficients:
               Estimate Std.err   Wald Pr(>|W|)    
(Intercept)     26.1837  0.7144 1343.5  < 2e-16 ***
groupTreatment  -5.1837  1.0267   25.5  4.4e-07 ***
time            -2.0829  0.4685   19.8  8.7e-06 ***
I(time^2)        0.3299  0.0809   16.6  4.5e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     52.3    6.93
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2    0.055  0.1133
alpha.1:3    0.241  0.1109
alpha.1:4    0.435  0.0903
alpha.2:3    0.801  0.1005
alpha.2:4    0.531  0.1120
alpha.3:4    0.548  0.1177
Number of clusters:   100  Maximum cluster size: 4 

Model 5: time is continuous, time^2, main effects, interaction

Model 5: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{T^2}T^2 + \beta_{trt:T}trt*T+\beta_{trt:T^2}trt*T^2\)

This model includes time as a continuous covariate and also includes a quadratic term to capture any possible non-linearity in time and an interaction term to account and adjust for any possible discrepancies between the treatment and control.

# Define Model 5 and label it as: model_5

model_5 <- geeglm(measurements ~ group + time + I(time^2) + group:(time + I(time^2)),
    id = id, data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
# Print summary
summary(model_5)

Call:
geeglm(formula = measurements ~ group + time + I(time^2) + group:(time + 
    I(time^2)), family = gaussian(link = "identity"), data = long_TLC, 
    id = id, corstr = "unstructured")

 Coefficients:
                         Estimate Std.err    Wald Pr(>|W|)    
(Intercept)               25.7785  0.6915 1389.59  < 2e-16 ***
groupTreatment            -3.4262  1.0497   10.65   0.0011 ** 
time                      -0.6750  0.2567    6.91   0.0085 ** 
I(time^2)                  0.0619  0.0417    2.20   0.1377    
groupTreatment:time       -4.8098  0.8088   35.36  2.7e-09 ***
groupTreatment:I(time^2)   0.8814  0.1393   40.00  2.5e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation structure = unstructured 
Estimated Scale Parameters:

            Estimate Std.err
(Intercept)     48.5     6.6
  Link = identity 

Estimated Correlation Parameters:
          Estimate Std.err
alpha.1:2    0.199  0.1099
alpha.1:3    0.399  0.0988
alpha.1:4    0.333  0.1149
alpha.2:3    0.718  0.1067
alpha.2:4    0.719  0.1011
alpha.3:4    0.619  0.1175
Number of clusters:   100  Maximum cluster size: 4 

Now that we have our five models how exactly can we go about choosing which one is the best?

A common method of model selection is using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). For reference \(AIC = 2p - 2log(L)\) where \(L\) is the likelihood and p is the number of covariates (excluding the intercept). Similarly $BIC = plog(n) - 2log(L) $ where n is the sample size of the data. As an aside when we say log() we are strictly talking about the natural logarithm.

Given a collection of models \(\{ \mathcal{M}_m \}^M_{m=1}\) we would like to know which one to choose. AIC and BIC both reward goodness of fit (as assessed by the likelihood function), but this can lead to problems in overfitting (i.e. using too much information that makes a model too specific). Overfitting could arise because increasing the number of covariates could cause the likelihood \(\hat{L}\) to increase. To counteract this problem both the AIC and BIC introduce a penalty term that is meant to punish models that are built with a high number of predictors (predictors is synonymous with covariates). AIC uses the \(2p\) penalty and BIC uses the \(plog(n)\) penalty. Unfortunately for us using the AIC and BIC requires for their to be an actual likelihood. GEE’s don’t have likelihood’s attached to them so it would be non-sensical for us to use either AIC or BIC to select a model.

One way to work around this issue is through the introduction of the Quasi-likelihood Information Criterion (QIC). Developed by Pan (2001), the QIC is a modification of the AIC for models fitted by GEE. Using the QIC() function from the geepack package we can calculate the QIC’s of each model and see which of the five has the minimum QIC.

QIC_comparisons <- data.frame(`Model Name` = c("Model 1", "Model 2", "Model 3", "Model 4",
    "Model 5"), QIC = c(QIC(model_1)[1], QIC(model_2)[1], QIC(model_3)[1], QIC(model_4)[1],
    QIC(model_5)[1]))
QIC_comparisons
  Model.Name   QIC
1    Model 1 17225
2    Model 2 22419
3    Model 3 23441
4    Model 4 20927
5    Model 5 19397

From the above we can see that Model 1 has the smallest QIC out of the five models.