library(tidyr) #Allows for us to manipulate the data structure
library(data.table) #Allows for us to manipulate the data structure
library(ggplot2) #this makes better looking plots in R
# new packages for models
library(nlme) #used for the gls() function
library(gee) # install.packages('gee'), used for the gee() function
library(geepack) # install.packages('geepack'), used for the geeglm() function
library(geeM) # install.packages('geeM'), used for the geeglm() function
theme_set(theme_minimal())
2 Marginal Model (GEE)
2.1 Introduction
For this section we will return to the TLC dataset with N = 100 participants in order to better understand marginal models for longitudinal data when the outcome is continuous.
2.2 Required Packages
We will use the following packages:
nlme
: This package is for fitting and comparing Gaussian linear and nonlinear mixed-effects models. The variance-covariance structures for the residuals can be specified, and it is useful for data with repeated measures or longitudinal designs. We will use thegls()
function to fit a linear model using generalized least squares.lme4
: This package is for fitting mixed-effects models. We will use thelmer()
function to fit a linear model.
These packages are for applying the generalized estimating equations (GEE) approach for fitting marginal generalized linear models to data with repeated measures or longitudinal designs:
gee
: This is the “Generalized Estimation Equation Solver” package.gpack
: This is the “Generalized Estimating Equation Package” package.geeM
: This is the “Solve Generalized Estimating Equations” package.
2.3 Constructing Marginal Models
Consider a continuous outcome \(Y\) which indicates a patient’s blood lead level (in micrograms per deciLiter \(\mu\)g/dL). The correct way to write Y when doing a model like this would be as \(Y_{ij}\). \(Y_{ij}\) is the lead blood level of patient i at time point j. Likewise the covariates \(X = (1,X^1, \dots,X^p)\) would be as \(X_{ij} = (1,X^1_{ij},\dots X^p_{ij})\), the p covariates for individual i measured at time point j. 1 is simply the intercept. The superscript here just represents the \(k-th\) covariate in the analysis (i.e. in other words we have p covariates).
In these different modelling strategies we always specify a link between the mean of our distribution and the data, \(E(y_{ij} \mid X_{ij}) = \mu_{ij} = g^{-1}(X^T_{ij}\beta)\). Here \(g^{-1}\) is what is commonly referred to as a link function. It is exactly what it sounds like. A function which links the mean of your distribution to your data. There are many common link functions that you’ve probably already dealt with in other regression courses. For example there is the logit link function \(g(\mu) =log(\frac{\mu}{1-\mu}) = X^T\beta\) that one uses when doing logistic function.
When one is doing linear regression a common link function is just the identity link \(g(\mu) =\mu = X^T\beta\).
Here we introduce a few functions and corresponding packages to fit these models:
1. Using the gls() function to fit a linear model using generalized least squares
The syntax for the function gls()
in nlme
package is
gls(model, data, correlation, weights, subset, method, na.action, control, verbose)
- Description
model
: A two-sided linear formula object describing the modeldata
: Optional dataframecorrelation
: See description for details.- The corStruct object describes the within-group correlation structure.
- Note we use
corCompSymm
which has the following syntax: corCompSymm(value, form, fixed) compound symmetry structure corresponding to a constant correlation.
weights
: See description for details.- Describes the variance function.
- Note we use
varClasses
which defines standard classes of variance function structures and usevarIdent
which describes constant variance(s), that is generally used to allow different variances.
We consider the first model discussed in lecture. This model includes time as a categorical covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.
Model 1: \[E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T_1}T_1 + \beta_{T_4}T_4 + \beta_{T_6}T_6 + \] \[\beta_{trt:T_1}trt*T_1+ \beta_{trt:T_4}trt*T_4 + \beta_{trt:T_6}trt*T_6\]
# Model fit with compound symmetry heterogeneous variances
# Model 1: Response Profiles (time T categorical)
<- gls(measurements ~ factor(group) * factor(level), data = long_TLC,
fit.compsym corr = corCompSymm(, form = ~factor(level) | id), weights = varIdent(form = ~1 |
factor(level)))
summary(fit.compsym)
Generalized least squares fit by REML
Model: measurements ~ factor(group) * factor(level)
Data: long_TLC
AIC BIC logLik
2459.96 2511.587 -1216.98
Correlation Structure: Compound symmetry
Formula: ~factor(level) | id
Parameter estimate(s):
Rho
0.6102708
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | factor(level)
Parameter estimates:
lead0 lead1 lead4 lead6
1.000000 1.279672 1.323197 1.519220
Coefficients:
Value Std.Error t-value p-value
(Intercept) 26.272 0.7237951 36.29757 0.0000
factor(group)Treatment 0.268 1.0236008 0.26182 0.7936
factor(level)lead1 -1.612 0.7506796 -2.14739 0.0324
factor(level)lead4 -2.202 0.7713883 -2.85459 0.0045
factor(level)lead6 -2.626 0.8726932 -3.00908 0.0028
factor(group)Treatment:factor(level)lead1 -11.406 1.0616213 -10.74394 0.0000
factor(group)Treatment:factor(level)lead4 -8.824 1.0909078 -8.08868 0.0000
factor(group)Treatment:factor(level)lead6 -3.152 1.2341746 -2.55393 0.0110
Correlation:
(Intr) fct()T fct()1 fct()4 fct()6
factor(group)Treatment -0.707
factor(level)lead1 -0.211 0.149
factor(level)lead4 -0.181 0.128 0.402
factor(level)lead6 -0.060 0.043 0.383 0.383
factor(group)Treatment:factor(level)lead1 0.149 -0.211 -0.707 -0.285 -0.270
factor(group)Treatment:factor(level)lead4 0.128 -0.181 -0.285 -0.707 -0.271
factor(group)Treatment:factor(level)lead6 0.043 -0.060 -0.270 -0.271 -0.707
f()T:()1 f()T:()4
factor(group)Treatment
factor(level)lead1
factor(level)lead4
factor(level)lead6
factor(group)Treatment:factor(level)lead1
factor(group)Treatment:factor(level)lead4 0.402
factor(group)Treatment:factor(level)lead6 0.383 0.383
Standardized residuals:
Min Q1 Med Q3 Max
-2.1429199 -0.6927789 -0.1528885 0.5263121 5.5480303
Residual standard error: 5.118004
Degrees of freedom: 400 total; 392 residual
anova(fit.compsym)
Denom. DF: 392
numDF F-value p-value
(Intercept) 1 2438.1366 <.0001
factor(group) 1 7.1161 0.008
factor(level) 3 80.5490 <.0001
factor(group):factor(level) 3 46.9288 <.0001
2. Next, we discuss Marginal Models (GEE) and three R Packages
1. Using the gee
package
The syntax for the function gee()
in gee
package is
gee(formula, id, data, family = gaussian, constr = ‘independence’,Mv)
- Using the
GEE
Packageformula
: Symbolic description of the model to be fittedfamily
: Description of the error distribution and link functiondata
: Optional dataframeid
: Vector that identifies the clustersconstr
: Working correlation structure: “independence”, “exchangeable”, “AR-M”, “unstructured”Mv
: order of AR correlation (AR1: Mv = 1)
Consider the same model 1 defined earlier:
# Model 1: Response Profiles (time T categorical)
<- gee(measurements ~ level + group + level * group, id = id, data = long_TLC,
mod_tlc_gee family = gaussian(link = "identity"), corstr = "unstructured")
(Intercept) levellead1 levellead4
26.272 -1.612 -2.202
levellead6 groupTreatment levellead1:groupTreatment
-2.626 0.268 -11.406
levellead4:groupTreatment levellead6:groupTreatment
-8.824 -3.152
summary(mod_tlc_gee)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Unstructured
Call:
gee(formula = measurements ~ level + group + level * group, id = id,
data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
Summary of Residuals:
Min 1Q Median 3Q Max
-16.6620002 -4.6205000 -0.9930005 3.6724993 43.1380019
Coefficients:
Estimate Naive S.E. Naive z Robust S.E.
(Intercept) 26.272 0.9370175 28.0378980 0.7033749
levellead1 -1.612 0.9958441 -1.6187272 0.4330324
levellead4 -2.202 0.9838821 -2.2380730 0.4386752
levellead6 -2.626 0.9316319 -2.8187098 0.5278091
groupTreatment 0.268 1.3251428 0.2022423 0.9944085
levellead1:groupTreatment -11.406 1.4083363 -8.0989180 1.1086833
levellead4:groupTreatment -8.824 1.3914194 -6.3417258 1.1408849
levellead6:groupTreatment -3.152 1.3175265 -2.3923617 1.2439296
Robust z
(Intercept) 37.3513450
levellead1 -3.7225849
levellead4 -5.0196593
levellead6 -4.9752836
groupTreatment 0.2695069
levellead1:groupTreatment -10.2878795
levellead4:groupTreatment -7.7343471
levellead6:groupTreatment -2.5339053
Estimated Scale Parameter: 43.90009
Number of Iterations: 1
Working Correlation
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.4352485 0.4487346 0.5057310
[2,] 0.4352485 1.0000000 0.8094551 0.6759677
[3,] 0.4487346 0.8094551 1.0000000 0.6975035
[4,] 0.5057310 0.6759677 0.6975035 1.0000000
2. Using the geepack
package
The syntax for the function geeglm()
in geepack
package is
geeglm(formula, family = gaussian, data, id, zcor = NULL, constr, std.err = ‘san.se’)
- Using the
geepack
packageformula
: Symbolic description of the model to be fittedfamily
: Description of the error distribution and link functiondata
: Optional dataframeid
: Vector that identifies the clusterscontr
: A character string specifying the correlation structure. The following are permitted: “independence”, “exchangeable”, “ar1”, “unstructured” and “userdefined”zcor
: Enter a user defined correlation structure
Consider the same model 1 defined earlier
# Model 1: Response Profiles (time T categorical)
<- geeglm(measurements ~ level + group + level * group, id = id, data = long_TLC,
mod_tlc_geep family = gaussian(link = "identity"), corstr = "unstructured")
summary(mod_tlc_geep)
Call:
geeglm(formula = measurements ~ level + group + level * group,
family = gaussian(link = "identity"), data = long_TLC, id = id,
corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 26.2720 0.7034 1395.123 < 2e-16 ***
levellead1 -1.6120 0.4330 13.858 0.000197 ***
levellead4 -2.2020 0.4387 25.197 5.18e-07 ***
levellead6 -2.6260 0.5278 24.753 6.52e-07 ***
groupTreatment 0.2680 0.9944 0.073 0.787540
levellead1:groupTreatment -11.4060 1.1087 105.840 < 2e-16 ***
levellead4:groupTreatment -8.8240 1.1409 59.820 1.04e-14 ***
levellead6:groupTreatment -3.1520 1.2439 6.421 0.011280 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 43.02 6.583
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.4352 0.07616
alpha.1:3 0.4487 0.08482
alpha.1:4 0.5057 0.05768
alpha.2:3 0.8095 0.11232
alpha.2:4 0.6760 0.10359
alpha.3:4 0.6975 0.13711
Number of clusters: 100 Maximum cluster size: 4
We can use the QIC() function from the geepack package.
QIC(mod_tlc_geep)
QIC QICu Quasi Lik CIC params QICC
17225 17225 -8604 8 8 17230
3. Using the geeM package
The syntax for the function geem()
in geeM
package is
geem(formula, id, waves = NULL, data, family = gaussian, constr = “independence”, Mv = 1)
- Using the
geeM
packageformula
: Symbolic description of the model to be fittedid
: Vector that identifies the clustersdata
: Optional dataframefamily
: Description of the error distribution and link functionconstr
: A character string specifying the correlation structure. Allowed structures are: “independence”, “exchangeable” (equal correlation), “ar1” (exponential decay), “m-dependent” (m-diagonal), “unstructured”, “fixed”, and “userdefined”.Mv
: for “m-dependent”, the value for m.
Consider the same model 1 defined earlier
# Model 1: Response Profiles (time T categorical)
<- geem(measurements ~ level + group + level * group, id = id, data = long_TLC,
mod_tlc_geeM family = gaussian(link = "identity"), corstr = "unstructured")
summary(mod_tlc_geeM)
Estimates Model SE Robust SE wald p
(Intercept) 26.270 1.143 0.7034 37.3500 0.000e+00
levellead1 -1.612 1.543 0.4330 -3.7230 1.972e-04
levellead4 -2.202 1.487 0.4387 -5.0200 5.200e-07
levellead6 -2.626 1.270 0.5278 -4.9750 6.500e-07
groupTreatment 0.268 1.617 0.9944 0.2695 7.875e-01
levellead1:groupTreatment -11.410 2.182 1.1090 -10.2900 0.000e+00
levellead4:groupTreatment -8.824 2.103 1.1410 -7.7340 0.000e+00
levellead6:groupTreatment -3.152 1.795 1.2440 -2.5340 1.128e-02
Working Correlation:
[,1] [,2] [,3] [,4]
[1,] 1.0000 0.0893 0.1546 0.3835
[2,] 0.0893 1.0000 1.0716 0.6018
[3,] 0.1546 1.0716 1.0000 0.5901
[4,] 0.3835 0.6018 0.5901 1.0000
Correlation Structure: unstructured
Est. Scale Parameter: 65.36
Number of GEE iterations: 1
Number of Clusters: 100 Maximum Cluster Size: 4
Number of observations with nonzero weight: 400
2.4 Response Profile Analysis
Model 1: \[E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T_1}T_1 + \beta_{T_4}T_4 + \beta_{T_6}T_6 + \] \[\beta_{trt:T_1}trt*T_1+ \beta_{trt:T_4}trt*T_4 + \beta_{trt:T_6}trt*T_6\]
This model includes time as a categorical covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.
Using the GLS approach, we can fit a model to include time, treatment, and their interaction as categorical variables (use dummy variables). What is your overall conclusion?
# Recall the example above from the nlme package
# Model 1: Response Profiles (time T categorical)
<- gls(measurements ~ factor(group) * factor(level), data = long_TLC,
resprof_gls_mod_1 corr = corCompSymm(, form = ~factor(level) | id), weights = varIdent(form = ~1 |
factor(level)))
# Print summary
summary(resprof_gls_mod_1)
Generalized least squares fit by REML
Model: measurements ~ factor(group) * factor(level)
Data: long_TLC
AIC BIC logLik
2460 2512 -1217
Correlation Structure: Compound symmetry
Formula: ~factor(level) | id
Parameter estimate(s):
Rho
0.6103
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | factor(level)
Parameter estimates:
lead0 lead1 lead4 lead6
1.000 1.280 1.323 1.519
Coefficients:
Value Std.Error t-value p-value
(Intercept) 26.272 0.7238 36.30 0.0000
factor(group)Treatment 0.268 1.0236 0.26 0.7936
factor(level)lead1 -1.612 0.7507 -2.15 0.0324
factor(level)lead4 -2.202 0.7714 -2.85 0.0045
factor(level)lead6 -2.626 0.8727 -3.01 0.0028
factor(group)Treatment:factor(level)lead1 -11.406 1.0616 -10.74 0.0000
factor(group)Treatment:factor(level)lead4 -8.824 1.0909 -8.09 0.0000
factor(group)Treatment:factor(level)lead6 -3.152 1.2342 -2.55 0.0110
Correlation:
(Intr) fct()T fct()1 fct()4 fct()6
factor(group)Treatment -0.707
factor(level)lead1 -0.211 0.149
factor(level)lead4 -0.181 0.128 0.402
factor(level)lead6 -0.060 0.043 0.383 0.383
factor(group)Treatment:factor(level)lead1 0.149 -0.211 -0.707 -0.285 -0.270
factor(group)Treatment:factor(level)lead4 0.128 -0.181 -0.285 -0.707 -0.271
factor(group)Treatment:factor(level)lead6 0.043 -0.060 -0.270 -0.271 -0.707
f()T:()1 f()T:()4
factor(group)Treatment
factor(level)lead1
factor(level)lead4
factor(level)lead6
factor(group)Treatment:factor(level)lead1
factor(group)Treatment:factor(level)lead4 0.402
factor(group)Treatment:factor(level)lead6 0.383 0.383
Standardized residuals:
Min Q1 Med Q3 Max
-2.1429 -0.6928 -0.1529 0.5263 5.5480
Residual standard error: 5.118
Degrees of freedom: 400 total; 392 residual
Using the GEE approach, we can fit a model to include time, treatment, and their interaction as categorical variables (use dummy variables). Assume unstructured covariance model. Compare model estimates, standard errors, and covariance matrices with those from the methods you used in the previous question.
<- geeglm(measurements ~ group + as.factor(time) + as.factor(time) * group,
model_1 id = id, data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
summary(model_1)
Call:
geeglm(formula = measurements ~ group + as.factor(time) + as.factor(time) *
group, family = gaussian(link = "identity"), data = long_TLC,
id = id, corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 26.272 0.703 1395.12 < 2e-16 ***
groupTreatment 0.268 0.994 0.07 0.7875
as.factor(time)1 -1.612 0.433 13.86 0.0002 ***
as.factor(time)4 -2.202 0.439 25.20 5.2e-07 ***
as.factor(time)6 -2.626 0.528 24.75 6.5e-07 ***
groupTreatment:as.factor(time)1 -11.406 1.109 105.84 < 2e-16 ***
groupTreatment:as.factor(time)4 -8.824 1.141 59.82 1.0e-14 ***
groupTreatment:as.factor(time)6 -3.152 1.244 6.42 0.0113 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 43 6.58
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.435 0.0762
alpha.1:3 0.449 0.0848
alpha.1:4 0.506 0.0577
alpha.2:3 0.809 0.1123
alpha.2:4 0.676 0.1036
alpha.3:4 0.698 0.1371
Number of clusters: 100 Maximum cluster size: 4
# Recall the example above from the geepack package
# Model 1: Response Profiles (time T categorical)
<- geeglm(measurements ~ level + group + level * group, id = id,
resprof_gee_mod_1 data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
# Print summary
summary(resprof_gee_mod_1)
Call:
geeglm(formula = measurements ~ level + group + level * group,
family = gaussian(link = "identity"), data = long_TLC, id = id,
corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 26.272 0.703 1395.12 < 2e-16 ***
levellead1 -1.612 0.433 13.86 0.0002 ***
levellead4 -2.202 0.439 25.20 5.2e-07 ***
levellead6 -2.626 0.528 24.75 6.5e-07 ***
groupTreatment 0.268 0.994 0.07 0.7875
levellead1:groupTreatment -11.406 1.109 105.84 < 2e-16 ***
levellead4:groupTreatment -8.824 1.141 59.82 1.0e-14 ***
levellead6:groupTreatment -3.152 1.244 6.42 0.0113 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 43 6.58
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.435 0.0762
alpha.1:3 0.449 0.0848
alpha.1:4 0.506 0.0577
alpha.2:3 0.809 0.1123
alpha.2:4 0.676 0.1036
alpha.3:4 0.698 0.1371
Number of clusters: 100 Maximum cluster size: 4
# Print QIC
QIC(resprof_gee_mod_1)
QIC QICu Quasi Lik CIC params QICC
17225 17225 -8604 8 8 17230
2.5 Parametric Curves
# create new variable to make time continuous
$week = 0 #baseline, week 0
long_TLC$week[long_TLC$level == "lead1"] = 1 #week 1
long_TLC$week[long_TLC$level == "lead4"] = 4 #week 4
long_TLC$week[long_TLC$level == "lead6"] = 6 #week 6 long_TLC
<- gls(measurements ~ factor(group) + week, data = long_TLC, corr = corCompSymm(,
parcurv_mod_2 form = ~week | id), weights = varIdent(form = ~1 | week), method = "REML")
# Print summary
summary(parcurv_mod_2)
Generalized least squares fit by REML
Model: measurements ~ factor(group) + week
Data: long_TLC
AIC BIC logLik
2663 2695 -1323
Correlation Structure: Compound symmetry
Formula: ~week | id
Parameter estimate(s):
Rho
0.659
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | week
Parameter estimates:
0 1 4 6
1.00 2.22 1.83 1.54
Coefficients:
Value Std.Error t-value p-value
(Intercept) 26.53 0.742 35.8 0.0000
factor(group)Treatment 2.30 1.049 2.2 0.0291
week -0.58 0.098 -5.9 0.0000
Correlation:
(Intr) fct()T
factor(group)Treatment -0.707
week -0.035 0.000
Standardized residuals:
Min Q1 Med Q3 Max
-2.5179 -1.0636 -0.3993 0.0715 4.5654
Residual standard error: 5.48
Degrees of freedom: 400 total; 397 residual
2.6 More Marginal Models
Model 1: time is categorical, main effects, interaction terms
Model 1: \[E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T_1}T_1 + \beta_{T_4}T_4 + \beta_{T_6}T_6 + \beta_{trt:T_1}trt*T_1+ \] \[\beta_{trt:T_4}trt*T_4 + \beta_{trt:T_6}trt*T_6\]
This model includes time as a categorical covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.
<- geeglm(measurements ~ group + as.factor(time) + as.factor(time) * group,
model_1 id = id, data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
# Print summary
summary(model_1)
Call:
geeglm(formula = measurements ~ group + as.factor(time) + as.factor(time) *
group, family = gaussian(link = "identity"), data = long_TLC,
id = id, corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 26.272 0.703 1395.12 < 2e-16 ***
groupTreatment 0.268 0.994 0.07 0.7875
as.factor(time)1 -1.612 0.433 13.86 0.0002 ***
as.factor(time)4 -2.202 0.439 25.20 5.2e-07 ***
as.factor(time)6 -2.626 0.528 24.75 6.5e-07 ***
groupTreatment:as.factor(time)1 -11.406 1.109 105.84 < 2e-16 ***
groupTreatment:as.factor(time)4 -8.824 1.141 59.82 1.0e-14 ***
groupTreatment:as.factor(time)6 -3.152 1.244 6.42 0.0113 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 43 6.58
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.435 0.0762
alpha.1:3 0.449 0.0848
alpha.1:4 0.506 0.0577
alpha.2:3 0.809 0.1123
alpha.2:4 0.676 0.1036
alpha.3:4 0.698 0.1371
Number of clusters: 100 Maximum cluster size: 4
Model 2: time is continuous, main effects
Model 2: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T\)
This model includes time as a continuous covariate and doesn’t specify any interaction between treatment and control groups.
<- geeglm(measurements ~ group + time, id = id, data = long_TLC, family = gaussian(link = "identity"),
model_2 corstr = "unstructured")
# Print summary
summary(model_2)
Call:
geeglm(formula = measurements ~ group + time, family = gaussian(link = "identity"),
data = long_TLC, id = id, corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 25.5006 0.7040 1312.05 < 2e-16 ***
groupTreatment -5.3442 1.0307 26.89 2.2e-07 ***
time -0.1510 0.0913 2.74 0.098 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 56 7.57
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 -0.0399 0.1068
alpha.1:3 0.1088 0.1099
alpha.1:4 0.4679 0.0820
alpha.2:3 0.8603 0.0929
alpha.2:4 0.4375 0.1145
alpha.3:4 0.4881 0.1208
Number of clusters: 100 Maximum cluster size: 4
Model 3: time is continuous, main effects, interaction term
Model 3: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{trt:T}trt*T\)
This model includes time as a continuous covariate and also includes an interaction term to account and adjust for any possible discrepancies between the treatment and control.
# Define Model 3 and label it as: model_3
<- geeglm(measurements ~ group + time + time * group, id = id, data = long_TLC,
model_3 family = gaussian(link = "identity"), corstr = "unstructured")
# Print summary
summary(model_3)
Call:
geeglm(formula = measurements ~ group + time + time * group,
family = gaussian(link = "identity"), data = long_TLC, id = id,
corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 25.589 0.707 1309.53 < 2e-16 ***
groupTreatment -5.694 1.054 29.19 6.5e-08 ***
time -0.265 0.102 6.74 0.0094 **
groupTreatment:time 0.600 0.203 8.76 0.0031 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 58.6 7.61
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 -0.06469 0.1038
alpha.1:3 0.00237 0.1233
alpha.1:4 0.30449 0.1035
alpha.2:3 0.92472 0.0824
alpha.2:4 0.56396 0.1051
alpha.3:4 0.56953 0.1160
Number of clusters: 100 Maximum cluster size: 4
Model 4: time is continuous, time^2, main effects
Model 4: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{T^2}T^2\)
This model treats time as a continuous covariate and also introduces a quadratic term. As a general note when adding quadratic, or higher power terms, into your regression you can either create a new covariate in your dataframe and put that into your formula, or you can take the existing one and specify the term as “I(x^2)”.
# Define Model 4 and label it as: model_4
<- geeglm(measurements ~ group + time + I(time^2), id = id, data = long_TLC,
model_4 family = gaussian(link = "identity"), corstr = "unstructured")
# Print summary
summary(model_4)
Call:
geeglm(formula = measurements ~ group + time + I(time^2), family = gaussian(link = "identity"),
data = long_TLC, id = id, corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 26.1837 0.7144 1343.5 < 2e-16 ***
groupTreatment -5.1837 1.0267 25.5 4.4e-07 ***
time -2.0829 0.4685 19.8 8.7e-06 ***
I(time^2) 0.3299 0.0809 16.6 4.5e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 52.3 6.93
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.055 0.1133
alpha.1:3 0.241 0.1109
alpha.1:4 0.435 0.0903
alpha.2:3 0.801 0.1005
alpha.2:4 0.531 0.1120
alpha.3:4 0.548 0.1177
Number of clusters: 100 Maximum cluster size: 4
Model 5: time is continuous, time^2, main effects, interaction
Model 5: \(E(y_{ij} \mid X_{ij}) = \beta_0 + \beta_{trt}trt + \beta_{T}T + \beta_{T^2}T^2 + \beta_{trt:T}trt*T+\beta_{trt:T^2}trt*T^2\)
This model includes time as a continuous covariate and also includes a quadratic term to capture any possible non-linearity in time and an interaction term to account and adjust for any possible discrepancies between the treatment and control.
# Define Model 5 and label it as: model_5
<- geeglm(measurements ~ group + time + I(time^2) + group:(time + I(time^2)),
model_5 id = id, data = long_TLC, family = gaussian(link = "identity"), corstr = "unstructured")
# Print summary
summary(model_5)
Call:
geeglm(formula = measurements ~ group + time + I(time^2) + group:(time +
I(time^2)), family = gaussian(link = "identity"), data = long_TLC,
id = id, corstr = "unstructured")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) 25.7785 0.6915 1389.59 < 2e-16 ***
groupTreatment -3.4262 1.0497 10.65 0.0011 **
time -0.6750 0.2567 6.91 0.0085 **
I(time^2) 0.0619 0.0417 2.20 0.1377
groupTreatment:time -4.8098 0.8088 35.36 2.7e-09 ***
groupTreatment:I(time^2) 0.8814 0.1393 40.00 2.5e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation structure = unstructured
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 48.5 6.6
Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha.1:2 0.199 0.1099
alpha.1:3 0.399 0.0988
alpha.1:4 0.333 0.1149
alpha.2:3 0.718 0.1067
alpha.2:4 0.719 0.1011
alpha.3:4 0.619 0.1175
Number of clusters: 100 Maximum cluster size: 4
Now that we have our five models how exactly can we go about choosing which one is the best?
A common method of model selection is using the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC). For reference \(AIC = 2p - 2log(L)\) where \(L\) is the likelihood and p is the number of covariates (excluding the intercept). Similarly $BIC = plog(n) - 2log(L) $ where n is the sample size of the data. As an aside when we say log() we are strictly talking about the natural logarithm.
Given a collection of models \(\{ \mathcal{M}_m \}^M_{m=1}\) we would like to know which one to choose. AIC and BIC both reward goodness of fit (as assessed by the likelihood function), but this can lead to problems in overfitting (i.e. using too much information that makes a model too specific). Overfitting could arise because increasing the number of covariates could cause the likelihood \(\hat{L}\) to increase. To counteract this problem both the AIC and BIC introduce a penalty term that is meant to punish models that are built with a high number of predictors (predictors is synonymous with covariates). AIC uses the \(2p\) penalty and BIC uses the \(plog(n)\) penalty. Unfortunately for us using the AIC and BIC requires for their to be an actual likelihood. GEE’s don’t have likelihood’s attached to them so it would be non-sensical for us to use either AIC or BIC to select a model.
One way to work around this issue is through the introduction of the Quasi-likelihood Information Criterion (QIC). Developed by Pan (2001), the QIC is a modification of the AIC for models fitted by GEE. Using the QIC() function from the geepack package we can calculate the QIC’s of each model and see which of the five has the minimum QIC.
<- data.frame(`Model Name` = c("Model 1", "Model 2", "Model 3", "Model 4",
QIC_comparisons "Model 5"), QIC = c(QIC(model_1)[1], QIC(model_2)[1], QIC(model_3)[1], QIC(model_4)[1],
QIC(model_5)[1]))
QIC_comparisons
Model.Name QIC
1 Model 1 17225
2 Model 2 22419
3 Model 3 23441
4 Model 4 20927
5 Model 5 19397
From the above we can see that Model 1 has the smallest QIC out of the five models.