1  Exploratory Data Analysis

Learning Objectives

  1. Understand how to conduct exploratory data analysis for longitudinal data.

  2. Understand how to create plots to visualize trends over time and interpret results.

  3. Understand how to calculate descriptive statistics (mean, standard deviation, variance-covariance matrix, correlation matrix) and interpret results.

  4. Understand how to identify characteristics of longitudinal data.

1.1 Converting between data formats (wide and long format)

For the most part there are two formats that your data can come in. The wide format and the long format. The long format is when patients within the data have more than one observation. In other words, each row is snapshot into a subject’s history at a specific time point. In the case of our data, each subject has four observations corresponding to their four lead measurements (initial measurement, 1 week measurement, 4 week measurement, and 6 week measurement). The code below details how to convert from wide format to long format.

The arguments to gather():

  • data: Data object (e.g. the data object here is TLC).
  • key: Name of new key column (made from names of data columns).
  • value: Name of new value column.
  • ...: Names of source columns that contain values.
  • factor_key: Treat the new key column as a factor (instead of character vector).

Then, we print the first 16 rows of the long format TLC data.

long_TLC <- tidyr::gather(TLC, level, measurements, lead0:lead6, factor_key = TRUE)
long_TLC <- long_TLC[order(long_TLC$id), ]
head(long_TLC, n = 16)
    id     group level measurements
1    1   Placebo lead0         30.8
101  1   Placebo lead1         26.9
201  1   Placebo lead4         25.8
301  1   Placebo lead6         23.8
2    2 Treatment lead0         26.5
102  2 Treatment lead1         14.8
202  2 Treatment lead4         19.5
302  2 Treatment lead6         21.0
3    3 Treatment lead0         25.8
103  3 Treatment lead1         23.0
203  3 Treatment lead4         19.1
303  3 Treatment lead6         23.2
4    4   Placebo lead0         24.7
104  4   Placebo lead1         24.5
204  4   Placebo lead4         22.0
304  4   Placebo lead6         22.5

The wide format is when each row corresponds to a unique subject. The below shows one how to convert from long format to wide.

The arguments to spread():

  • data: Data object.
  • key: Name of column containing the new column names.
  • value: Name of column containing values.

Then, we print the first 10 rows of the converted wide format TLC data which should be same as our original data.

wide_TLC <- spread(long_TLC, level, measurements)
head(wide_TLC, n = 10)
   id     group lead0 lead1 lead4 lead6
1   1   Placebo  30.8  26.9  25.8  23.8
2   2 Treatment  26.5  14.8  19.5  21.0
3   3 Treatment  25.8  23.0  19.1  23.2
4   4   Placebo  24.7  24.5  22.0  22.5
5   5 Treatment  20.4   2.8   3.2   9.4
6   6 Treatment  20.4   5.4   4.5  11.9
7   7   Placebo  28.6  20.8  19.2  18.4
8   8   Placebo  33.7  31.6  28.5  25.1
9   9   Placebo  19.7  14.9  15.3  14.7
10 10   Placebo  31.1  31.2  29.2  30.1

1.2 Graphical Representation of Longitudinal Data

In this section we will create some visual representations of the data. While this is possible to do in base R, we will use ggplot. The plots look nicer and it is a reliable tool for making data visualizations. We will begin by plotting the blood lead level (BLL) trajectories for the first seven subjects.

1.2.1 Individual Trajectory Plot

We need to utilize ggplot package to graph the individual trajectory plot. Before that, we firstly make some modification on our data. We create a new numeric column named time corresponding to the level column which represents the number of weeks for the measurement as our x-axis timing variable. Next, we convert the id column into factor so we can have each individual as a group.

# create a new numeric timing variable
long_TLC$time <- c(0, 1, 4, 6)[long_TLC$level]

# convert the id column into factor for grouping
long_TLC$id <- as.factor(long_TLC$id)

Next, we are reading to graph the individual trajectory plot, and we only focus on the first 7 id’s individual.

# create plot
lead_trajectories <- ggplot(data = long_TLC[(long_TLC[,"id"] %in% 1:7), ]) + #only focus on id's 1-7
  geom_line(aes(x = time, y = measurements, color = id, group = id),size = 1.7) + 
  theme(axis.line = element_line(colour = "black", size = 2),
        text = element_text(size = 20),
        axis.text = element_text(colour = "black", size = 16, face = "bold"),
        axis.title = element_text(size = 16, face="bold"),
        axis.ticks.length = unit(.25, "cm"),
        axis.ticks = element_line(colour = "black", size = 1.5),
        legend.background = element_blank()) +
  scale_color_manual(name = "ID", values = c("green", "red", "purple", "blue",
                                             "yellow", "pink", "orange"),
                     labels = sapply(1:7, function(x) paste0("id", " = ", x))) +
  ylab(~ paste("Blood Lead levels (", mu, "g/dL)")) +
  xlab("Time (weeks)")

# print the individual trajectory plot
print(lead_trajectories)

1.2.2 Scatter Plot

Next, we create a scatter plot to evaluate BLL over time. We stratify the data by group.

long_TLC$factor_time <- as.factor(long_TLC$time)

#create plot
lead_point_plot <- ggplot(long_TLC, 
                          aes(x = factor_time, y = measurements, color = group)) + 
  geom_point() + 
  facet_wrap(.~group) + # this allows us to make separate boxes for the groups
  theme(axis.line = element_line(colour = "black", size = 2),
        text = element_text(size = 20),
        axis.text = element_text(colour = "black", size = 16, face="bold"),
        axis.title = element_text(size = 16, face = "bold"),
        axis.ticks.length=unit(.25, "cm"),
        axis.ticks = element_line(colour = "black", size = 1.5)) + 
  ylab(~ paste("Blood Lead levels (", mu, "g/dL)")) + 
  xlab("Time (weeks)")

# print the scatter plot
print(lead_point_plot)

1.2.3 Mean Plot

To plot averages over time, we first summarize the data. Here we summarize count, mean, standard deviation (SD) and variance of blood lead levels over time (or for each occasion).

#create table summarizing the blood lead levels
lead_overall_summary <- long_TLC %>%
  group_by(time) %>% #CHECK
  summarise(n = (length(measurements) - sum(is.na(measurements))),
            mean = round(mean(measurements, na.rm = T), 3),
            sd = round(sd(measurements, na.rm = T), 3),
            var = round(var(measurements, na.rm = T), 3))

#output table for overall averages
lead_overall_summary %>%
  mutate_all(linebreak) %>%
  kbl(caption = "Summary of average lead levels from TLC study",
      col.names=linebreak(c("Time","N", "Mean", "SD", "Variance")),
      booktabs=T, escape=F, align = "c") %>%
  kable_styling(full_width = FALSE, latex_options = c('hold_position'))
Summary of average lead levels from TLC study
Time N Mean SD Variance
0 100 26.406 4.999 24.989
1 100 19.091 8.673 75.225
4 100 19.792 8.086 65.385
6 100 22.204 7.756 60.159

Next, we create a plot of means.

lead_mean_plot <- ggplot(data = lead_overall_summary, aes(x = time, y = mean)) +
    geom_point(size = 2) + geom_line(size = 2) + ylab(~paste("Lead levels (", mu,
    "g/dL)")) + xlab("Time (weeks)")

# print the mean plot
print(lead_mean_plot)

1.2.4 Box Plot

Another helpful method of visualizing the data is to use a boxplot. A boxplot is a standardized way of displaying the distribution of data based on five statistics: The minimum (bottom line), the 1st quartile (bottom of the box), the median (the line inside the box), the 3rd quartile (top of the box) and the maximum (top line). Sometimes you will encounter values that are above the maximum or below the minimum. These are known as outliers. This theoretically shouldn’t be true, but it occurs due to how the min and max are defined. The maximum is defined as the third quartile plus 1.5 times the InterQuartile Range (3rd quartile minus 1st quartile), and the minimum is defined as the 1st quartile minus 1.5 times the IQR.

lead_box_plot <- ggplot(long_TLC, aes(x = factor_time, y = measurements, fill = group)) +
  geom_boxplot() +
  facet_wrap(.~group) + #this is what allows us to make separate boxes for the groups
  theme(axis.line = element_line(colour = "black", linewidth = 2),
        text = element_text(size = 20),
        axis.text = element_text(colour = "black", size = 20, face = "bold"),
        axis.title = element_text(size = 24, face ="bold"),
        axis.ticks.length=unit(.25, "cm"),
        axis.ticks = element_line(colour = "black", linewidth = 1.5)) +
  ylab(~ paste(" Lead levels ( ", mu, "g/dL )")) +
  xlab("Time (weeks)")

# print the boxplot
print(lead_box_plot)

1.2.5 Correlation Plot

Next, we can create a correlation plot using GGally package.

ggpairs(TLC, columns = c(2:5))

1.3 Descriptive statistics for longitudinal data

In this section we will calculate some summary statistics for our continuous covariates. For simplicity I will be using the wide format of the data since every observation (i.e. row) will correspond to a unique id. There are many functions in R which will calculate any type of summary statistics you can think of. The most general is the summary() function in base R which calculates the minimum, 1st Quartile (25% Percentile), Median, Mean, 3rd Quartile (75%) and Maximum.

1.3.1 Calculate summary statistics by group

# Use by() function (in base R) to calculate summary statistics by Group
by(TLC[, c("lead0", "lead1", "lead4", "lead6")], TLC[, "group"], FUN = summary)
TLC[, "group"]: Placebo
     lead0           lead1           lead4           lead6      
 Min.   :19.70   Min.   :14.90   Min.   :15.30   Min.   :13.50  
 1st Qu.:21.88   1st Qu.:20.93   1st Qu.:19.82   1st Qu.:19.95  
 Median :25.25   Median :24.10   Median :22.45   Median :22.35  
 Mean   :26.27   Mean   :24.66   Mean   :24.07   Mean   :23.65  
 3rd Qu.:29.73   3rd Qu.:27.82   3rd Qu.:27.45   3rd Qu.:27.50  
 Max.   :38.10   Max.   :40.80   Max.   :38.60   Max.   :43.30  
------------------------------------------------------------ 
TLC[, "group"]: Treatment
     lead0           lead1            lead4            lead6      
 Min.   :19.70   Min.   : 2.800   Min.   : 3.000   Min.   : 4.10  
 1st Qu.:22.12   1st Qu.: 7.225   1st Qu.: 9.125   1st Qu.:15.40  
 Median :26.20   Median :12.250   Median :15.350   Median :18.85  
 Mean   :26.54   Mean   :13.522   Mean   :15.514   Mean   :20.76  
 3rd Qu.:29.55   3rd Qu.:17.500   3rd Qu.:19.725   3rd Qu.:23.75  
 Max.   :41.10   Max.   :39.000   Max.   :40.400   Max.   :63.90  

1.3.2 Calculate summary statistics for all BLL data (not as groups)

lead_all <- c(TLC$lead0, TLC$lead1, TLC$lead4, TLC$lead6)
summary(lead_all)  #print summary statistics
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.80   17.50   21.90   21.87   26.73   63.90 

If you’re in need of even more summary statistics you can use the stat.desc package from the pastecs library.stat.desc provides you with the additional following descriptive statistics: the number of values (nbr.val), the number of null values (nbr.null), the number of missing values (nbr.na), the minimal value (min), the maximal value (max), the range (range, that is, max-min) and the sum of all non-missing values (sum), the median (median), the mean (mean), the standard error on the mean (SE.mean), the confidence interval of the mean (CI.mean) at the p level, the variance (var), the standard deviation (std.dev) and the variation coefficient (coef.var) defined as the standard deviation divided by the mean.

### Calculate summary statistics by group
library(pastecs)
# Use by() function (in base R) to calculate summary statistics by Group
by(TLC[, c("lead0", "lead1", "lead4", "lead6")], TLC[, "group"], FUN = stat.desc)
TLC[, "group"]: Placebo
                    lead0        lead1        lead4        lead6
nbr.val        50.0000000   50.0000000   50.0000000   50.0000000
nbr.null        0.0000000    0.0000000    0.0000000    0.0000000
nbr.na          0.0000000    0.0000000    0.0000000    0.0000000
min            19.7000010   14.9000000   15.3000000   13.5000000
max            38.0999980   40.7999990   38.5999980   43.2999990
range          18.3999970   25.8999990   23.2999980   29.7999990
sum          1313.5999970 1232.9999970 1203.5000030 1182.2999980
median         25.2500000   24.0999995   22.4500010   22.3500005
mean           26.2719999   24.6599999   24.0700001   23.6460000
SE.mean         0.7105160    0.7723275    0.8136150    0.7975893
CI.mean.0.95    1.4278353    1.5520502    1.6350205    1.6028157
var            25.2416481   29.8244892   33.0984669   31.8074323
std.dev         5.0241067    5.4611802    5.7531267    5.6398078
coef.var        0.1912343    0.2214591    0.2390165    0.2385100
------------------------------------------------------------ 
TLC[, "group"]: Treatment
                    lead0       lead1       lead4        lead6
nbr.val        50.0000000  50.0000000  50.0000000   50.0000000
nbr.null        0.0000000   0.0000000   0.0000000    0.0000000
nbr.na          0.0000000   0.0000000   0.0000000    0.0000000
min            19.7000010   2.8000000   3.0000000    4.0999999
max            41.0999980  39.0000000  40.4000020   63.9000020
range          21.3999970  36.2000000  37.4000020   59.8000021
sum          1326.9999960 676.0999983 775.6999992 1038.1000055
median         26.2000000  12.2500000  15.3500000   18.8499995
mean           26.5399999  13.5220000  15.5140000   20.7620001
SE.mean         0.7100675   1.0850535   1.1104697    1.3076288
CI.mean.0.95    1.4269341   2.1804967   2.2315724    2.6277784
var            25.2097935  58.8670567  61.6571480   85.4946530
std.dev         5.0209355   7.6724870   7.8522066    9.2463319
coef.var        0.1891837   0.5674077   0.5061368    0.4453488

What do you observe in the variance-covariance matrices below? What happens when the units change? Write 1-3 sentences.

### Covariance Matrix for lead levels in ug/dL
round(cov(TLC[, c("lead0", "lead1", "lead4", "lead6")]), 5)
         lead0    lead1    lead4    lead6
lead0 24.98905 18.16066 18.92146 21.78220
lead1 18.16066 75.22487 59.24104 37.48690
lead4 18.92146 59.24104 65.38539 36.54235
lead6 21.78220 37.48690 36.54235 60.15898
### Correlation Matrix for lead levels in mg/dL
round(cov(TLC[, c("lead0", "lead1", "lead4", "lead6")] * 10^-3), 6)
        lead0   lead1   lead4   lead6
lead0 2.5e-05 1.8e-05 1.9e-05 2.2e-05
lead1 1.8e-05 7.5e-05 5.9e-05 3.7e-05
lead4 1.9e-05 5.9e-05 6.5e-05 3.7e-05
lead6 2.2e-05 3.7e-05 3.7e-05 6.0e-05

What do you observe in the correlation matrices below? Write 1-3 sentences.

### Correlation Matrix for lead levels in ug/dL
round(cor(TLC[, c("lead0", "lead1", "lead4", "lead6")]), 3)
      lead0 lead1 lead4 lead6
lead0 1.000 0.419 0.468 0.562
lead1 0.419 1.000 0.845 0.557
lead4 0.468 0.845 1.000 0.583
lead6 0.562 0.557 0.583 1.000
### Correlation Matrix for lead levels in mg/dL
round(cor(TLC[, c("lead0", "lead1", "lead4", "lead6")] * 10^-3), 3)
      lead0 lead1 lead4 lead6
lead0 1.000 0.419 0.468 0.562
lead1 0.419 1.000 0.845 0.557
lead4 0.468 0.845 1.000 0.583
lead6 0.562 0.557 0.583 1.000