5  Creating “Noisy” Data

library(dplyr)
library(tidyverse)
library(tibble)
library(ggplot2)
library(knitr)
library(hrbrthemes)
library(viridis)

library(devtools)
#using data specified in this github repository:
install_github("jhs-hwg/cardioStatsUSA")
library(cardioStatsUSA)
#to prevent errors, exclude the rows with na
used_vars = c('cc_diabetes', 'bp_sys_mean', 'demo_age_years', 'demo_race', 'demo_gender', 'cc_bmi', 'cc_smoke', 'bp_med_use')
clean_nhanes <- nhanes_data[complete.cases(nhanes_data[,..used_vars]), ]

In normal usage of measurement error techniques, the data is assumed to have systematic error arising from measurement of the variables which we aim to remedy. In our case, we believe that the NHANES data has no measurement error, so we will instead simulate error by adding in random noise to the existing data to create a “noisy” dataset.

First, let’s remind ourselves the relationship between blood pressure and diabetes visually:

clean_nhanes %>%
  ggplot(aes(x=bp_sys_mean, color=cc_diabetes)) +
    geom_histogram(fill="white", alpha=0.5, bins = 80) +
    ggtitle("Blood Pressure with Diabetes Histogram")

clean_nhanes %>%
  ggplot(aes(x=cc_diabetes, y=bp_sys_mean, fill=cc_diabetes)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Blood Pressure VS Diabetes Status") +
    xlab("")

With this much data, the true true statistic (in this case, the mean blood pressure values for both groups) will be clear no matter how much noise we add. However, in normal circumstances we would not have this much data. We can instead mimic more realistic scenarios by taking a smaller subset of this data to examine. The goal will be to add noise to obscure the relationship between diabetes and blood pressure, and then use measurement error correction to rediscover the true relationship.

Let’s start by taking a subset of n=150. Since in the total data set, about 13% of individuals had diabetes, we will keep this ratio similar here.

set.seed(19)
#original: seed 52, subset yes = 20, subset no = 130

subset_diab_yes <- subset(clean_nhanes, cc_diabetes == "Yes") 
subset_diab_no <- subset(clean_nhanes, cc_diabetes == "No")

sample_diab_yes <- subset_diab_yes[sample(1:nrow(subset_diab_yes), 78, replace=FALSE),]
sample_diab_no <- subset_diab_no[sample(1:nrow(subset_diab_no), 522, replace=FALSE),]

subset_nhanes <- rbind(sample_diab_yes, sample_diab_no)
head(subset_nhanes)
svy_id svy_weight_mec svy_psu svy_strata svy_year svy_subpop_htn svy_subpop_chol demo_age_cat demo_race demo_race_black demo_age_years demo_pregnant demo_gender bp_sys_mean bp_dia_mean bp_cat_meds_excluded bp_cat_meds_included bp_control_jnc7 bp_control_accaha bp_control_escesh_1 bp_control_escesh_2 bp_control_140_90 bp_control_130_80 bp_uncontrolled_jnc7 bp_uncontrolled_accaha bp_uncontrolled_escesh_1 bp_uncontrolled_escesh_2 bp_uncontrolled_140_90 bp_uncontrolled_130_80 bp_med_use bp_med_recommended_jnc7 bp_med_recommended_accaha bp_med_recommended_escesh bp_med_n_class bp_med_n_pills bp_med_combination bp_med_pills_gteq_2 bp_med_ace bp_med_aldo bp_med_alpha bp_med_angioten bp_med_beta bp_med_central bp_med_ccb bp_med_ccb_dh bp_med_ccb_ndh bp_med_diur_Ksparing bp_med_diur_loop bp_med_diur_thz bp_med_renin_inhibitors bp_med_vasod htn_jnc7 htn_accaha htn_escesh htn_aware htn_resistant_jnc7 htn_resistant_accaha htn_resistant_jnc7_thz htn_resistant_accaha_thz chol_measured_never chol_measured_last chol_total chol_total_gteq_200 chol_total_gteq_240 chol_hdl chol_hdl_low chol_trig chol_trig_gteq_150 chol_ldl chol_ldl_5cat chol_ldl_lt_70 chol_ldl_gteq_70 chol_ldl_lt_100 chol_ldl_gteq_100 chol_ldl_gteq_190 chol_ldl_persistent chol_nonhdl chol_nonhdl_5cat chol_nonhdl_lt_100 chol_nonhdl_gteq_100 chol_nonhdl_gteq_220 chol_med_use chol_med_use_sr chol_med_statin chol_med_ezetimibe chol_med_pcsk9i chol_med_bile chol_med_fibric_acid chol_med_atorvastatin chol_med_simvastatin chol_med_rosuvastatin chol_med_pravastatin chol_med_pitavastatin chol_med_fluvastatin chol_med_lovastatin chol_med_other chol_med_addon_use chol_med_addon_recommended_ahaacc chol_med_statin_recommended_ahaacc chol_med_recommended_ever ascvd_risk_vh_ahaacc cc_smoke cc_bmi cc_diabetes cc_ckd cc_acr cc_egfr cc_hba1c cc_egfr_lt60 cc_acr_gteq30 cc_cvd_mi cc_cvd_chd cc_cvd_stroke cc_cvd_ascvd cc_cvd_hf cc_cvd_any
37436 22062.121 2 47 2005-2006 1 1 75+ Non-Hispanic White No 77 No Men 134.0000 60.66667 SBP of 130 to <140 or DBP 80 to <90 mm Hg taking antihypertensive medications No No Yes Yes Yes No Yes Yes No No No Yes Yes Yes Yes Yes Three Three No Yes Yes No Yes No No No No No No No Yes No No No Yes Yes Yes Yes Yes Yes No No Cholesterol has been measured previously In the past year 136 No No 23 Yes 154 Yes 84.21057 70 to <100 mg/dL No Yes Yes No No No 113 100 to <130 mg/dL No Yes No Yes Yes No No No No Yes No No No No No No No No No Yes Yes Yes Yes Never 35+ Yes Yes 95.873016 59.00152 6.6 Yes Yes Yes Yes Yes Yes Yes Yes
74375 12536.217 1 111 2013-2014 1 0 45 to 64 Hispanic No 63 No Women 171.3333 72.66667 SBP 160+ or DBP 100+ mm Hg taking antihypertensive medications No No No No No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes One One No No Yes No No No No No No No No No No No No No Yes Yes Yes Yes No No No No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Former <25 Yes Yes 141.176471 108.03135 11.2 No Yes No No No No No No
124796 4667.865 1 167 2017-2020 1 0 45 to 64 Non-Hispanic Black Yes 61 No Women 133.1667 75.70000 SBP of 130 to <140 or DBP 80 to <90 mm Hg taking antihypertensive medications No No Yes No Yes No Yes Yes No Yes No Yes Yes Yes Yes Yes Four or more Three Yes Yes Yes No No No No No Yes Yes No No No Yes No Yes Yes Yes Yes Yes Yes Yes Yes Yes NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Never 35+ Yes No 4.666667 85.04965 6.9 No No No No No No No No
56088 24882.496 2 75 2009-2010 1 0 18 to 44 Hispanic No 40 No Women 136.6667 70.00000 SBP of 130 to <140 or DBP 80 to <90 mm Hg taking antihypertensive medications No No Yes No Yes No Yes Yes No Yes No Yes Yes Yes Yes Yes Two Two No Yes Yes No No No No No No No No No No Yes No No Yes Yes Yes Yes No No No No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Never 35+ Yes No 4.696356 115.83310 6.7 No No No No No No No No
74958 17328.399 1 108 2013-2014 1 0 75+ Hispanic No 76 No Women 177.3333 48.00000 SBP 160+ or DBP 100+ mm Hg taking antihypertensive medications No No No No No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Two Two No Yes No No No No Yes Yes No No No No No No No No Yes Yes Yes Yes No No No No NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Former 25 to <30 Yes No 9.871795 65.38231 6.9 No No No No No No No No
89925 11090.360 1 128 2015-2016 1 0 65 to 74 Hispanic No 68 No Women 132.0000 66.00000 SBP of 130 to <140 or DBP 80 to <90 mm Hg taking antihypertensive medications No No Yes Yes Yes No Yes Yes No No No Yes Yes Yes Yes Yes Four or more Four or more No Yes Yes No No No Yes No Yes Yes No No No Yes No No Yes Yes Yes Yes Yes Yes Yes Yes NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Never 35+ Yes No 12.426778 79.01981 7.5 No No No No No No No No
subset_nhanes %>%
  ggplot(aes(x=bp_sys_mean, color=cc_diabetes)) +
    geom_histogram(fill="white", alpha=0.5, bins = 12) +
    ggtitle("Blood Pressure with Diabetes Histogram")

subset_nhanes%>%
  ggplot(aes(x=cc_diabetes, y=bp_sys_mean, fill=cc_diabetes)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Blood Pressure VS Diabetes Status") +
    xlab("")

We can use a t-test to evaluate whether or not the two populations (diabetes and non-diabetes) have significantly different distributions of blood pressures:

nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean ~ cc_diabetes, data = subset_nhanes)
diabetes_test

    Welch Two Sample t-test

data:  bp_sys_mean by cc_diabetes
t = -5.1725, df = 97.158, p-value = 1.24e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -17.043266  -7.591128
sample estimates:
 mean in group No mean in group Yes 
         122.2213          134.5385 

Here we can see that with a p-value of 0.007, there is a significant difference between the two populations. The diabetes group has a mean blood pressure of about 137, while the non-diabetes group has a mean of about 124.

Now we want to add in noise to the data to simulate making the measurements less accurate. We can achieve this by sampling from a normal distribution centered on 0 and adding the resulting value to the original data measurement. This will mask the patient’s true blood pressure value.

We will experiment with 3 values for “reliability”: the higher the value, the lower the variance of the distribution from which we sample noise, and the closer to the original data the noisy data tends to be.

First, let’s try a value of 0.3:

    reliability <- 0.3  ### Set up measurement error with 0.5 Attenuation coef
    sigma_u_sq <- 1/reliability - 1
    
    sigma_u_sq
[1] 2.333333
    sigma_u_sq^0.5
[1] 1.527525

We can see that a low reliability value results in a variance of 2.33 and a standard deviation of about 1.53. Next, let’s increase reliability to 0.5

    reliability <- 0.5
    sigma_u_sq <- 1/reliability - 1
    
    sigma_u_sq
[1] 1
    sigma_u_sq^0.5
[1] 1

When we increase the reliability, the variance and standard deviation both decrease to 1. This will ultimately result in a bit less change to the original data.

Finally, let’s look at reliability of 0.7

    reliability <- 0.7
    sigma_u_sq <- 1/reliability - 1
    
    sigma_u_sq
[1] 0.4285714
    sigma_u_sq^0.5
[1] 0.6546537

With a variance of 0.43 and standard deviation of 0.65, this reliability value creates the least noise compared to 0.3 and 0.5.

Now, let’s actually transform the data we have and visualize:

set.seed(105)
n = nrow(subset_nhanes)

reliability <- 0.0005   ### Set up measurement error with 0.5 Attenuation coef
sigma_u_sq <- 1/reliability - 1
subset_nhanes$bp_sys_mean_noise_low <- subset_nhanes$bp_sys_mean + rnorm(n, sd=sigma_u_sq^0.5)
subset_nhanes$bp_sys_mean_noise_low <- abs(subset_nhanes$bp_sys_mean_noise_low)

reliability <- 0.005
sigma_u_sq <- 1/reliability - 1
subset_nhanes$bp_sys_mean_noise_med <- subset_nhanes$bp_sys_mean + rnorm(n, sd=sigma_u_sq^0.5)

reliability <- 0.1
sigma_u_sq <- 1/reliability - 1
subset_nhanes$bp_sys_mean_noise_high <- subset_nhanes$bp_sys_mean + rnorm(n, sd=sigma_u_sq^0.5)

Let’s visualize the difference between the raw BP measurements and the measurements with new error added in:

#X axis = Blood Pressure
#Y axis = BP + Noise
#Title = Low, Moderate, High Reliability (0.25), for example
scatterplot <- ggplot(subset_nhanes, aes(x=bp_sys_mean, y=bp_sys_mean_noise_low)) + 
    geom_point(size=0.5) +
    ggtitle("Low Reliability") +
    xlab("Blood Pressure") +
    ylab("BP + Noise")
    
scatterplot + annotate("segment", x = 75, xend = 200, y = 75, yend = 200,
  colour = "red")

scatterplot <- ggplot(subset_nhanes, aes(x=bp_sys_mean, y=bp_sys_mean_noise_med)) + 
    geom_point(size=0.5) +
    ggtitle("Medium Reliability") +
    xlab("Blood Pressure") +
    ylab("BP + Error")
    
scatterplot + annotate("segment", x = 75, xend = 200, y = 75, yend = 200,
  colour = "red")

scatterplot <- ggplot(subset_nhanes, aes(x=bp_sys_mean, y=bp_sys_mean_noise_high)) + 
    geom_point(size=0.5) +
    ggtitle("High Reliability") +
    xlab("Blood Pressure") +
    ylab("BP + Error")
    
scatterplot + annotate("segment", x = 75, xend = 200, y = 75, yend = 200,
  colour = "red")

Comparing the values created by setting reliability to 0.025 and 0.25, we can see that the spread of the scatter plot is much different. The added amount of noise in the high reliability case does not shift the data points very far off from their original positions compared to the low reliability case.

We can also measure the “spread” of the noise by calculating the correlation coefficient. This will give us a numerical value for how linked the two variables are:

In some cases:

Notice that in the low reliability case, so much noise here is added that there are a few data points with a blood pressure value below 0. Data with this much error in it obviously wouldn’t be used in the real world, but for the sake of demonstrating the effectiveness of measurement error correction, we will continue to use this data.

print(paste("Reliability Low Correlation Coefficient:", cor(subset_nhanes$bp_sys_mean, subset_nhanes$bp_sys_mean_noise_low))) 
[1] "Reliability Low Correlation Coefficient: 0.371898162992643"
print(paste("Reliability Medium Correlation Coefficient:", cor(subset_nhanes$bp_sys_mean, subset_nhanes$bp_sys_mean_noise_med))) 
[1] "Reliability Medium Correlation Coefficient: 0.789073180735782"
print(paste("Reliability High Correlation Coefficient:", cor(subset_nhanes$bp_sys_mean, subset_nhanes$bp_sys_mean_noise_high)))
[1] "Reliability High Correlation Coefficient: 0.987158951684439"

Finally, let’s look again at a t-test to see if the relationship between diabetes status and noisy blood pressure is any different than the non-noisy data.

nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean_noise_low ~ cc_diabetes, data = subset_nhanes)
diabetes_test

    Welch Two Sample t-test

data:  bp_sys_mean_noise_low by cc_diabetes
t = -1.8332, df = 101.26, p-value = 0.06971
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -21.4872072   0.8470414
sample estimates:
 mean in group No mean in group Yes 
         123.0366          133.3566 
nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean_noise_med ~ cc_diabetes, data = subset_nhanes)
diabetes_test

    Welch Two Sample t-test

data:  bp_sys_mean_noise_med by cc_diabetes
t = -4.3257, df = 105.5, p-value = 3.462e-05
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -16.502997  -6.129374
sample estimates:
 mean in group No mean in group Yes 
         122.4257          133.7419 
nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean_noise_high ~ cc_diabetes, data = subset_nhanes)
diabetes_test

    Welch Two Sample t-test

data:  bp_sys_mean_noise_high by cc_diabetes
t = -5.0414, df = 96.81, p-value = 2.152e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
 -16.935141  -7.367393
sample estimates:
 mean in group No mean in group Yes 
         122.3786          134.5299 

We can see that in the low reliability case, enough noise was added that the results of the t-test are no longer significant, as the p-value is higher than 0.05. In the medium reliability case, the results are still significant, but much less so, with the p-value doubling from 0.02 to 0.04. Finally, in the high reliability case, the results of the t-test are not much different than when using the raw data. This is because the amount of noise added was quite low.

Now, we will try to fit a linear regression model to estimate the raw measurement from the “error”-full measurement.

error_model_0.3 <- glm(bp_sys_mean ~ bp_sys_mean_noise_low, data = subset_nhanes, family = 'gaussian')
summary(error_model_0.3)

Call:
glm(formula = bp_sys_mean ~ bp_sys_mean_noise_low, family = "gaussian", 
    data = subset_nhanes)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           105.08228    2.04124  51.480   <2e-16 ***
bp_sys_mean_noise_low   0.15067    0.01538   9.797   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 304.6497)

    Null deviance: 211422  on 599  degrees of freedom
Residual deviance: 182181  on 598  degrees of freedom
AIC: 5138.2

Number of Fisher Scoring iterations: 2
error_model_0.5 <- glm(bp_sys_mean ~ bp_sys_mean_noise_med, data = subset_nhanes, family = 'gaussian')
summary(error_model_0.5)

Call:
glm(formula = bp_sys_mean ~ bp_sys_mean_noise_med, family = "gaussian", 
    data = subset_nhanes)

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           44.15721    2.57965   17.12   <2e-16 ***
bp_sys_mean_noise_med  0.64300    0.02047   31.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 133.4162)

    Null deviance: 211422  on 599  degrees of freedom
Residual deviance:  79783  on 598  degrees of freedom
AIC: 4642.8

Number of Fisher Scoring iterations: 2
error_model_0.7 <- glm(bp_sys_mean ~ bp_sys_mean_noise_high, data = subset_nhanes, family = 'gaussian')
summary(error_model_0.7)

Call:
glm(formula = bp_sys_mean ~ bp_sys_mean_noise_high, family = "gaussian", 
    data = subset_nhanes)

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              2.0698     0.8149    2.54   0.0113 *  
bp_sys_mean_noise_high   0.9822     0.0065  151.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 9.021565)

    Null deviance: 211421.9  on 599  degrees of freedom
Residual deviance:   5394.9  on 598  degrees of freedom
AIC: 3026.5

Number of Fisher Scoring iterations: 2
subset_nhanes %>%
  ggplot(aes(x=bp_sys_mean_noise_low, color=cc_diabetes)) +
    geom_histogram(fill="white", alpha=0.5, bins = 12) +
    ggtitle("Blood Pressure with Diabetes Histogram")

subset_nhanes %>%
  ggplot(aes(x=cc_diabetes, y=bp_sys_mean_noise_low, fill=cc_diabetes)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Blood Pressure VS Diabetes Status") +
    xlab("")

We can see in both the histogram and especially in the box plot, the noise has made it so that the two group’s distributions are virtually indistinguishable. This mimics what may happen in the real world: although the underlying distribution of two groups may be different, error in measurement may mask this fact so that the data given looks very similar. Had we been given the noisy data and performed a t-test without accounting for this error, we would come to the incorrect conclusion that diabetes and blood pressure are not linked.

This is the main issue that measurement error seeks to correct. By using it, we can avoid drawing incorrect conclusions about our data.

#Store our dataframe:
saveRDS(subset_nhanes, "nhanes_subset.rds")