library(dplyr)
library(tidyverse)
library(tibble)
library(ggplot2)
library(knitr)
library(hrbrthemes)
library(viridis)
library(devtools)
#using data specified in this github repository:
install_github("jhs-hwg/cardioStatsUSA")
library(cardioStatsUSA)
5 Creating “Noisy” Data
#to prevent errors, exclude the rows with na
used_vars = c('cc_diabetes', 'bp_sys_mean', 'demo_age_years', 'demo_race', 'demo_gender', 'cc_bmi', 'cc_smoke', 'bp_med_use')
clean_nhanes <- nhanes_data[complete.cases(nhanes_data[,..used_vars]), ]
In normal usage of measurement error techniques, the data are assumed to contain error arising from the measurement of the variables, which we aim to remedy. In our case, we assume that the NHANES data has no measurement error, so we will instead simulate error by adding random noise to the existing data to create a “noisy” dataset.
First, let’s remind ourselves of the relationship between blood pressure and diabetes visually:
clean_nhanes %>%
  ggplot(aes(x=bp_sys_mean, color=cc_diabetes)) +
  geom_histogram(fill="white", alpha=0.5, bins = 80) +
  ggtitle("Blood Pressure with Diabetes Histogram")

clean_nhanes %>%
  ggplot(aes(x=cc_diabetes, y=bp_sys_mean, fill=cc_diabetes)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  ggtitle("Blood Pressure VS Diabetes Status") +
  xlab("")
With this much data, the true statistic (in this case, the mean blood pressure values for both groups) will be clear no matter how much noise we add. However, in normal circumstances we would not have this much data. We can instead mimic more realistic scenarios by taking a smaller subset of this data to examine. The goal will be to add noise to obscure the relationship between diabetes and blood pressure, and then use measurement error correction to rediscover the true relationship.
Let’s start by taking a subset of n=600. Since about 13% of individuals in the full data set had diabetes, we will keep this ratio similar here (78 with diabetes and 522 without).
set.seed(19)
#original: seed 52, subset yes = 20, subset no = 130
subset_diab_yes <- subset(clean_nhanes, cc_diabetes == "Yes")
subset_diab_no <- subset(clean_nhanes, cc_diabetes == "No")

sample_diab_yes <- subset_diab_yes[sample(1:nrow(subset_diab_yes), 78, replace=FALSE),]
sample_diab_no <- subset_diab_no[sample(1:nrow(subset_diab_no), 522, replace=FALSE),]

subset_nhanes <- rbind(sample_diab_yes, sample_diab_no)
head(subset_nhanes)
svy_id | svy_weight_mec | svy_psu | svy_strata | svy_year | svy_subpop_htn | svy_subpop_chol | demo_age_cat | demo_race | demo_race_black | demo_age_years | demo_pregnant | demo_gender | bp_sys_mean | bp_dia_mean | bp_cat_meds_excluded | bp_cat_meds_included | bp_control_jnc7 | bp_control_accaha | bp_control_escesh_1 | bp_control_escesh_2 | bp_control_140_90 | bp_control_130_80 | bp_uncontrolled_jnc7 | bp_uncontrolled_accaha | bp_uncontrolled_escesh_1 | bp_uncontrolled_escesh_2 | bp_uncontrolled_140_90 | bp_uncontrolled_130_80 | bp_med_use | bp_med_recommended_jnc7 | bp_med_recommended_accaha | bp_med_recommended_escesh | bp_med_n_class | bp_med_n_pills | bp_med_combination | bp_med_pills_gteq_2 | bp_med_ace | bp_med_aldo | bp_med_alpha | bp_med_angioten | bp_med_beta | bp_med_central | bp_med_ccb | bp_med_ccb_dh | bp_med_ccb_ndh | bp_med_diur_Ksparing | bp_med_diur_loop | bp_med_diur_thz | bp_med_renin_inhibitors | bp_med_vasod | htn_jnc7 | htn_accaha | htn_escesh | htn_aware | htn_resistant_jnc7 | htn_resistant_accaha | htn_resistant_jnc7_thz | htn_resistant_accaha_thz | chol_measured_never | chol_measured_last | chol_total | chol_total_gteq_200 | chol_total_gteq_240 | chol_hdl | chol_hdl_low | chol_trig | chol_trig_gteq_150 | chol_ldl | chol_ldl_5cat | chol_ldl_lt_70 | chol_ldl_gteq_70 | chol_ldl_lt_100 | chol_ldl_gteq_100 | chol_ldl_gteq_190 | chol_ldl_persistent | chol_nonhdl | chol_nonhdl_5cat | chol_nonhdl_lt_100 | chol_nonhdl_gteq_100 | chol_nonhdl_gteq_220 | chol_med_use | chol_med_use_sr | chol_med_statin | chol_med_ezetimibe | chol_med_pcsk9i | chol_med_bile | chol_med_fibric_acid | chol_med_atorvastatin | chol_med_simvastatin | chol_med_rosuvastatin | chol_med_pravastatin | chol_med_pitavastatin | chol_med_fluvastatin | chol_med_lovastatin | chol_med_other | chol_med_addon_use | chol_med_addon_recommended_ahaacc | chol_med_statin_recommended_ahaacc | chol_med_recommended_ever | ascvd_risk_vh_ahaacc | cc_smoke | cc_bmi | cc_diabetes | cc_ckd | cc_acr | cc_egfr | cc_hba1c | cc_egfr_lt60 | cc_acr_gteq30 | cc_cvd_mi | cc_cvd_chd | cc_cvd_stroke | cc_cvd_ascvd | cc_cvd_hf | cc_cvd_any |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37436 | 22062.121 | 2 | 47 | 2005-2006 | 1 | 1 | 75+ | Non-Hispanic White | No | 77 | No | Men | 134.0000 | 60.66667 | SBP of 130 to <140 or DBP 80 to <90 mm Hg | taking antihypertensive medications | No | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | Yes | Three | Three | No | Yes | Yes | No | Yes | No | No | No | No | No | No | No | Yes | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Cholesterol has been measured previously | In the past year | 136 | No | No | 23 | Yes | 154 | Yes | 84.21057 | 70 to <100 mg/dL | No | Yes | Yes | No | No | No | 113 | 100 to <130 mg/dL | No | Yes | No | Yes | Yes | No | No | No | No | Yes | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Never | 35+ | Yes | Yes | 95.873016 | 59.00152 | 6.6 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
74375 | 12536.217 | 1 | 111 | 2013-2014 | 1 | 0 | 45 to 64 | Hispanic | No | 63 | No | Women | 171.3333 | 72.66667 | SBP 160+ or DBP 100+ mm Hg | taking antihypertensive medications | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | One | One | No | No | Yes | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | No | No | No | No | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Former | <25 | Yes | Yes | 141.176471 | 108.03135 | 11.2 | No | Yes | No | No | No | No | No | No |
124796 | 4667.865 | 1 | 167 | 2017-2020 | 1 | 0 | 45 to 64 | Non-Hispanic Black | Yes | 61 | No | Women | 133.1667 | 75.70000 | SBP of 130 to <140 or DBP 80 to <90 mm Hg | taking antihypertensive medications | No | No | Yes | No | Yes | No | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Four or more | Three | Yes | Yes | Yes | No | No | No | No | No | Yes | Yes | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Never | 35+ | Yes | No | 4.666667 | 85.04965 | 6.9 | No | No | No | No | No | No | No | No |
56088 | 24882.496 | 2 | 75 | 2009-2010 | 1 | 0 | 18 to 44 | Hispanic | No | 40 | No | Women | 136.6667 | 70.00000 | SBP of 130 to <140 or DBP 80 to <90 mm Hg | taking antihypertensive medications | No | No | Yes | No | Yes | No | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Two | Two | No | Yes | Yes | No | No | No | No | No | No | No | No | No | No | Yes | No | No | Yes | Yes | Yes | Yes | No | No | No | No | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Never | 35+ | Yes | No | 4.696356 | 115.83310 | 6.7 | No | No | No | No | No | No | No | No |
74958 | 17328.399 | 1 | 108 | 2013-2014 | 1 | 0 | 75+ | Hispanic | No | 76 | No | Women | 177.3333 | 48.00000 | SBP 160+ or DBP 100+ mm Hg | taking antihypertensive medications | No | No | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Two | Two | No | Yes | No | No | No | No | Yes | Yes | No | No | No | No | No | No | No | No | Yes | Yes | Yes | Yes | No | No | No | No | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Former | 25 to <30 | Yes | No | 9.871795 | 65.38231 | 6.9 | No | No | No | No | No | No | No | No |
89925 | 11090.360 | 1 | 128 | 2015-2016 | 1 | 0 | 65 to 74 | Hispanic | No | 68 | No | Women | 132.0000 | 66.00000 | SBP of 130 to <140 or DBP 80 to <90 mm Hg | taking antihypertensive medications | No | No | Yes | Yes | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | Yes | Yes | Four or more | Four or more | No | Yes | Yes | No | No | No | Yes | No | Yes | Yes | No | No | No | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Never | 35+ | Yes | No | 12.426778 | 79.01981 | 7.5 | No | No | No | No | No | No | No | No |
subset_nhanes %>%
  ggplot(aes(x=bp_sys_mean, color=cc_diabetes)) +
  geom_histogram(fill="white", alpha=0.5, bins = 12) +
  ggtitle("Blood Pressure with Diabetes Histogram")

subset_nhanes %>%
  ggplot(aes(x=cc_diabetes, y=bp_sys_mean, fill=cc_diabetes)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  ggtitle("Blood Pressure VS Diabetes Status") +
  xlab("")
We can use a t-test to evaluate whether the two groups (diabetes and non-diabetes) have significantly different mean blood pressures:
nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean ~ cc_diabetes, data = subset_nhanes)
diabetes_test
Welch Two Sample t-test
data: bp_sys_mean by cc_diabetes
t = -5.1725, df = 97.158, p-value = 1.24e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-17.043266 -7.591128
sample estimates:
mean in group No mean in group Yes
122.2213 134.5385
Here we can see that with a p-value of about 1.2e-06, there is a significant difference between the two groups. The diabetes group has a mean systolic blood pressure of about 134.5, while the non-diabetes group has a mean of about 122.2.
Now we want to add in noise to the data to simulate making the measurements less accurate. We can achieve this by sampling from a normal distribution centered on 0 and adding the resulting value to the original data measurement. This will mask the patient’s true blood pressure value.
We will experiment with 3 values for “reliability”: the higher the value, the lower the variance of the distribution from which we sample noise, and the closer to the original data the noisy data tends to be.
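The link between reliability and noise variance that the code in this section relies on can be written out explicitly. This is a minimal sketch, assuming the classical additive error model $W = X + U$ with $U$ independent of $X$, and assuming (as the formula `1/reliability - 1` implies) that the true variable is scaled to unit variance. The symbols $\lambda$, $\sigma_X^2$, and $\sigma_U^2$ are notation introduced here for illustration.

```latex
\lambda \;=\; \frac{\sigma_X^2}{\sigma_X^2 + \sigma_U^2},
\qquad
\sigma_X^2 = 1 \;\Longrightarrow\; \sigma_U^2 \;=\; \frac{1}{\lambda} - 1 .
```

For example, $\lambda = 0.5$ gives $\sigma_U^2 = 1$. For a variable that is not standardized, the same definition gives $\sigma_U^2 = \sigma_X^2\,(1/\lambda - 1)$, so the nominal reliability and the effective reliability on the original scale can differ.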
First, let’s try a value of 0.3:
reliability <- 0.3 ### Set up measurement error with a reliability (attenuation coefficient) of 0.3
sigma_u_sq <- 1/reliability - 1
sigma_u_sq
[1] 2.333333
sigma_u_sq^0.5
[1] 1.527525
We can see that a low reliability value results in a variance of 2.33 and a standard deviation of about 1.53. Next, let’s increase reliability to 0.5:
reliability <- 0.5
sigma_u_sq <- 1/reliability - 1
sigma_u_sq
[1] 1
sigma_u_sq^0.5
[1] 1
When we increase the reliability, the variance and standard deviation both decrease to 1. This will ultimately result in a bit less change to the original data.
Finally, let’s look at a reliability of 0.7:
reliability <- 0.7
sigma_u_sq <- 1/reliability - 1
sigma_u_sq
[1] 0.4285714
sigma_u_sq^0.5
[1] 0.6546537
With a variance of 0.43 and standard deviation of 0.65, this reliability value creates the least noise compared to 0.3 and 0.5.
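The three calculations above can be collected in one place. The following is a small sketch, not part of the original analysis: `noise_variance()` and `noise_table` are hypothetical names introduced here, and the function simply wraps the same `1/reliability - 1` formula.

```r
# Hypothetical helper: noise variance implied by a reliability value,
# using the same formula as the steps above (1/reliability - 1).
noise_variance <- function(reliability) {
  stopifnot(all(reliability > 0), all(reliability <= 1))
  1 / reliability - 1
}

# Tabulate variance and standard deviation for the three values tried above
reliabilities <- c(0.3, 0.5, 0.7)
noise_table <- data.frame(
  reliability = reliabilities,
  sigma_u_sq  = noise_variance(reliabilities),
  sigma_u     = sqrt(noise_variance(reliabilities))
)
print(noise_table)
```

Wrapping the formula in a function makes it easier to try other reliability values without repeating the assignment steps.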
Now, let’s actually transform the data we have and visualize:
set.seed(105)
n = nrow(subset_nhanes)

reliability <- 0.0005 ### Set up measurement error with a reliability (attenuation coefficient) of 0.0005
sigma_u_sq <- 1/reliability - 1
subset_nhanes$bp_sys_mean_noise_low <- subset_nhanes$bp_sys_mean + rnorm(n, sd=sigma_u_sq^0.5)
# Take the absolute value so that no noisy blood pressure is negative
subset_nhanes$bp_sys_mean_noise_low <- abs(subset_nhanes$bp_sys_mean_noise_low)

reliability <- 0.005
sigma_u_sq <- 1/reliability - 1
subset_nhanes$bp_sys_mean_noise_med <- subset_nhanes$bp_sys_mean + rnorm(n, sd=sigma_u_sq^0.5)

reliability <- 0.1
sigma_u_sq <- 1/reliability - 1
subset_nhanes$bp_sys_mean_noise_high <- subset_nhanes$bp_sys_mean + rnorm(n, sd=sigma_u_sq^0.5)
Let’s visualize the difference between the raw BP measurements and the measurements with new error added in:
#X axis = Blood Pressure
#Y axis = BP + Noise
#Title = Low, Moderate, High Reliability (0.25), for example
scatterplot <- ggplot(subset_nhanes, aes(x=bp_sys_mean, y=bp_sys_mean_noise_low)) +
  geom_point(size=0.5) +
  ggtitle("Low Reliability") +
  xlab("Blood Pressure") +
  ylab("BP + Noise")
scatterplot + annotate("segment", x = 75, xend = 200, y = 75, yend = 200,
  colour = "red")

scatterplot <- ggplot(subset_nhanes, aes(x=bp_sys_mean, y=bp_sys_mean_noise_med)) +
  geom_point(size=0.5) +
  ggtitle("Medium Reliability") +
  xlab("Blood Pressure") +
  ylab("BP + Error")
scatterplot + annotate("segment", x = 75, xend = 200, y = 75, yend = 200,
  colour = "red")

scatterplot <- ggplot(subset_nhanes, aes(x=bp_sys_mean, y=bp_sys_mean_noise_high)) +
  geom_point(size=0.5) +
  ggtitle("High Reliability") +
  xlab("Blood Pressure") +
  ylab("BP + Error")
scatterplot + annotate("segment", x = 75, xend = 200, y = 75, yend = 200,
  colour = "red")
Comparing the values created by setting reliability to 0.0005 and 0.1, we can see that the spread of the scatter plot is very different. In the high reliability case, the added noise does not shift the data points far from their original positions, whereas in the low reliability case it does.
Notice that in the low reliability case, so much noise is added that in some cases the noisy blood pressure value would fall below 0, which is why the code takes the absolute value of that column. Data with this much error in it obviously wouldn’t be used in the real world, but for the sake of demonstrating the effectiveness of measurement error correction, we will continue to use this data.
We can also measure the “spread” of the noise by calculating the correlation coefficient between the raw and noisy measurements. This will give us a numerical value for how closely linked the two variables are:
print(paste("Reliability Low Correlation Coefficient:", cor(subset_nhanes$bp_sys_mean, subset_nhanes$bp_sys_mean_noise_low)))
[1] "Reliability Low Correlation Coefficient: 0.371898162992643"
print(paste("Reliability Medium Correlation Coefficient:", cor(subset_nhanes$bp_sys_mean, subset_nhanes$bp_sys_mean_noise_med)))
[1] "Reliability Medium Correlation Coefficient: 0.789073180735782"
print(paste("Reliability High Correlation Coefficient:", cor(subset_nhanes$bp_sys_mean, subset_nhanes$bp_sys_mean_noise_high)))
[1] "Reliability High Correlation Coefficient: 0.987158951684439"
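Under an additive model with noise independent of the true value, the correlation between the raw and noisy measurements is the square root of the ratio of true variance to total variance. The sketch below checks this on simulated data rather than on NHANES: the mean of 125 and standard deviation of 19 are illustrative assumptions, and `x`, `w`, `expected_cor`, and `observed_cor` are names introduced here.

```r
set.seed(1)
n <- 10000

# Simulated stand-in for a "true" blood pressure (assumed values, not NHANES)
x <- rnorm(n, mean = 125, sd = 19)

# Noise variance from the same formula used for the medium case
sigma_u_sq <- 1 / 0.005 - 1
w <- x + rnorm(n, sd = sqrt(sigma_u_sq))

# Correlation predicted by the additive model versus the observed one
expected_cor <- sqrt(var(x) / (var(x) + sigma_u_sq))
observed_cor <- cor(x, w)
print(c(expected = expected_cor, observed = observed_cor))
```

Because the blood pressure values are not standardized, the correlation depends on the variance of the raw measurements as well as on the reliability value passed to the formula.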
Finally, let’s look again at a t-test to see if the relationship between diabetes status and noisy blood pressure is any different than the non-noisy data.
nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean_noise_low ~ cc_diabetes, data = subset_nhanes)
diabetes_test
Welch Two Sample t-test
data: bp_sys_mean_noise_low by cc_diabetes
t = -1.8332, df = 101.26, p-value = 0.06971
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-21.4872072 0.8470414
sample estimates:
mean in group No mean in group Yes
123.0366 133.3566
nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean_noise_med ~ cc_diabetes, data = subset_nhanes)
diabetes_test
Welch Two Sample t-test
data: bp_sys_mean_noise_med by cc_diabetes
t = -4.3257, df = 105.5, p-value = 3.462e-05
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-16.502997 -6.129374
sample estimates:
mean in group No mean in group Yes
122.4257 133.7419
nhanes_sys_diabetes <- subset_nhanes %>% select(cc_diabetes, bp_sys_mean) %>% drop_na(bp_sys_mean) %>% drop_na(cc_diabetes)
diabetes_test <- t.test(bp_sys_mean_noise_high ~ cc_diabetes, data = subset_nhanes)
diabetes_test
Welch Two Sample t-test
data: bp_sys_mean_noise_high by cc_diabetes
t = -5.0414, df = 96.81, p-value = 2.152e-06
alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
95 percent confidence interval:
-16.935141 -7.367393
sample estimates:
mean in group No mean in group Yes
122.3786 134.5299
We can see that in the low reliability case, enough noise was added that the results of the t-test are no longer significant at the 0.05 level (p-value of about 0.07). In the medium reliability case, the results are still significant, but less strongly so: the p-value rises from about 1.2e-06 on the raw data to about 3.5e-05. Finally, in the high reliability case, the results of the t-test are not much different from those on the raw data (p-value of about 2.2e-06). This is because the amount of noise added was quite low.
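The pattern in these t-tests, where more noise tends to push the p-value upward, can be explored on simulated data. This is a hedged sketch only: the group means (122 and 134), the standard deviation (18), and the noise levels in `noise_sd` are illustrative assumptions chosen to resemble the scale of this example, not values taken from the analysis.

```r
set.seed(2)

# Two simulated groups with the same sizes as the subset (522 "No", 78 "Yes")
group <- factor(rep(c("No", "Yes"), times = c(522, 78)))
bp <- rnorm(600, mean = ifelse(group == "Yes", 134, 122), sd = 18)

# Assumed noise standard deviations, from no noise to heavy noise
noise_sd <- c(none = 0, high = 3, med = 14, low = 45)

# Welch t-test p-value for the group difference at each noise level
p_values <- sapply(noise_sd, function(s) {
  noisy <- bp + rnorm(length(bp), sd = s)
  t.test(noisy ~ group)$p.value
})
print(signif(p_values, 3))
```

The added noise leaves the expected difference in group means unchanged but inflates the within-group variance, which is what weakens the test.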
Now, we will fit a linear regression model to estimate the raw measurement from each noisy (error-prone) measurement.
error_model_0.3 <- glm(bp_sys_mean ~ bp_sys_mean_noise_low, data = subset_nhanes, family = 'gaussian')
summary(error_model_0.3)
Call:
glm(formula = bp_sys_mean ~ bp_sys_mean_noise_low, family = "gaussian",
data = subset_nhanes)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.08228 2.04124 51.480 <2e-16 ***
bp_sys_mean_noise_low 0.15067 0.01538 9.797 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 304.6497)
Null deviance: 211422 on 599 degrees of freedom
Residual deviance: 182181 on 598 degrees of freedom
AIC: 5138.2
Number of Fisher Scoring iterations: 2
error_model_0.5 <- glm(bp_sys_mean ~ bp_sys_mean_noise_med, data = subset_nhanes, family = 'gaussian')
summary(error_model_0.5)
Call:
glm(formula = bp_sys_mean ~ bp_sys_mean_noise_med, family = "gaussian",
data = subset_nhanes)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.15721 2.57965 17.12 <2e-16 ***
bp_sys_mean_noise_med 0.64300 0.02047 31.41 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 133.4162)
Null deviance: 211422 on 599 degrees of freedom
Residual deviance: 79783 on 598 degrees of freedom
AIC: 4642.8
Number of Fisher Scoring iterations: 2
error_model_0.7 <- glm(bp_sys_mean ~ bp_sys_mean_noise_high, data = subset_nhanes, family = 'gaussian')
summary(error_model_0.7)
Call:
glm(formula = bp_sys_mean ~ bp_sys_mean_noise_high, family = "gaussian",
data = subset_nhanes)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.0698 0.8149 2.54 0.0113 *
bp_sys_mean_noise_high 0.9822 0.0065 151.12 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 9.021565)
Null deviance: 211421.9 on 599 degrees of freedom
Residual deviance: 5394.9 on 598 degrees of freedom
AIC: 3026.5
Number of Fisher Scoring iterations: 2
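One way to read the slopes in these models (about 0.15, 0.64, and 0.98) is as estimates of the effective reliability on the original scale: under the classical additive error model, the slope from regressing the true value on the noisy value equals the ratio of true variance to total variance. The sketch below illustrates this on simulated data. The mean of 125, standard deviation of 19, and noise variance of 199 are assumptions for illustration, and `lm()` is used in place of `glm(..., family = 'gaussian')` since the two fit the same model.

```r
set.seed(3)
n <- 5000

# Simulated "true" values and a noisy version (assumed values, not NHANES)
x <- rnorm(n, mean = 125, sd = 19)
sigma_u_sq <- 199
w <- x + rnorm(n, sd = sqrt(sigma_u_sq))

# Regress the true value on the noisy value, as in the models above
fit <- lm(x ~ w)

# Ratio of true variance to total variance implied by the additive model
implied_reliability <- var(x) / (var(x) + sigma_u_sq)
print(c(slope = unname(coef(fit)["w"]), implied = implied_reliability))
```

A slope well below 1 indicates that a large share of the variation in the noisy measurement is noise rather than signal.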
subset_nhanes %>%
  ggplot(aes(x=bp_sys_mean_noise_low, color=cc_diabetes)) +
  geom_histogram(fill="white", alpha=0.5, bins = 12) +
  ggtitle("Blood Pressure with Diabetes Histogram")

subset_nhanes %>%
  ggplot(aes(x=cc_diabetes, y=bp_sys_mean_noise_low, fill=cc_diabetes)) +
  geom_boxplot() +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  theme_ipsum() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)
  ) +
  ggtitle("Blood Pressure VS Diabetes Status") +
  xlab("")
We can see in both the histogram and especially in the box plot that the noise has made the two groups’ distributions virtually indistinguishable. This mimics what may happen in the real world: although the underlying distributions of two groups may be different, error in measurement may mask this fact so that the observed data look very similar. Had we been given the noisy data and performed a t-test without accounting for this error, we would have come to the incorrect conclusion that diabetes and blood pressure are not linked.
This is the main issue that measurement error correction seeks to address. By using it, we can avoid drawing incorrect conclusions about our data.
#Store our dataframe:
saveRDS(subset_nhanes, "nhanes_subset.rds")