Regression and Correlation Examples
library(tidyverse)
Worked examples of regression and correlation, taken from the chapter 9 problems in Daniel.
Example Dean Parmalee wished to know if the year-end grades assigned to Wright State University Medical School students are predictive of their second-year board scores. The following table shows, for 89 students, the year-end score (AVG, in percent of 100) and the score on the second-year medical board examination (BOARD). The data are in REV_C09_17.csv. Perform a complete regression analysis with AVG as the independent variable. Let \(\alpha = .05\) for all tests.
Answer Describe the data, then fit the linear model BOARD ~ AVG with \(\alpha = 0.05\).
Scores <- read_csv("DataSets/ch09_all/REV_C09_17.csv", show_col_types = FALSE)
Scores
# A tibble: 89 × 2
AVG BOARD
<dbl> <dbl>
1 95.7 257
2 94.0 256
3 91.5 242
4 91.5 223
5 91.1 241
6 90.9 234
7 90.8 226
8 90.6 236
9 90.3 250
10 90.3 226
# … with 79 more rows
ggplot(Scores, aes(x = AVG, y = BOARD)) +
geom_point() +
geom_hline(aes(yintercept = mean(BOARD)), linetype = "dashed", color = "red") +
geom_vline(aes(xintercept = mean(AVG)), linetype = "dashed", color = "red")
Scores_lm <- lm(BOARD ~ AVG, data = Scores)
summary(Scores_lm)
Call:
lm(formula = BOARD ~ AVG, data = Scores)
Residuals:
Min 1Q Median 3Q Max
-28.931 -8.150 2.397 7.193 39.441
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -191.0296 22.9342 -8.329 1.06e-12 ***
AVG 4.6815 0.2727 17.169 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.49 on 87 degrees of freedom
Multiple R-squared: 0.7721, Adjusted R-squared: 0.7695
F-statistic: 294.8 on 1 and 87 DF, p-value: < 2.2e-16
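In simple linear regression the overall F statistic is the square of the slope's t statistic, and R-squared is the square of the Pearson correlation between the two variables. The following minimal sketch checks this against the rounded values printed in the summary above. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the printed summary of Scores_lm (rounded as shown there)
t_slope   <- 17.169   # t value for AVG
f_printed <- 294.8    # F-statistic
r_squared <- 0.7721   # Multiple R-squared

# With a single predictor, F equals t^2 (up to rounding of the printed values)
f_from_t <- t_slope^2

# The slope is positive, so r is the positive square root of R-squared
r_from_r2 <- sqrt(r_squared)

print(c(F_from_t = f_from_t, F_printed = f_printed, r = r_from_r2))
```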
confint(Scores_lm)
2.5 % 97.5 %
(Intercept) -236.613708 -145.445446
AVG 4.139525 5.223459
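The limits returned by confint() are the estimate plus or minus a t quantile times the standard error, with the residual degrees of freedom. The sketch below reproduces the AVG interval from the rounded values in the summary. The helper name ci_from_summary is hypothetical.

```r
# Hypothetical helper: confidence interval from an estimate and its standard error
ci_from_summary <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)   # two-sided critical value
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Slope estimate, standard error and residual df as printed in the summary
avg_ci <- ci_from_summary(estimate = 4.6815, std_error = 0.2727, df = 87)
print(avg_ci)
```

Because the inputs are rounded, the result agrees with the confint() output only up to rounding error.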
ggplot(Scores, aes(x = AVG, y = BOARD)) +
geom_point() +
geom_hline(aes(yintercept = mean(BOARD)), linetype = "dashed", color = "red") +
geom_vline(aes(xintercept = mean(AVG)), linetype = "dashed", color = "red") +
geom_abline(intercept = coef(Scores_lm)[1], slope = coef(Scores_lm)[2], color = "blue")
plot(Scores_lm)
Example The data in REV_C09_25.csv were collected during an experiment in which laboratory animals were inoculated with a pathogen. The variables are time in hours after inoculation and temperature in degrees Celsius. Find the simple linear regression equation and test \(H_0 : \beta_2 = 0\). Also test \(H_0 : \rho = 0\) and construct a 95% confidence interval for \(\rho\). Construct the 95% prediction interval for the temperature at 50 hours after inoculation. Let \(\alpha = .05\) for all the tests.
Answer We fit the linear model with lm to determine whether we can reject the null hypothesis \(H_0 : \beta_2 = 0\) in favor of \(H_a : \beta_2 \ne 0\), that is, whether the model exists and its slope \(\beta_2\) differs from zero. To test the hypothesis about \(\rho\) we run the correlation test at the .05 level.
Temp <- read_csv("DataSets/ch09_all/REV_C09_25.csv", show_col_types = FALSE)
Temp
# A tibble: 10 × 2
TIME TEMP
<dbl> <dbl>
1 24 38.8
2 28 39.5
3 32 40.3
4 36 40.7
5 40 41
6 44 41.1
7 48 41.4
8 52 41.6
9 56 41.8
10 60 41.9
ggplot(Temp, aes(x = TIME, y = TEMP)) +
geom_point(color = "green") +
geom_hline(aes(yintercept = mean(TEMP)), linetype = "dashed", color = "red") +
geom_vline(aes(xintercept = mean(TIME)), linetype = "dashed", color = "red")
Temp_lm <- lm(TEMP ~ TIME, data = Temp)
summary(Temp_lm)
Call:
lm(formula = TEMP ~ TIME, data = Temp)
Residuals:
Min 1Q Median 3Q Max
-0.57273 -0.17606 0.05121 0.24894 0.36909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.456364 0.395901 94.610 1.74e-13 ***
TIME 0.079848 0.009092 8.782 2.22e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3303 on 8 degrees of freedom
Multiple R-squared: 0.906, Adjusted R-squared: 0.8943
F-statistic: 77.13 on 1 and 8 DF, p-value: 2.218e-05
# Confidence intervals for the estimated model parameters
confint(Temp_lm, level = 0.95)
2.5 % 97.5 %
(Intercept) 36.5434139 38.3693134
TIME 0.0588819 0.1008151
# Estimate the model response at a given point, TIME = 50 h, with the 95% confidence interval for the mean response
new_TIME <- data.frame(TIME = 50)
predict(Temp_lm, newdata = new_TIME, interval = "confidence", level = 0.95)
fit lwr upr
1 41.44879 41.15526 41.74232
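The exercise asks for a prediction interval, while the call above uses interval = "confidence", which bounds the mean temperature at 50 hours rather than a single new measurement. The following sketch shows how both could be obtained side by side. It rebuilds the data frame from the ten rows printed above, on the assumption that the printed values are exact, so it runs without the CSV file.

```r
# Assumption: the ten rows printed in the Temp tibble are the exact data
Temp <- data.frame(
  TIME = seq(24, 60, by = 4),
  TEMP = c(38.8, 39.5, 40.3, 40.7, 41.0, 41.1, 41.4, 41.6, 41.8, 41.9)
)
Temp_lm  <- lm(TEMP ~ TIME, data = Temp)
new_TIME <- data.frame(TIME = 50)

# Interval for the mean response at TIME = 50
ci_50 <- predict(Temp_lm, newdata = new_TIME,
                 interval = "confidence", level = 0.95)

# Interval for one new observation at TIME = 50 (adds the residual variance)
pi_50 <- predict(Temp_lm, newdata = new_TIME,
                 interval = "prediction", level = 0.95)

print(rbind(confidence = ci_50[1, ], prediction = pi_50[1, ]))
```

The prediction interval is always wider than the confidence interval at the same point, because it accounts for the scatter of individual observations around the fitted line.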
ggplot(Temp, aes(x = TIME, y = TEMP)) +
geom_point(color = "green") +
geom_hline(aes(yintercept = mean(TEMP)), linetype = "dashed", color = "red") +
geom_vline(aes(xintercept = mean(TIME)), linetype = "dashed", color = "red") +
geom_smooth(method = lm, color = "blue")
# Test H0: rho = 0 and construct a 95% confidence interval for rho
cor.test(Temp$TIME, Temp$TEMP, conf.level = 0.95)
Pearson's product-moment correlation
data: Temp$TIME and Temp$TEMP
t = 8.7821, df = 8, p-value = 2.218e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8041770 0.9888496
sample estimates:
cor
0.9518514
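The t statistic and confidence interval reported by cor.test() can be recovered from the sample correlation and the sample size alone: the test uses \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) with \(n-2\) degrees of freedom, and the interval comes from Fisher's z transformation. The sketch below uses the values printed above. The variable names are illustrative.

```r
r <- 0.9518514   # sample correlation printed above
n <- 10          # number of (TIME, TEMP) pairs

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation
z      <- atanh(r)
z_crit <- qnorm(0.975)
ci_rho <- tanh(z + c(-1, 1) * z_crit / sqrt(n - 3))

print(c(t = t_stat, p = p_value, lower = ci_rho[1], upper = ci_rho[2]))
```

The t statistic here is the same as the t value of the TIME slope in the regression summary, which is why both tests give the same p-value.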
Example Reiss et al. compared point-of-care and standard hospital laboratory assays for monitoring patients receiving a single anticoagulant or a regimen consisting of a combination of anticoagulants. It is quite common, when comparing two measuring techniques, to use regression analysis in which one variable is used to predict another. In the present study, the researchers obtained measures of international normalized ratio (INR) by assay of capillary and venous blood samples collected from 90 subjects taking warfarin. INR, used especially when patients are receiving warfarin, measures the clotting ability of the blood. Point-of-care testing for INR was conducted with the CoaguChek assay product. Hospital testing was done with standard hospital laboratory assays. The authors used the hospital assay INR level to predict the CoaguChek INR level. The measurements are given in EXR_C09_S03_04.csv.
Answer We are asked to fit a linear model for the CoaguChek measurements (response variable) as a function of the hospital laboratory measurements (predictor of INR), and to test whether we can reject the null hypothesis \(H_0 : \beta_2 = 0\) for the model Y(CoaguCheck) ~ X(Hospital). We estimate the model parameters and compute their 95% confidence intervals. The plot of the linear model includes the error band of the fitted line.
INR <- read_csv("DataSets/ch09_all/EXR_C09_S03_04.csv", show_col_types = FALSE)
INR
# A tibble: 90 × 2
Y X
<dbl> <dbl>
1 1.8 1.6
2 1.6 1.9
3 2.5 2.8
4 1.9 2.4
5 1.3 1.5
6 2.3 1.8
7 1.2 1.3
8 2.3 2.4
9 2 2.1
10 1.5 1.5
# … with 80 more rows
INR <- INR %>%
  rename(CoaguCheck = Y, Hospital = X)
ggplot(INR, aes(x = Hospital, y = CoaguCheck)) +
geom_point(color = "green") +
geom_hline(aes(yintercept = mean(CoaguCheck)), linetype = "dashed", color = "red") +
geom_vline(aes(xintercept = mean(Hospital)), linetype = "dashed", color = "red")
INR_lm <- lm(CoaguCheck ~ Hospital, data = INR)
summary(INR_lm)
Call:
lm(formula = CoaguCheck ~ Hospital, data = INR)
Residuals:
Min 1Q Median 3Q Max
-2.7248 -0.3357 -0.1341 0.1306 2.0040
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.48848 0.18167 2.689 0.00858 **
Hospital 0.86251 0.08972 9.613 2.24e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.64 on 88 degrees of freedom
Multiple R-squared: 0.5122, Adjusted R-squared: 0.5067
F-statistic: 92.41 on 1 and 88 DF, p-value: 2.244e-15
confint(INR_lm, level = 0.95)
2.5 % 97.5 %
(Intercept) 0.1274485 0.8495109
Hospital 0.6842054 1.0408162
ggplot(INR, aes(x = Hospital, y = CoaguCheck)) +
geom_point(color = "green") +
geom_smooth(method = lm, color = "blue")
Example Another variable of interest in the study by Reiss et al. (see last example) was partial thromboplastin (aPTT), the standard test used to monitor heparin anticoagulation. Use the data in the following table to examine the correlation between aPTT levels as measured by the CoaguCheck point-of-care assay and standard laboratory hospital assay in 90 subjects receiving heparin alone, heparin with warfarin, and warfarin and exoenoxaparin (table EXR_C09_S07_02.csv).
Answer To test whether the null hypothesis \(H_0 : \rho = 0\) can be rejected, we run the correlation test on the aPTT data.
aPTT <- read_csv("DataSets/ch09_all/EXR_C09_S07_02.csv", show_col_types = FALSE)
aPTT
# A tibble: 90 × 2
COAGU HOSP
<dbl> <dbl>
1 49.3 71.4
2 57.9 86.4
3 59 75.6
4 77.3 54.5
5 42.3 57.7
6 44.3 59.5
7 90 77.2
8 55.4 63.3
9 20.3 27.6
10 28.7 52.6
# … with 80 more rows
cor.test(aPTT$HOSP, aPTT$COAGU, method = "pearson", conf.level = 0.95)
Pearson's product-moment correlation
data: aPTT$HOSP and aPTT$COAGU
t = 10.17, df = 88, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6227350 0.8176615
sample estimates:
cor
0.735034