Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. This makes it important for health facilities to be able to predict occurrences of heart diseases in order to make plans for early treatment (Assmann et al. 2005). Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.
People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.
This dataset contains 918 observations with 12 features. The data dictionary is as follows:
#Import relevant packages
library(tidyverse)
library(janitor)
#library(readr)
library(plotly)
library(knitr)
heart_disease <- read_csv('https://raw.githubusercontent.com/reinpmomz/Predicting-heart-disease/main/data/heart.csv')%>%
clean_names()
#View(heart_disease)
head(heart_disease)
## # A tibble: 6 x 12
## age sex cp trtbps chol fbs restecg thalach exng oldpeak slp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40 1 1 140 289 0 0 172 0 0 0
## 2 49 0 2 160 180 0 0 156 0 1 1
## 3 37 1 1 130 283 0 1 98 0 0 0
## 4 48 0 3 138 214 0 0 108 1 1.5 1
## 5 54 1 2 150 195 0 0 122 0 0 0
## 6 39 1 2 120 339 0 0 170 0 0 0
## # ... with 1 more variable: output <dbl>
##checking the data structure
str(heart_disease)
## spc_tbl_ [918 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
## $ sex : num [1:918] 1 0 1 0 1 1 0 1 1 0 ...
## $ cp : num [1:918] 1 2 1 3 2 2 1 1 3 1 ...
## $ trtbps : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
## $ chol : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
## $ fbs : num [1:918] 0 0 0 0 0 0 0 0 0 0 ...
## $ restecg: num [1:918] 0 0 1 0 0 0 0 0 0 0 ...
## $ thalach: num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
## $ exng : num [1:918] 0 0 0 1 0 0 0 0 1 0 ...
## $ oldpeak: num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
## $ slp : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
## $ output : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. sex = col_double(),
## .. cp = col_double(),
## .. trtbps = col_double(),
## .. chol = col_double(),
## .. fbs = col_double(),
## .. restecg = col_double(),
## .. thalach = col_double(),
## .. exng = col_double(),
## .. oldpeak = col_double(),
## .. slp = col_double(),
## .. output = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
From the output on the data structure, all of the data has been read as numeric values(‘double’ value or a decimal type with at least two decimal places) but some should be converted to factors since they are categorical.
heart_disease_final <- heart_disease%>%
mutate(output= factor(output, levels = c(0,1),
labels = c("No heart attack", "Has Heart attack")))%>%
mutate(sex= factor(sex, levels = c(0,1),
labels = c("Female", "Male")))%>%
mutate(cp= factor(cp, levels = c(0,1,2,3),
labels = c("typical angina", "atypical angina", "non-anginal", "asymptomatic")))%>%
mutate(fbs= factor(fbs, levels = c(0,1),
labels = c("<= 120mg/dl", ">120mg/dl")))%>%
mutate(restecg= factor(restecg, levels = c(0,1,2),
labels = c("normal", "having ST_T \n abnormality", "showing probable or \n definate left\n ventricular hypertrophy")))%>%
mutate(exng= factor(exng, levels = c(0,1),
labels = c("No", "Yes")))%>%
mutate(slp= factor(slp, levels = c(0,1,2),
labels = c("upsloping", "flat", "downsloping")))%>%
filter(trtbps>0)%>%
filter(chol>0)
var.labels = c(
age = "Age of the patient",
sex = "Sex of the patient",
cp = "Chest pain type",
trtbps = "Resting blood pressure (in mm Hg)",
chol = "Cholestoral in mg/dl",
fbs = "Fasting blood sugar",
restecg = "Resting electrocardiographic results",
thalach = "Maximum heart rate achieved",
exng = "Exercise induced angina",
oldpeak = "ST depression",
slp = "Slope of peak exercise",
output = "Heart disease")
heart_disease_final <- labelled::set_variable_labels(heart_disease_final, .labels = var.labels)
str(heart_disease_final)
## tibble [746 x 12] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:746] 40 49 37 48 54 39 45 54 37 48 ...
## ..- attr(*, "label")= chr "Age of the patient"
## $ sex : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 2 1 ...
## ..- attr(*, "label")= chr "Sex of the patient"
## $ cp : Factor w/ 4 levels "typical angina",..: 2 3 2 4 3 3 2 2 4 2 ...
## ..- attr(*, "label")= chr "Chest pain type"
## $ trtbps : num [1:746] 140 160 130 138 150 120 130 110 140 120 ...
## ..- attr(*, "label")= chr "Resting blood pressure (in mm Hg)"
## $ chol : num [1:746] 289 180 283 214 195 339 237 208 207 284 ...
## ..- attr(*, "label")= chr "Cholestoral in mg/dl"
## $ fbs : Factor w/ 2 levels "<= 120mg/dl",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Fasting blood sugar"
## $ restecg: Factor w/ 3 levels "normal","having ST_T \n abnormality",..: 1 1 2 1 1 1 1 1 1 1 ...
## ..- attr(*, "label")= chr "Resting electrocardiographic results"
## $ thalach: num [1:746] 172 156 98 108 122 170 170 142 130 120 ...
## ..- attr(*, "label")= chr "Maximum heart rate achieved"
## $ exng : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "Exercise induced angina"
## $ oldpeak: num [1:746] 0 1 0 1.5 0 0 0 0 1.5 0 ...
## ..- attr(*, "label")= chr "ST depression"
## $ slp : Factor w/ 3 levels "upsloping","flat",..: 1 2 1 2 1 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "Slope of peak exercise"
## $ output : Factor w/ 2 levels "No heart attack",..: 1 2 1 2 1 1 1 1 2 1 ...
## ..- attr(*, "label")= chr "Heart disease"
The categorial columns are sex, cp, fbs, restecg, exng, slp while continuous columns are age, trtbps, chol, thalach, oldpeak and target variable is output
mean(is.na(heart_disease_final))
## [1] 0
#which(is.na(heart_disease_final))
#which(!complete.cases(heart_disease_final))
There were no missing values in our data set.
Exploring data is vital in giving a clue of the expected relationship between the dependent variable and the exploratory variables (Zuur et al. 2010).
This is analysis of one variable to enable us understand the distribution of values for a single variable.
library(gtsummary)
library(flextable)
set_gtsummary_theme(list(
"tbl_summary-fn:percent_fun" = function(x) style_percent(x, digits = 1),
"tbl_summary-str:categorical_stat" = "{n} ({p}%)"
))
# Setting `Compact` theme
theme_gtsummary_compact()
# make dataset with variables to summarize
tbl_summary(heart_disease_final,
type = list(
all_dichotomous() ~ "categorical",
all_continuous() ~ "continuous2")
, statistic = all_continuous() ~ c(
"{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}")
, digits = all_continuous() ~ 2
, missing = "always" # don't list missing data separately
,missing_text = "Missing"
) %>%
modify_header(label = "**Descriptives**") %>% # update the column header
bold_labels() %>%
italicize_levels()%>%
add_n() # add column with total number of non-missing observations
Descriptives | N | N = 7461 |
---|---|---|
Age of the patient | 746 | |
Mean (SD) | 52.88 (9.51) | |
Median (IQR) | 54.00 (46.00, 59.00) | |
Range | 28.00, 77.00 | |
Missing | 0 | |
Sex of the patient | 746 | |
Female | 182 (24.4%) | |
Male | 564 (75.6%) | |
Missing | 0 | |
Chest pain type | 746 | |
typical angina | 41 (5.50%) | |
atypical angina | 166 (22.3%) | |
non-anginal | 169 (22.7%) | |
asymptomatic | 370 (49.6%) | |
Missing | 0 | |
Resting blood pressure (in mm Hg) | 746 | |
Mean (SD) | 133.02 (17.28) | |
Median (IQR) | 130.00 (120.00, 140.00) | |
Range | 92.00, 200.00 | |
Missing | 0 | |
Cholestoral in mg/dl | 746 | |
Mean (SD) | 244.64 (59.15) | |
Median (IQR) | 237.00 (207.25, 275.00) | |
Range | 85.00, 603.00 | |
Missing | 0 | |
Fasting blood sugar | 746 | |
<= 120mg/dl | 621 (83.2%) | |
>120mg/dl | 125 (16.8%) | |
Missing | 0 | |
Resting electrocardiographic results | 746 | |
normal | 445 (59.7%) | |
having ST_T abnormality | 125 (16.8%) | |
showing probable or definate left ventricular hypertrophy | 176 (23.6%) | |
Missing | 0 | |
Maximum heart rate achieved | 746 | |
Mean (SD) | 140.23 (24.52) | |
Median (IQR) | 140.00 (122.00, 160.00) | |
Range | 69.00, 202.00 | |
Missing | 0 | |
Exercise induced angina | 746 | |
No | 459 (61.5%) | |
Yes | 287 (38.5%) | |
Missing | 0 | |
ST depression | 746 | |
Mean (SD) | 0.90 (1.07) | |
Median (IQR) | 0.50 (0.00, 1.50) | |
Range | -0.10, 6.20 | |
Missing | 0 | |
Slope of peak exercise | 746 | |
upsloping | 349 (46.8%) | |
flat | 354 (47.5%) | |
downsloping | 43 (5.76%) | |
Missing | 0 | |
Heart disease | 746 | |
No heart attack | 390 (52.3%) | |
Has Heart attack | 356 (47.7%) | |
Missing | 0 | |
1 n (%) |
library(ggpubr)
p1 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = output), show.legend = FALSE)+
labs(x="",y="", title = "Heart disease")+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
#guides(fill = FALSE)+
theme_pubclean()+
theme(axis.title = element_text(face="bold",color="black",size=13),
#legend.position = "none",
axis.text.y = element_text(color="black",size=12),
axis.text.x = element_text(color="black",size=11),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p2 <- ggplot(heart_disease_final, aes(x=sex))+
geom_bar(aes(fill = sex), show.legend = FALSE)+
labs(x="",y="", title = "Sex of the patient")+
scale_y_continuous(breaks = seq(0, 600, by = 100), limits = c(0, 600))+
theme_pubclean()+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p3 <- ggplot(heart_disease_final, aes(x=cp))+
geom_bar(aes(fill = cp), show.legend = FALSE)+
labs(x="",y="", title = "Chest pain Type")+
scale_y_continuous(breaks = seq(0, 400, by = 50), limits = c(0, 400))+
theme_pubclean()+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text.y = element_text(color="black",size=12),
axis.text.x = element_text(color="black",size=12, angle = 90),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p4 <- ggplot(heart_disease_final, aes(x=fbs))+
geom_bar(aes(fill = fbs), show.legend = FALSE)+
labs(x="",y="", title = "Fasting blood sugar")+
scale_y_continuous(breaks = seq(0, 750, by = 100), limits = c(0, 750))+
theme_pubclean()+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme(axis.title= element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p5 <- ggplot(heart_disease_final, aes(x=restecg))+
geom_bar(aes(fill = restecg), show.legend = FALSE)+
labs(x="",y="", title = "Resting electrocardiographic results")+
scale_y_continuous(breaks = seq(0, 500, by = 100), limits = c(0, 500))+
theme_pubclean()+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text.y = element_text(color="black",size=12),
axis.text.x = element_text(color="black",size=11, angle = 90),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p6 <- ggplot(heart_disease_final, aes(x=exng))+
geom_bar(aes(fill = exng), show.legend = FALSE)+
labs(x="",y="", title = "Exercise induced angina")+
scale_y_continuous(breaks = seq(0, 500, by = 100), limits = c(0, 500))+
theme_pubclean()+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p7 <- ggplot(heart_disease_final, aes(x=slp))+
geom_bar(aes(fill = slp), show.legend = FALSE)+
labs(x="",y="", title = "Slope of peak exercise")+
scale_y_continuous(breaks = seq(0, 400, by = 50), limits = c(0, 400))+
theme_pubclean()+
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
figure1 <- ggarrange(p1, p2, p3, p4, p5, p6, p7,
#labels = c("A", "B", "C","D", "E", "F", "G"),
ncol = 4, nrow = 2)
annotate_figure(figure1,
top = text_grob("Count plots for various categorical features",
color = "red", face = "bold", size = 15),
bottom = text_grob("Data source: \n heart disease data set", color = "blue",
hjust = 1, x = 0.98, face = "italic", size = 10),
#left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
#right = "",
fig.lab = "Figure 1", fig.lab.face = "bold"
)
p8 <- ggplot(heart_disease_final, aes(x=age)) +
geom_histogram(color="black", fill="white")+
geom_vline(aes(xintercept=mean(age)),
color="blue", linetype="dashed", size=1)+
scale_y_continuous(breaks = seq(0, 85, by = 10), limits = c(0, 80))+
scale_x_continuous(n.breaks = 10)+
labs(x="age",y="count", title = "Age of the patient")+
theme(axis.title = element_text(color="black",size=12),
axis.text = element_text(color="black",size=11),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
p9 <-ggplot(heart_disease_final, aes(x=trtbps)) +
geom_histogram(color="black", fill="lightblue")+
geom_vline(aes(xintercept=mean(trtbps)),
color="blue", linetype="dashed", size=1)+
scale_y_continuous(breaks = seq(0, 130, by = 10), limits = c(0, 130))+
scale_x_continuous(n.breaks = 10)+
labs(x="Resting blood pressure (in mm Hg)",y="count", title = "Resting blood pressure")+
theme(axis.title = element_text(color="black",size=12),
axis.text = element_text(color="black",size=11),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
p10 <-ggplot(heart_disease_final, aes(x=chol)) +
geom_histogram(color="black", fill="pink")+
geom_vline(aes(xintercept=mean(chol)),
color="blue", linetype="dashed", size=1)+
scale_y_continuous(breaks = seq(0, 120, by = 10), limits = c(0, 120))+
scale_x_continuous(n.breaks = 12)+
labs(x="cholestoral in mg/dl",y="count", title = "cholestoral")+
theme(axis.title = element_text(color="black",size=12),
axis.text = element_text(color="black",size=11),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
p11 <-ggplot(heart_disease_final, aes(x=thalach)) +
geom_histogram(color="black", fill="lightgreen")+
geom_vline(aes(xintercept=mean(thalach)),
color="blue", linetype="dashed", size=1)+
scale_y_continuous(breaks = seq(0, 80, by = 10), limits = c(0, 80))+
scale_x_continuous(n.breaks = 12)+
labs(x="Maximum heart rate achieved",y="count", title = "Maximum heart rate")+
theme(axis.title = element_text(color="black",size=12),
axis.text = element_text(color="black",size=11),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
figure2 <- ggarrange(p8, p9, p10, p11,
ncol = 2, nrow = 2)
annotate_figure(figure2,
top = text_grob("Histograms for various continous features",
color = "red", face = "bold", size = 15),
bottom = text_grob("Data source: \n heart disease data set", color = "blue",
hjust = 1, x = 0.98, face = "italic", size = 10),
#left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
#right = "",
fig.lab = "Figure 2", fig.lab.face = "bold"
)
# make dataset with variables to summarize
tbl_summary(heart_disease_final,
by = output,
type = list(
all_dichotomous() ~ "categorical",
all_continuous() ~ "continuous2")
, statistic = all_continuous() ~ c(
"{mean} ({sd})",
"{median} ({p25}, {p75})",
"{min}, {max}")
, digits = all_continuous() ~ 2
, missing = "always" # don't list missing data separately
,missing_text = "Missing"
) %>%
modify_header(label = "**Variables**") %>% # update the column header
bold_labels() %>%
italicize_levels()%>%
add_n()%>% # add column with total number of non-missing observations
add_p(pvalue_fun = ~style_pvalue(.x, digits = 3),
c(age, trtbps, chol, thalach, oldpeak) ~ "t.test",
test.args = c(age, trtbps, chol, thalach, oldpeak) ~ list(var.equal = TRUE)) %>%
bold_p(t= 0.05) %>% # bold p-values under a given threshold (default 0.05)
#add_overall() %>%
#add_difference() %>% #add column for difference between two group, confidence interval, and p-value
modify_spanning_header(c("stat_1", "stat_2") ~ "**Heart Disease**") %>%
#modify_caption("**Table 1. Patient Characteristics**")%>%
modify_footnote(
all_stat_cols() ~ "Mean (SD); Median (IQR); Range; Frequency (%)"
)
Variables | N | Heart Disease | p-value2 | |
---|---|---|---|---|
No heart attack, N = 3901 | Has Heart attack, N = 3561 | |||
Age of the patient | 746 | <0.001 | ||
Mean (SD) | 50.17 (9.31) | 55.85 (8.82) | ||
Median (IQR) | 51.00 (43.00, 56.00) | 57.00 (50.00, 62.00) | ||
Range | 28.00, 76.00 | 31.00, 77.00 | ||
Missing | 0 | 0 | ||
Sex of the patient | 746 | <0.001 | ||
Female | 142 (36.4%) | 40 (11.2%) | ||
Male | 248 (63.6%) | 316 (88.8%) | ||
Missing | 0 | 0 | ||
Chest pain type | 746 | <0.001 | ||
typical angina | 26 (6.67%) | 15 (4.21%) | ||
atypical angina | 145 (37.2%) | 21 (5.90%) | ||
non-anginal | 123 (31.5%) | 46 (12.9%) | ||
asymptomatic | 96 (24.6%) | 274 (77.0%) | ||
Missing | 0 | 0 | ||
Resting blood pressure (in mm Hg) | 746 | <0.001 | ||
Mean (SD) | 130.16 (16.14) | 136.15 (17.96) | ||
Median (IQR) | 130.00 (120.00, 140.00) | 136.00 (123.75, 145.00) | ||
Range | 94.00, 190.00 | 92.00, 200.00 | ||
Missing | 0 | 0 | ||
Cholestoral in mg/dl | 746 | 0.005 | ||
Mean (SD) | 238.77 (55.39) | 251.06 (62.46) | ||
Median (IQR) | 231.50 (203.00, 269.00) | 246.00 (212.00, 283.25) | ||
Range | 85.00, 564.00 | 100.00, 603.00 | ||
Missing | 0 | 0 | ||
Fasting blood sugar | 746 | <0.001 | ||
<= 120mg/dl | 347 (89.0%) | 274 (77.0%) | ||
>120mg/dl | 43 (11.0%) | 82 (23.0%) | ||
Missing | 0 | 0 | ||
Resting electrocardiographic results | 746 | <0.001 | ||
normal | 257 (65.9%) | 188 (52.8%) | ||
having ST_T abnormality | 52 (13.3%) | 73 (20.5%) | ||
showing probable or definate left ventricular hypertrophy | 81 (20.8%) | 95 (26.7%) | ||
Missing | 0 | 0 | ||
Maximum heart rate achieved | 746 | <0.001 | ||
Mean (SD) | 149.06 (23.11) | 130.55 (22.30) | ||
Median (IQR) | 150.50 (135.00, 167.00) | 130.00 (115.00, 147.00) | ||
Range | 69.00, 202.00 | 71.00, 195.00 | ||
Missing | 0 | 0 | ||
Exercise induced angina | 746 | <0.001 | ||
No | 340 (87.2%) | 119 (33.4%) | ||
Yes | 50 (12.8%) | 237 (66.6%) | ||
Missing | 0 | 0 | ||
ST depression | 746 | <0.001 | ||
Mean (SD) | 0.39 (0.68) | 1.46 (1.15) | ||
Median (IQR) | 0.00 (0.00, 0.60) | 1.50 (0.50, 2.00) | ||
Range | -0.10, 4.20 | 0.00, 6.20 | ||
Missing | 0 | 0 | ||
Slope of peak exercise | 746 | <0.001 | ||
upsloping | 304 (77.9%) | 45 (12.6%) | ||
flat | 75 (19.2%) | 279 (78.4%) | ||
downsloping | 11 (2.82%) | 32 (8.99%) | ||
Missing | 0 | 0 | ||
1 Mean (SD); Median (IQR); Range; Frequency (%) | ||||
2 Two Sample t-test; Pearson's Chi-squared test |
p12 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = sex), show.legend = TRUE, position="dodge")+
labs(x="",y="", title = "Sex of the patient", fill="")+
scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
theme_pubclean()+
geom_text(aes(fill=sex, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
position = position_dodge(.9))+
theme(axis.title = element_text(face="bold",color="black",size=13),
legend.position = "top",
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p13 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = cp), show.legend = TRUE, position="dodge")+
labs(x="",y="", title = "Chest pain Type", fill="")+
scale_y_continuous(breaks = seq(0, 300, by = 50), limits = c(0, 300))+
theme_pubclean()+
geom_text(aes(fill=cp, label = ..count..), stat = "count", vjust = 0.5, colour = "black",
position = position_dodge(.9))+
theme(axis.title = element_text(face="bold",color="black",size=13),
legend.position = "top",
axis.text.y = element_text(color="black",size=12),
axis.text.x = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())+
guides(fill = guide_legend(nrow = 2))
p14 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = fbs), show.legend = TRUE, position="dodge")+
labs(x="",y="", title = "Fasting blood sugar", fill="")+
scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
theme_pubclean()+
geom_text(aes(fill=fbs, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
position = position_dodge(.9))+
theme(axis.title= element_text(face="bold",color="black",size=13),
legend.position = "top",
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p15 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = restecg), show.legend = TRUE, position="dodge")+
labs(x="",y="", title = "Resting electrocardiographic results", fill="")+
scale_y_continuous(breaks = seq(0, 300, by = 50), limits = c(0, 300))+
theme_pubclean()+
geom_text(aes(fill=restecg, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
position = position_dodge(.9))+
theme(axis.title = element_text(face="bold",color="black",size=13),
legend.position = "top",
axis.text.y = element_text(color="black",size=12),
axis.text.x = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p16 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = exng), show.legend = TRUE, position="dodge")+
labs(x="",y="", title = "Exercise induced angina", fill="")+
scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
theme_pubclean()+
geom_text(aes(fill=exng, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
position = position_dodge(.9))+
theme(axis.title = element_text(face="bold",color="black",size=13),
legend.position = "top",
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
p17 <- ggplot(heart_disease_final, aes(x=output))+
geom_bar(aes(fill = slp), show.legend = TRUE, position="dodge")+
labs(x="",y="", title = "Slope of peak exercise", fill="")+
scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
theme_pubclean()+
geom_text(aes(fill=slp, label = ..count..), stat = "count", vjust = 0.5, colour = "black",
position = position_dodge(.9))+
theme(axis.title = element_text(face="bold",color="black",size=13),
legend.position = "top",
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank())
figure3 <- ggarrange( p12, p13, p14, p15, p16, p17,
ncol = 3, nrow = 2)
annotate_figure(figure3,
top = text_grob("Distribution by Heart disease of various categorical features",
color = "red", face = "bold", size = 15),
bottom = text_grob("Data source: \n heart disease data set", color = "blue",
hjust = 1, x = 0.98, face = "italic", size = 10),
#left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
#right = "",
fig.lab = "Figure 3", fig.lab.face = "bold"
)
p18 <- ggplot(heart_disease_final, aes(x=age, fill=output)) +
geom_density(alpha=0.4)+
scale_y_continuous(n.breaks = 10)+
scale_x_continuous(breaks = seq(25, 80, by = 5), limits = c(25, 80))+
labs(x="age",y="density", title = "Age of the patient")+
theme(axis.title = element_text(color="black",size=12),
legend.position = "top",
axis.text = element_text(color="black",size=11),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())+
guides(fill = guide_legend(title = ""))
p19 <-ggplot(heart_disease_final, aes(x=trtbps, fill=output)) +
geom_density(alpha=0.4)+
scale_y_continuous(n.breaks = 10)+
scale_x_continuous(breaks = seq(90, 200, by = 10), limits = c(90, 200))+
labs(x="Resting blood pressure (in mm Hg)",y="density", title = "Resting blood pressure")+
theme(axis.title = element_text(color="black",size=12),
legend.position = "top",
axis.text.y = element_text(color="black",size=11),
axis.text.x = element_text(color="black",size=9, angle = 90),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())+
guides(fill = guide_legend(title = ""))
p20 <-ggplot(heart_disease_final, aes(x=chol, fill=output)) +
geom_density(alpha=0.4)+
scale_y_continuous(n.breaks = 10)+
scale_x_continuous(breaks = seq(50, 650, by = 50), limits = c(50, 650))+
labs(x="cholestoral in mg/dl",y="density", title = "cholestoral")+
theme(axis.title = element_text(color="black",size=12),
legend.position = "top",
axis.text.y = element_text(color="black",size=11),
axis.text.x = element_text(color="black",size=9, angle = 90),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())+
guides(fill = guide_legend(title = ""))
p21 <-ggplot(heart_disease_final, aes(x=thalach, fill=output)) +
geom_density(alpha=0.4)+
scale_y_continuous(n.breaks = 10)+
scale_x_continuous(breaks = seq(60, 210, by = 10), limits = c(60, 210))+
labs(x="Maximum heart rate achieved",y="density", title = "Maximum heart rate")+
theme(axis.title = element_text(color="black",size=12),
legend.position = "top",
axis.text.y = element_text(color="black",size=11),
axis.text.x = element_text(color="black",size=9, angle=90),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())+
guides(fill = guide_legend(title = ""))
p22 <- ggplot(heart_disease_final, aes(output, age))+
geom_boxplot(aes(colour = output), outlier.colour = "black",
outlier.shape = 8, show.legend = FALSE)+
labs(x="",y="age", title = "Age of the patient")+
scale_y_continuous(breaks = seq(25, 80, by = 5), limits = c(25, 80))+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank())
p23 <- ggplot(heart_disease_final, aes(output, trtbps))+
geom_boxplot(aes(colour = output), outlier.colour = "black",
outlier.shape = 1, show.legend = FALSE)+
labs(x="",y="Resting blood pressure (mm Hg)", title = "Resting blood pressure")+
scale_y_continuous(breaks = seq(90, 200, by = 10), limits = c(90, 200))+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank())
p24 <-ggplot(heart_disease_final, aes(output, chol))+
geom_boxplot(aes(colour = output), outlier.colour = "black",
outlier.shape = 1, show.legend = FALSE)+
labs(x="",y="cholestoral in mg/dl", title = "cholestoral")+
scale_y_continuous(breaks = seq(50, 650, by = 50), limits = c(50, 650))+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank())
p25 <-ggplot(heart_disease_final, aes(output, thalach))+
geom_boxplot(aes(colour = output), outlier.colour = "black",
outlier.shape = 1, show.legend = FALSE)+
labs(x="",y="Maximum heart rate achieved", title = "Maximum heart rate")+
scale_y_continuous(breaks = seq(60, 200, by = 10), limits = c(60, 200))+
theme(axis.title = element_text(face="bold",color="black",size=13),
axis.text = element_text(color="black",size=12),
plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank())
figure4 <- ggarrange( p18, p19, p20, p21, p22, p23, p24, p25,
ncol = 4, nrow = 2)
annotate_figure(figure4,
top = text_grob("Distribution by Heart disease of various continous features",
color = "red", face = "bold", size = 15),
bottom = text_grob("Data source: \n heart disease data set", color = "blue",
hjust = 1, x = 0.98, face = "italic", size = 10),
#left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
#right = "",
fig.lab = "Figure 4", fig.lab.face = "bold"
)
A negative correlation implies that the two variables under consideration vary in opposite directions, that is, if a variable increases the other decreases and vice versa. On the other hand, a positive correlation implies that the two variables under consideration vary in the same direction, i.e., if a variable increases the other one increases and if one decreases the other one decreases as well
# improved correlation matrix
library(corrplot)
corrplot(cor(heart_disease_final%>%mutate(across(c(2,3,6, 7, 9, 11, 12), as.numeric)),
method='spearman'),
method = "color", #number
addCoef.col = "black",
number.cex = 0.95,
type = "upper" # show only upper side #full
)
Only correlations with p-values smaller than the significance level (p<0.05)should be interpreted.
# correlation tests for whole dataset
library(Hmisc)
res <- rcorr(as.matrix(heart_disease_final%>%mutate(across(c(2,3,6, 7, 9, 11, 12), as.numeric)))) # rcorr() accepts matrices only
# display p-values (rounded to 3 decimals)
kable(
round(res$P, 3)
)
age | sex | cp | trtbps | chol | fbs | restecg | thalach | exng | oldpeak | slp | output | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
age | NA | 0.264 | 0.000 | 0.000 | 0.109 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
sex | 0.264 | NA | 0.000 | 0.349 | 0.003 | 0.009 | 0.713 | 0.000 | 0.000 | 0.001 | 0.000 | 0.000 |
cp | 0.000 | 0.000 | NA | 0.235 | 0.029 | 0.236 | 0.032 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
trtbps | 0.000 | 0.349 | 0.235 | NA | 0.009 | 0.000 | 0.090 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 |
chol | 0.109 | 0.003 | 0.029 | 0.009 | NA | 0.141 | 0.064 | 0.588 | 0.018 | 0.110 | 0.032 | 0.005 |
fbs | 0.000 | 0.009 | 0.236 | 0.000 | 0.141 | NA | 0.000 | 0.005 | 0.003 | 0.129 | 0.000 | 0.000 |
restecg | 0.000 | 0.713 | 0.032 | 0.090 | 0.064 | 0.000 | NA | 0.488 | 0.194 | 0.001 | 0.003 | 0.002 |
thalach | 0.000 | 0.000 | 0.000 | 0.001 | 0.588 | 0.005 | 0.488 | NA | 0.000 | 0.000 | 0.000 | 0.000 |
exng | 0.000 | 0.000 | 0.000 | 0.000 | 0.018 | 0.003 | 0.194 | 0.000 | NA | 0.000 | 0.000 | 0.000 |
oldpeak | 0.000 | 0.001 | 0.000 | 0.000 | 0.110 | 0.129 | 0.001 | 0.000 | 0.000 | NA | 0.000 | 0.000 |
slp | 0.000 | 0.000 | 0.000 | 0.000 | 0.032 | 0.000 | 0.003 | 0.000 | 0.000 | 0.000 | NA | 0.000 |
output | 0.000 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | NA |
After exploring the data, a statistical model will be used to give the final conclusion of the relationship. The dependent variable, “output” is categorical with two variables while the independent variables are more than one and are a mixture of categorical and continuous data and hence a binary logistic model will be applied. This model will be trained and tested and used to make some predictions.
spliting the data in a 80:20 ratio (training:testing). We use set.seed() to make sure that the results are repeatable. We also use the outcome variable, output to stratify. This is to ensure that the distribution of the outcome is comparable in both data sets.
library(rsample)
set.seed(123)
split <- initial_split(heart_disease_final, prop = 0.8, strata = output)
train <- training(split)
test <- testing(split)
#initial_time_split() takes the 1st prop samples for training,instead of random selection.
#sample=sample.split(mydata, SplitRatio = 0.80)
#train<-subset(mydata, sample==TRUE)
#test<-subset(mydata, sample==FALSE)
glm_logistic_reg <-glm(formula=output~.,data=train, family="binomial")
#summary(glm_logistic_reg)
broom::glance(glm_logistic_reg)
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 828. 597 -198. 429. 499. 397. 582 598
#performance::check_collinearity(glm_logistic_reg)
performance::check_model(glm_logistic_reg)
#performance::performance_accuracy(glm_logistic_reg)
#plot(performance::performance_roc(glm_logistic_reg))
tbl_regression(glm_logistic_reg, exponentiate = TRUE,
pvalue_fun = ~style_pvalue(.x, digits = 3)
)%>%
#add_global_p()%>% # add global p-value for categorical variables
bold_p(t= 0.05) %>% # bold p-values under a given threshold (default 0.05)
bold_labels() %>%
italicize_levels()%>%
modify_header(label = "**Logistic regression**")%>% # update the column header
add_significance_stars(
pattern = "{estimate} ({conf.low}-{conf.high}){stars}",
hide_ci = TRUE, hide_se = TRUE , hide_p = FALSE) %>%
modify_header(estimate ~ "**AOR (95% CI)**") %>%
modify_footnote(estimate ~ "AOR = Adjusted Odds Ratio, CI = Confidence Interval", abbreviation = TRUE)
Logistic regression | AOR (95% CI)1,2 | p-value |
---|---|---|
Age of the patient | 1.03 (1.00-1.06) | 0.069 |
Sex of the patient | ||
Female | — | |
Male | 5.08 (2.62-10.2)*** | <0.001 |
Chest pain type | ||
typical angina | — | |
atypical angina | 0.76 (0.23-2.48) | 0.641 |
non-anginal | 0.87 (0.30-2.56) | 0.801 |
asymptomatic | 3.88 (1.40-11.3)* | 0.010 |
Resting blood pressure (in mm Hg) | 1.01 (0.99-1.03) | 0.266 |
Cholestoral in mg/dl | 1.00 (1.00-1.01) | 0.098 |
Fasting blood sugar | ||
<= 120mg/dl | — | |
>120mg/dl | 1.52 (0.75-3.12) | 0.251 |
Resting electrocardiographic results | ||
normal | — | |
having ST_T abnormality | 0.95 (0.45-2.02) | 0.885 |
showing probable or definate left ventricular hypertrophy | 1.24 (0.65-2.36) | 0.510 |
Maximum heart rate achieved | 1.00 (0.99-1.01) | 0.936 |
Exercise induced angina | ||
No | — | |
Yes | 2.43 (1.33-4.42)** | 0.004 |
ST depression | 1.50 (1.11-2.06)** | 0.010 |
Slope of peak exercise | ||
upsloping | — | |
flat | 10.9 (6.10-20.2)*** | <0.001 |
downsloping | 3.76 (1.17-12.4)* | 0.027 |
1 *p<0.05; **p<0.01; ***p<0.001 | ||
2 AOR = Adjusted Odds Ratio, CI = Confidence Interval |
Statistically significant variables are Male(Sex), Asymptomatic (chest pain type), Yes (Angina), ST depression, flat (slope of peak exercise) and downsloping (slope of peak exercise).
Having Asymptomatic chest pain type, Angina, flat slope of peak exercise, downsloping slope of peak exercise, high ST depression and being Male are significantly associated with higher odds of having a heart attack. e.g People with Angina are 2.43 times more likely to get heart attack than those with no Angina. Males are 5.08 times more likely to get heart attack than females.
atypical angina chest pain type, non-anginal chest pain type, having ST_T abnormality sesting electrocardiographic results are associated with lower odds of heart attack. However they do not have a significant effect on heart attack.
test$prediction <-predict(glm_logistic_reg, test, type= "response")
##replacing probabilities with dummies
test$predicted <- as.factor(ifelse(test$prediction<0.5, "No heart attack", "Has Heart attack"))
test <- test%>%
mutate(predicted= factor(predicted, levels = c("No heart attack", "Has Heart attack")))
test1 <- test%>%
mutate(across(c(predicted, output), fct_rev))
##creating a confusion matrix
table(Predicted=test1$predicted, Actual=test1$output)
## Actual
## Predicted Has Heart attack No heart attack
## Has Heart attack 68 14
## No heart attack 3 63
##getting percentage of misclassification
1-sum(diag(table(Predicted=test1$predicted, Actual=test1$output)))/sum(table(Predicted=test1$predicted, Actual=test1$output))
## [1] 0.1148649
##getting percentage of accuracy
sum(diag(table(Predicted=test1$predicted, Actual=test1$output)))/sum(table(Predicted=test1$predicted, Actual=test1$output))
## [1] 0.8851351
##Testing for goodness of fit.
with(glm_logistic_reg, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
## [1] 2.407305e-82
library(caret)
confusionMatrix(test1$predicted, test1$output)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Has Heart attack No heart attack
## Has Heart attack 68 14
## No heart attack 3 63
##
## Accuracy : 0.8851
## 95% CI : (0.8225, 0.9316)
## No Information Rate : 0.5203
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.7713
##
## Mcnemar's Test P-Value : 0.01529
##
## Sensitivity : 0.9577
## Specificity : 0.8182
## Pos Pred Value : 0.8293
## Neg Pred Value : 0.9545
## Prevalence : 0.4797
## Detection Rate : 0.4595
## Detection Prevalence : 0.5541
## Balanced Accuracy : 0.8880
##
## 'Positive' Class : Has Heart attack
##
True positives (TP) we have 68 correctly classified instances of heart attack
True negatives(TN) we have 63 cases classified as not having heart attack
False negatives (FN) we have 3 cases our model said the patient didn’t have heart attack and they actually did
False positives (FP) we have 14 cases our model said a patient did have heart attack, but they actually didn’t
Sensitivity (Recall) The true positive rate is 0.9577 meaning from all the patients that had heart attack ; how many did we predict correctly - this equation is Sensitivity =TP / (TP + FN)
Specificity The true negative rate is 0.8182 meaning from all the patients that didn’t have heart attack - how many did we predict correctly - this equation is Specificity =TN / (TN + FP)
Positive Predictive Value (Precision) is 0.8293 meaning from all the classes we predicted as positive, how many were actually positive. equation is Positive Predictive Value=TP/(TP + FP)
Negative Predictive Value is 0.9545 meaning from all the classes we predicted as negative, how many were actually negative. equation is Negative Predictive Value=TN/(TN + FN)
Accuracy of the model is 88.5% with p value of 2.407305e-82 (less than the default level of significance (0.05)).
plot(performance::performance_roc(glm_logistic_reg, new_data = test))