1 Introduction

1.1 Overview

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. This makes it important for health facilities to be able to predict occurrences of heart diseases in order to make plans for early treatment (Assmann et al. 2005). Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

1.2 About the Data

This dataset contains 918 observations with 12 features. The data dictionary is as follows:

output: target variable (1=has heart attack, 0= has no heart attack)
age: Age of the patient
sex: Sex of the patient (1= male, 0= female).
cp: chest pain type (0=typical angina, 1=atypical angina, 2=non-anginal and 3=asymptomatic)
trtbps: resting blood pressure (in mm Hg)
chol: cholestoral in mg/dl fetched via BMI sensor
fbs: fasting blood sugar (>120mg/dl=1, <= 120mg/dl=0).
restecg: resting electrocardiographic results( 0=normal, 1=having ST-T wave abnormality, 2=showing probable or definate left ventricular hypertrophy by Este’s criteria.)
thalach: Maximum heart rate achieved.
exng: exercise induced angina; 1=yes, 0=no.
oldpeak: ST depression.
slp = slope of peak exercise; 0=upsloping, 1=flat, 2=downsloping.

2 Data Exploration

2.1 loading Relevant packages

#Import relevant packages

library(tidyverse)
library(janitor)
#library(readr)
library(plotly)
library(knitr)

2.2 loading Data Set

heart_disease <- read_csv('https://raw.githubusercontent.com/reinpmomz/Predicting-heart-disease/main/data/heart.csv')%>%
  clean_names()


#View(heart_disease)
head(heart_disease)

## # A tibble: 6 x 12
##     age   sex    cp trtbps  chol   fbs restecg thalach  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
## 1    40     1     1    140   289     0       0     172     0     0       0
## 2    49     0     2    160   180     0       0     156     0     1       1
## 3    37     1     1    130   283     0       1      98     0     0       0
## 4    48     0     3    138   214     0       0     108     1     1.5     1
## 5    54     1     2    150   195     0       0     122     0     0       0
## 6    39     1     2    120   339     0       0     170     0     0       0
## # ... with 1 more variable: output <dbl>

##checking the data structure

str(heart_disease)

## spc_tbl_ [918 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age    : num [1:918] 40 49 37 48 54 39 45 54 37 48 ...
##  $ sex    : num [1:918] 1 0 1 0 1 1 0 1 1 0 ...
##  $ cp     : num [1:918] 1 2 1 3 2 2 1 1 3 1 ...
##  $ trtbps : num [1:918] 140 160 130 138 150 120 130 110 140 120 ...
##  $ chol   : num [1:918] 289 180 283 214 195 339 237 208 207 284 ...
##  $ fbs    : num [1:918] 0 0 0 0 0 0 0 0 0 0 ...
##  $ restecg: num [1:918] 0 0 1 0 0 0 0 0 0 0 ...
##  $ thalach: num [1:918] 172 156 98 108 122 170 170 142 130 120 ...
##  $ exng   : num [1:918] 0 0 0 1 0 0 0 0 1 0 ...
##  $ oldpeak: num [1:918] 0 1 0 1.5 0 0 0 0 1.5 0 ...
##  $ slp    : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
##  $ output : num [1:918] 0 1 0 1 0 0 0 0 1 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   sex = col_double(),
##   ..   cp = col_double(),
##   ..   trtbps = col_double(),
##   ..   chol = col_double(),
##   ..   fbs = col_double(),
##   ..   restecg = col_double(),
##   ..   thalach = col_double(),
##   ..   exng = col_double(),
##   ..   oldpeak = col_double(),
##   ..   slp = col_double(),
##   ..   output = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

From the output on the data structure, all of the data has been read as numeric values(‘double’ value or a decimal type with at least two decimal places) but some should be converted to factors since they are categorical.

2.3 Converting into factors

heart_disease_final <- heart_disease%>%
  mutate(output= factor(output, levels = c(0,1), 
                                      labels = c("No heart attack", "Has Heart attack")))%>%
  mutate(sex= factor(sex, levels = c(0,1), 
                                      labels = c("Female", "Male")))%>%
  mutate(cp= factor(cp, levels = c(0,1,2,3), 
          labels = c("typical angina", "atypical angina", "non-anginal", "asymptomatic")))%>%
  mutate(fbs= factor(fbs, levels = c(0,1), 
                                      labels = c("<= 120mg/dl", ">120mg/dl")))%>%
  mutate(restecg= factor(restecg, levels = c(0,1,2), 
          labels = c("normal", "having ST_T \n abnormality", "showing probable or \n definate left\n ventricular hypertrophy")))%>%
  mutate(exng= factor(exng, levels = c(0,1), 
                                      labels = c("No", "Yes")))%>%
  mutate(slp= factor(slp, levels = c(0,1,2), 
          labels = c("upsloping", "flat", "downsloping")))%>%
  filter(trtbps>0)%>%
  filter(chol>0)

var.labels = c(
  age = "Age of the patient",
  sex = "Sex of the patient",
  cp = "Chest pain type",
  trtbps = "Resting blood pressure (in mm Hg)",
  chol = "Cholestoral in mg/dl",
  fbs = "Fasting blood sugar",
  restecg = "Resting electrocardiographic results",
  thalach = "Maximum heart rate achieved",
  exng = "Exercise induced angina",
  oldpeak = "ST depression",
  slp = "Slope of peak exercise",
  output = "Heart disease")

heart_disease_final <- labelled::set_variable_labels(heart_disease_final, .labels = var.labels)

str(heart_disease_final)

## tibble [746 x 12] (S3: tbl_df/tbl/data.frame)
##  $ age    : num [1:746] 40 49 37 48 54 39 45 54 37 48 ...
##   ..- attr(*, "label")= chr "Age of the patient"
##  $ sex    : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 2 1 ...
##   ..- attr(*, "label")= chr "Sex of the patient"
##  $ cp     : Factor w/ 4 levels "typical angina",..: 2 3 2 4 3 3 2 2 4 2 ...
##   ..- attr(*, "label")= chr "Chest pain type"
##  $ trtbps : num [1:746] 140 160 130 138 150 120 130 110 140 120 ...
##   ..- attr(*, "label")= chr "Resting blood pressure (in mm Hg)"
##  $ chol   : num [1:746] 289 180 283 214 195 339 237 208 207 284 ...
##   ..- attr(*, "label")= chr "Cholestoral in mg/dl"
##  $ fbs    : Factor w/ 2 levels "<= 120mg/dl",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Fasting blood sugar"
##  $ restecg: Factor w/ 3 levels "normal","having ST_T \n abnormality",..: 1 1 2 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "Resting electrocardiographic results"
##  $ thalach: num [1:746] 172 156 98 108 122 170 170 142 130 120 ...
##   ..- attr(*, "label")= chr "Maximum heart rate achieved"
##  $ exng   : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "Exercise induced angina"
##  $ oldpeak: num [1:746] 0 1 0 1.5 0 0 0 0 1.5 0 ...
##   ..- attr(*, "label")= chr "ST depression"
##  $ slp    : Factor w/ 3 levels "upsloping","flat",..: 1 2 1 2 1 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "Slope of peak exercise"
##  $ output : Factor w/ 2 levels "No heart attack",..: 1 2 1 2 1 1 1 1 2 1 ...
##   ..- attr(*, "label")= chr "Heart disease"

The categorial columns are sex, cp, fbs, restecg, exng, slp while continuous columns are age, trtbps, chol, thalach, oldpeak and target variable is output

2.4 checking missing values

mean(is.na(heart_disease_final))

## [1] 0

#which(is.na(heart_disease_final))

#which(!complete.cases(heart_disease_final))

There were no missing values in our data set.

3 Exploratory data analysis

Exploring data is vital in giving a clue of the expected relationship between the dependent variable and the exploratory variables (Zuur et al. 2010).

3.1 Univariate analysis

This is analysis of one variable to enable us understand the distribution of values for a single variable.

3.1.1 Descriptives Frequency table

library(gtsummary)
library(flextable)

set_gtsummary_theme(list(
  "tbl_summary-fn:percent_fun" = function(x) style_percent(x, digits = 1),
  "tbl_summary-str:categorical_stat" = "{n} ({p}%)"
))
# Setting `Compact` theme
theme_gtsummary_compact()

# make dataset with variables to summarize

      
tbl_summary(heart_disease_final,
                      type = list(
                        all_dichotomous() ~ "categorical",
                         all_continuous() ~ "continuous2")
                      , statistic = all_continuous() ~ c(
                                     "{mean} ({sd})", 
                                     "{median} ({p25}, {p75})", 
                                     "{min}, {max}")
                      , digits = all_continuous() ~ 2
                      , missing = "always" # don't list missing data separately
                      ,missing_text = "Missing"
                      ) %>% 
  modify_header(label = "**Descriptives**") %>% # update the column header
  bold_labels() %>%
  italicize_levels()%>%
  add_n() # add column with total number of non-missing observations

Descriptives	N	N = 746¹
Age of the patient	746
Mean (SD)		52.88 (9.51)
Median (IQR)		54.00 (46.00, 59.00)
Range		28.00, 77.00
Missing		0
Sex of the patient	746
Female		182 (24.4%)
Male		564 (75.6%)
Missing		0
Chest pain type	746
typical angina		41 (5.50%)
atypical angina		166 (22.3%)
non-anginal		169 (22.7%)
asymptomatic		370 (49.6%)
Missing		0
Resting blood pressure (in mm Hg)	746
Mean (SD)		133.02 (17.28)
Median (IQR)		130.00 (120.00, 140.00)
Range		92.00, 200.00
Missing		0
Cholestoral in mg/dl	746
Mean (SD)		244.64 (59.15)
Median (IQR)		237.00 (207.25, 275.00)
Range		85.00, 603.00
Missing		0
Fasting blood sugar	746
<= 120mg/dl		621 (83.2%)
>120mg/dl		125 (16.8%)
Missing		0
Resting electrocardiographic results	746
normal		445 (59.7%)
having ST_T abnormality		125 (16.8%)
showing probable or definate left ventricular hypertrophy		176 (23.6%)
Missing		0
Maximum heart rate achieved	746
Mean (SD)		140.23 (24.52)
Median (IQR)		140.00 (122.00, 160.00)
Range		69.00, 202.00
Missing		0
Exercise induced angina	746
No		459 (61.5%)
Yes		287 (38.5%)
Missing		0
ST depression	746
Mean (SD)		0.90 (1.07)
Median (IQR)		0.50 (0.00, 1.50)
Range		-0.10, 6.20
Missing		0
Slope of peak exercise	746
upsloping		349 (46.8%)
flat		354 (47.5%)
downsloping		43 (5.76%)
Missing		0
Heart disease	746
No heart attack		390 (52.3%)
Has Heart attack		356 (47.7%)
Missing		0
¹ n (%)

3.1.2 Visualization

library(ggpubr)

p1 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = output), show.legend = FALSE)+
  labs(x="",y="", title = "Heart disease")+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  #guides(fill = FALSE)+
  theme_pubclean()+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        #legend.position = "none",
        axis.text.y = element_text(color="black",size=12),
        axis.text.x = element_text(color="black",size=11),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p2 <- ggplot(heart_disease_final, aes(x=sex))+
  geom_bar(aes(fill = sex), show.legend = FALSE)+
  labs(x="",y="", title = "Sex of the patient")+
  scale_y_continuous(breaks = seq(0, 600, by = 100), limits = c(0, 600))+
  theme_pubclean()+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p3 <- ggplot(heart_disease_final, aes(x=cp))+
  geom_bar(aes(fill = cp), show.legend = FALSE)+
  labs(x="",y="", title = "Chest pain Type")+
  scale_y_continuous(breaks = seq(0, 400, by = 50), limits = c(0, 400))+
  theme_pubclean()+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text.y = element_text(color="black",size=12),
        axis.text.x = element_text(color="black",size=12, angle = 90),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())


p4 <- ggplot(heart_disease_final, aes(x=fbs))+
  geom_bar(aes(fill = fbs), show.legend = FALSE)+
  labs(x="",y="", title = "Fasting blood sugar")+
  scale_y_continuous(breaks = seq(0, 750, by = 100), limits = c(0, 750))+
 theme_pubclean()+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme(axis.title= element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p5 <- ggplot(heart_disease_final, aes(x=restecg))+
  geom_bar(aes(fill = restecg), show.legend = FALSE)+
  labs(x="",y="", title = "Resting electrocardiographic results")+
  scale_y_continuous(breaks = seq(0, 500, by = 100), limits = c(0, 500))+
  theme_pubclean()+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text.y = element_text(color="black",size=12),
        axis.text.x = element_text(color="black",size=11, angle = 90),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p6 <- ggplot(heart_disease_final, aes(x=exng))+
  geom_bar(aes(fill = exng), show.legend = FALSE)+
  labs(x="",y="", title = "Exercise induced angina")+
  scale_y_continuous(breaks = seq(0, 500, by = 100), limits = c(0, 500))+
  theme_pubclean()+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())


p7 <- ggplot(heart_disease_final, aes(x=slp))+
  geom_bar(aes(fill = slp), show.legend = FALSE)+
  labs(x="",y="", title = "Slope of peak exercise")+
  scale_y_continuous(breaks = seq(0, 400, by = 50), limits = c(0, 400))+
  theme_pubclean()+
  geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "black")+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())


figure1 <- ggarrange(p1, p2, p3, p4, p5, p6, p7, 
          #labels = c("A", "B", "C","D", "E", "F", "G"),
          ncol = 4, nrow = 2)

annotate_figure(figure1,
                top = text_grob("Count plots for various categorical features",
                                color = "red", face = "bold", size = 15),
                bottom = text_grob("Data source: \n heart disease data set", color = "blue",
                                   hjust = 1, x = 0.98, face = "italic", size = 10),
                #left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
                #right = "",
                fig.lab = "Figure 1", fig.lab.face = "bold"
                )

p8 <- ggplot(heart_disease_final, aes(x=age)) + 
  geom_histogram(color="black", fill="white")+
  geom_vline(aes(xintercept=mean(age)),
            color="blue", linetype="dashed", size=1)+
  scale_y_continuous(breaks = seq(0, 85, by = 10), limits = c(0, 80))+
  scale_x_continuous(n.breaks = 10)+
  labs(x="age",y="count", title = "Age of the patient")+
  theme(axis.title = element_text(color="black",size=12),
        axis.text = element_text(color="black",size=11),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

p9 <-ggplot(heart_disease_final, aes(x=trtbps)) + 
  geom_histogram(color="black", fill="lightblue")+
  geom_vline(aes(xintercept=mean(trtbps)),
            color="blue", linetype="dashed", size=1)+
   scale_y_continuous(breaks = seq(0, 130, by = 10), limits = c(0, 130))+
  scale_x_continuous(n.breaks = 10)+
  labs(x="Resting blood pressure (in mm Hg)",y="count", title = "Resting blood pressure")+
  theme(axis.title = element_text(color="black",size=12),
        axis.text = element_text(color="black",size=11),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

p10 <-ggplot(heart_disease_final, aes(x=chol)) + 
  geom_histogram(color="black", fill="pink")+
  geom_vline(aes(xintercept=mean(chol)),
            color="blue", linetype="dashed", size=1)+
  scale_y_continuous(breaks = seq(0, 120, by = 10), limits = c(0, 120))+
  scale_x_continuous(n.breaks = 12)+
  labs(x="cholestoral in mg/dl",y="count", title = "cholestoral")+
  theme(axis.title = element_text(color="black",size=12),
        axis.text = element_text(color="black",size=11),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())

p11 <-ggplot(heart_disease_final, aes(x=thalach)) + 
  geom_histogram(color="black", fill="lightgreen")+
  geom_vline(aes(xintercept=mean(thalach)),
            color="blue", linetype="dashed", size=1)+
  scale_y_continuous(breaks = seq(0, 80, by = 10), limits = c(0, 80))+
  scale_x_continuous(n.breaks = 12)+
  labs(x="Maximum heart rate achieved",y="count", title = "Maximum heart rate")+
  theme(axis.title = element_text(color="black",size=12),
        axis.text = element_text(color="black",size=11),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())


figure2 <- ggarrange(p8, p9, p10, p11,
          ncol = 2, nrow = 2)

annotate_figure(figure2,
                top = text_grob("Histograms for various continous features",
                                color = "red", face = "bold", size = 15),
                bottom = text_grob("Data source: \n heart disease data set", color = "blue",
                                   hjust = 1, x = 0.98, face = "italic", size = 10),
                #left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
                #right = "",
                fig.lab = "Figure 2", fig.lab.face = "bold"
                )

3.2 Bivariate analysis

3.2.1 Difference Frequency table

# make dataset with variables to summarize

      
tbl_summary(heart_disease_final,
             by = output,
                      type = list(
                        all_dichotomous() ~ "categorical",
                         all_continuous() ~ "continuous2")
                      , statistic = all_continuous() ~ c(
                                     "{mean} ({sd})", 
                                     "{median} ({p25}, {p75})", 
                                     "{min}, {max}")
                      , digits = all_continuous() ~ 2
                      , missing = "always" # don't list missing data separately
                      ,missing_text = "Missing"
                      ) %>% 
  modify_header(label = "**Variables**") %>% # update the column header
  bold_labels() %>%
  italicize_levels()%>%
  add_n()%>% # add column with total number of non-missing observations
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 3),
        c(age, trtbps, chol, thalach, oldpeak) ~ "t.test",
        test.args = c(age, trtbps, chol, thalach, oldpeak) ~ list(var.equal = TRUE)) %>%
  bold_p(t= 0.05) %>% # bold p-values under a given threshold (default 0.05)
  #add_overall() %>%
  #add_difference() %>% #add column for difference between two group, confidence interval, and p-value
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Heart Disease**")  %>%
  #modify_caption("**Table 1. Patient Characteristics**")%>%
  modify_footnote(
    all_stat_cols() ~ "Mean (SD); Median (IQR); Range; Frequency (%)"
  )

Variables	N	Heart Disease		p-value²
Variables	N	No heart attack, N = 390¹	Has Heart attack, N = 356¹	p-value²
Age of the patient	746			<0.001
Mean (SD)		50.17 (9.31)	55.85 (8.82)
Median (IQR)		51.00 (43.00, 56.00)	57.00 (50.00, 62.00)
Range		28.00, 76.00	31.00, 77.00
Missing		0	0
Sex of the patient	746			<0.001
Female		142 (36.4%)	40 (11.2%)
Male		248 (63.6%)	316 (88.8%)
Missing		0	0
Chest pain type	746			<0.001
typical angina		26 (6.67%)	15 (4.21%)
atypical angina		145 (37.2%)	21 (5.90%)
non-anginal		123 (31.5%)	46 (12.9%)
asymptomatic		96 (24.6%)	274 (77.0%)
Missing		0	0
Resting blood pressure (in mm Hg)	746			<0.001
Mean (SD)		130.16 (16.14)	136.15 (17.96)
Median (IQR)		130.00 (120.00, 140.00)	136.00 (123.75, 145.00)
Range		94.00, 190.00	92.00, 200.00
Missing		0	0
Cholestoral in mg/dl	746			0.005
Mean (SD)		238.77 (55.39)	251.06 (62.46)
Median (IQR)		231.50 (203.00, 269.00)	246.00 (212.00, 283.25)
Range		85.00, 564.00	100.00, 603.00
Missing		0	0
Fasting blood sugar	746			<0.001
<= 120mg/dl		347 (89.0%)	274 (77.0%)
>120mg/dl		43 (11.0%)	82 (23.0%)
Missing		0	0
Resting electrocardiographic results	746			<0.001
normal		257 (65.9%)	188 (52.8%)
having ST_T abnormality		52 (13.3%)	73 (20.5%)
showing probable or definate left ventricular hypertrophy		81 (20.8%)	95 (26.7%)
Missing		0	0
Maximum heart rate achieved	746			<0.001
Mean (SD)		149.06 (23.11)	130.55 (22.30)
Median (IQR)		150.50 (135.00, 167.00)	130.00 (115.00, 147.00)
Range		69.00, 202.00	71.00, 195.00
Missing		0	0
Exercise induced angina	746			<0.001
No		340 (87.2%)	119 (33.4%)
Yes		50 (12.8%)	237 (66.6%)
Missing		0	0
ST depression	746			<0.001
Mean (SD)		0.39 (0.68)	1.46 (1.15)
Median (IQR)		0.00 (0.00, 0.60)	1.50 (0.50, 2.00)
Range		-0.10, 4.20	0.00, 6.20
Missing		0	0
Slope of peak exercise	746			<0.001
upsloping		304 (77.9%)	45 (12.6%)
flat		75 (19.2%)	279 (78.4%)
downsloping		11 (2.82%)	32 (8.99%)
Missing		0	0
¹ Mean (SD); Median (IQR); Range; Frequency (%)
² Two Sample t-test; Pearson's Chi-squared test

3.2.2 Visualization

p12 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = sex), show.legend = TRUE, position="dodge")+
  labs(x="",y="", title = "Sex of the patient", fill="")+
  scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
  theme_pubclean()+
  geom_text(aes(fill=sex, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
            position = position_dodge(.9))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        legend.position = "top",
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p13 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = cp), show.legend = TRUE, position="dodge")+
  labs(x="",y="", title = "Chest pain Type", fill="")+
  scale_y_continuous(breaks = seq(0, 300, by = 50), limits = c(0, 300))+
  theme_pubclean()+
  geom_text(aes(fill=cp, label = ..count..), stat = "count", vjust = 0.5, colour = "black",
            position = position_dodge(.9))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        legend.position = "top",
        axis.text.y = element_text(color="black",size=12),
        axis.text.x = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())+
  guides(fill = guide_legend(nrow = 2))


p14 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = fbs), show.legend = TRUE, position="dodge")+
  labs(x="",y="", title = "Fasting blood sugar", fill="")+
  scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
 theme_pubclean()+
  geom_text(aes(fill=fbs, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
            position = position_dodge(.9))+
  theme(axis.title= element_text(face="bold",color="black",size=13),
        legend.position = "top",
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p15 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = restecg), show.legend = TRUE, position="dodge")+
  labs(x="",y="", title = "Resting electrocardiographic results", fill="")+
  scale_y_continuous(breaks = seq(0, 300, by = 50), limits = c(0, 300))+
  theme_pubclean()+
  geom_text(aes(fill=restecg, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
            position = position_dodge(.9))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        legend.position = "top",
        axis.text.y = element_text(color="black",size=12),
        axis.text.x = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

p16 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = exng), show.legend = TRUE, position="dodge")+
  labs(x="",y="", title = "Exercise induced angina", fill="")+
  scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
  theme_pubclean()+
  geom_text(aes(fill=exng, label = ..count..), stat = "count", vjust = 1.2, colour = "black",
            position = position_dodge(.9))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        legend.position = "top",
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())


p17 <- ggplot(heart_disease_final, aes(x=output))+
  geom_bar(aes(fill = slp), show.legend = TRUE, position="dodge")+
  labs(x="",y="", title = "Slope of peak exercise", fill="")+
  scale_y_continuous(breaks = seq(0, 350, by = 50), limits = c(0, 350))+
  theme_pubclean()+
  geom_text(aes(fill=slp, label = ..count..), stat = "count", vjust = 0.5, colour = "black",
            position = position_dodge(.9))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        legend.position = "top",
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank())

figure3 <- ggarrange( p12, p13, p14, p15, p16, p17, 
          ncol = 3, nrow = 2)

annotate_figure(figure3,
                top = text_grob("Distribution by Heart disease of various categorical features",
                                color = "red", face = "bold", size = 15),
                bottom = text_grob("Data source: \n heart disease data set", color = "blue",
                                   hjust = 1, x = 0.98, face = "italic", size = 10),
                #left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
                #right = "",
                fig.lab = "Figure 3", fig.lab.face = "bold"
                )

p18 <- ggplot(heart_disease_final, aes(x=age, fill=output)) + 
  geom_density(alpha=0.4)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_continuous(breaks = seq(25, 80, by = 5), limits = c(25, 80))+
  labs(x="age",y="density", title = "Age of the patient")+
  theme(axis.title = element_text(color="black",size=12),
        legend.position = "top",
        axis.text = element_text(color="black",size=11),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())+
  guides(fill = guide_legend(title = ""))


p19 <-ggplot(heart_disease_final, aes(x=trtbps, fill=output)) + 
  geom_density(alpha=0.4)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_continuous(breaks = seq(90, 200, by = 10), limits = c(90, 200))+
  labs(x="Resting blood pressure (in mm Hg)",y="density", title = "Resting blood pressure")+
  theme(axis.title = element_text(color="black",size=12),
        legend.position = "top",
        axis.text.y = element_text(color="black",size=11),
        axis.text.x = element_text(color="black",size=9, angle = 90),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())+
  guides(fill = guide_legend(title = ""))


p20 <-ggplot(heart_disease_final, aes(x=chol, fill=output)) + 
  geom_density(alpha=0.4)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_continuous(breaks = seq(50, 650, by = 50), limits = c(50, 650))+
  labs(x="cholestoral in mg/dl",y="density", title = "cholestoral")+
  theme(axis.title = element_text(color="black",size=12),
        legend.position = "top",
        axis.text.y = element_text(color="black",size=11),
        axis.text.x = element_text(color="black",size=9, angle = 90),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())+
  guides(fill = guide_legend(title = ""))


p21 <-ggplot(heart_disease_final, aes(x=thalach, fill=output)) + 
  geom_density(alpha=0.4)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_continuous(breaks = seq(60, 210, by = 10), limits = c(60, 210))+
  labs(x="Maximum heart rate achieved",y="density", title = "Maximum heart rate")+
  theme(axis.title = element_text(color="black",size=12),
        legend.position = "top",
        axis.text.y = element_text(color="black",size=11),
        axis.text.x = element_text(color="black",size=9, angle=90),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())+
  guides(fill = guide_legend(title = ""))


p22 <- ggplot(heart_disease_final, aes(output, age))+
  geom_boxplot(aes(colour = output), outlier.colour = "black", 
               outlier.shape = 8, show.legend = FALSE)+
  labs(x="",y="age", title = "Age of the patient")+
  scale_y_continuous(breaks = seq(25, 80, by = 5), limits = c(25, 80))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())


p23 <- ggplot(heart_disease_final, aes(output, trtbps))+
  geom_boxplot(aes(colour = output), outlier.colour = "black", 
               outlier.shape = 1, show.legend = FALSE)+
  labs(x="",y="Resting blood pressure (mm Hg)", title = "Resting blood pressure")+
  scale_y_continuous(breaks = seq(90, 200, by = 10), limits = c(90, 200))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())


p24 <-ggplot(heart_disease_final, aes(output, chol))+
  geom_boxplot(aes(colour = output), outlier.colour = "black", 
               outlier.shape = 1, show.legend = FALSE)+
  labs(x="",y="cholestoral in mg/dl", title = "cholestoral")+
  scale_y_continuous(breaks = seq(50, 650, by = 50), limits = c(50, 650))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())


p25 <-ggplot(heart_disease_final, aes(output, thalach))+
  geom_boxplot(aes(colour = output), outlier.colour = "black", 
               outlier.shape = 1, show.legend = FALSE)+
  labs(x="",y="Maximum heart rate achieved", title = "Maximum heart rate")+
  scale_y_continuous(breaks = seq(60, 200, by = 10), limits = c(60, 200))+
  theme(axis.title = element_text(face="bold",color="black",size=13),
        axis.text = element_text(color="black",size=12),
        plot.title = element_text(hjust = 0.5, face="bold",color="black",size=13),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())

figure4 <- ggarrange( p18, p19, p20, p21, p22, p23, p24, p25, 
          ncol = 4, nrow = 2)

annotate_figure(figure4,
                top = text_grob("Distribution by Heart disease of various continous features",
                                color = "red", face = "bold", size = 15),
                bottom = text_grob("Data source: \n heart disease data set", color = "blue",
                                   hjust = 1, x = 0.98, face = "italic", size = 10),
                #left = text_grob("Figure arranged using ggpubr", color = "green", rot = 90),
                #right = "",
                fig.lab = "Figure 4", fig.lab.face = "bold"
                )

3.2.3 Correlation

A negative correlation implies that the two variables under consideration vary in opposite directions, that is, if a variable increases the other decreases and vice versa. On the other hand, a positive correlation implies that the two variables under consideration vary in the same direction, i.e., if a variable increases the other one increases and if one decreases the other one decreases as well

# improved correlation matrix
library(corrplot)

corrplot(cor(heart_disease_final%>%mutate(across(c(2,3,6, 7, 9, 11, 12), as.numeric)),
             method='spearman'),
  method = "color", #number
  addCoef.col = "black",
  number.cex = 0.95,
  type = "upper" # show only upper side #full
)

Only correlations with p-values smaller than the significance level (p<0.05)should be interpreted.

# correlation tests for whole dataset
library(Hmisc)
res <- rcorr(as.matrix(heart_disease_final%>%mutate(across(c(2,3,6, 7, 9, 11, 12), as.numeric)))) # rcorr() accepts matrices only

# display p-values (rounded to 3 decimals)
kable(
  round(res$P, 3)
)

	age	sex	cp	trtbps	chol	fbs	restecg	thalach	exng	oldpeak	slp	output
age	NA	0.264	0.000	0.000	0.109	0.000	0.000	0.000	0.000	0.000	0.000	0.000
sex	0.264	NA	0.000	0.349	0.003	0.009	0.713	0.000	0.000	0.001	0.000	0.000
cp	0.000	0.000	NA	0.235	0.029	0.236	0.032	0.000	0.000	0.000	0.000	0.000
trtbps	0.000	0.349	0.235	NA	0.009	0.000	0.090	0.001	0.000	0.000	0.000	0.000
chol	0.109	0.003	0.029	0.009	NA	0.141	0.064	0.588	0.018	0.110	0.032	0.005
fbs	0.000	0.009	0.236	0.000	0.141	NA	0.000	0.005	0.003	0.129	0.000	0.000
restecg	0.000	0.713	0.032	0.090	0.064	0.000	NA	0.488	0.194	0.001	0.003	0.002
thalach	0.000	0.000	0.000	0.001	0.588	0.005	0.488	NA	0.000	0.000	0.000	0.000
exng	0.000	0.000	0.000	0.000	0.018	0.003	0.194	0.000	NA	0.000	0.000	0.000
oldpeak	0.000	0.001	0.000	0.000	0.110	0.129	0.001	0.000	0.000	NA	0.000	0.000
slp	0.000	0.000	0.000	0.000	0.032	0.000	0.003	0.000	0.000	0.000	NA	0.000
output	0.000	0.000	0.000	0.000	0.005	0.000	0.002	0.000	0.000	0.000	0.000	NA

After exploring the data, a statistical model will be used to give the final conclusion of the relationship. The dependent variable, “output” is categorical with two variables while the independent variables are more than one and are a mixture of categorical and continuous data and hence a binary logistic model will be applied. This model will be trained and tested and used to make some predictions.

4 Logistic Regression

4.1 Splitting the data for training and testing

spliting the data in a 80:20 ratio (training:testing). We use set.seed() to make sure that the results are repeatable. We also use the outcome variable, output to stratify. This is to ensure that the distribution of the outcome is comparable in both data sets.

library(rsample)

set.seed(123)
split <- initial_split(heart_disease_final, prop = 0.8, strata = output)


train <- training(split) 
test <- testing(split)

#initial_time_split() takes the 1st prop samples for training,instead of random selection.

#sample=sample.split(mydata, SplitRatio = 0.80)
#train<-subset(mydata, sample==TRUE)
#test<-subset(mydata, sample==FALSE)

4.2 model

glm_logistic_reg <-glm(formula=output~.,data=train, family="binomial")


#summary(glm_logistic_reg)

4.3 model statistics

broom::glance(glm_logistic_reg)

## # A tibble: 1 x 8
##   null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
##           <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
## 1          828.     597  -198.  429.  499.     397.         582   598

#performance::check_collinearity(glm_logistic_reg)

performance::check_model(glm_logistic_reg)

#performance::performance_accuracy(glm_logistic_reg)

#plot(performance::performance_roc(glm_logistic_reg))

4.4 model results

tbl_regression(glm_logistic_reg, exponentiate = TRUE, 
    pvalue_fun = ~style_pvalue(.x, digits = 3)
  )%>%
  #add_global_p()%>% # add global p-value for categorical variables
  bold_p(t= 0.05) %>% # bold p-values under a given threshold (default 0.05)
  bold_labels() %>%
  italicize_levels()%>% 
  modify_header(label = "**Logistic regression**")%>% # update the column header
  add_significance_stars(
    pattern = "{estimate} ({conf.low}-{conf.high}){stars}",
    hide_ci = TRUE, hide_se = TRUE , hide_p = FALSE) %>%
  modify_header(estimate ~ "**AOR (95% CI)**") %>%
  modify_footnote(estimate ~ "AOR = Adjusted Odds Ratio, CI = Confidence Interval", abbreviation = TRUE)

Logistic regression	AOR (95% CI)^1,2	p-value
Age of the patient	1.03 (1.00-1.06)	0.069
Sex of the patient
Female	—
Male	5.08 (2.62-10.2)***	<0.001
Chest pain type
typical angina	—
atypical angina	0.76 (0.23-2.48)	0.641
non-anginal	0.87 (0.30-2.56)	0.801
asymptomatic	3.88 (1.40-11.3)*	0.010
Resting blood pressure (in mm Hg)	1.01 (0.99-1.03)	0.266
Cholestoral in mg/dl	1.00 (1.00-1.01)	0.098
Fasting blood sugar
<= 120mg/dl	—
>120mg/dl	1.52 (0.75-3.12)	0.251
Resting electrocardiographic results
normal	—
having ST_T abnormality	0.95 (0.45-2.02)	0.885
showing probable or definate left ventricular hypertrophy	1.24 (0.65-2.36)	0.510
Maximum heart rate achieved	1.00 (0.99-1.01)	0.936
Exercise induced angina
No	—
Yes	2.43 (1.33-4.42)**	0.004
ST depression	1.50 (1.11-2.06)**	0.010
Slope of peak exercise
upsloping	—
flat	10.9 (6.10-20.2)***	<0.001
downsloping	3.76 (1.17-12.4)*	0.027
¹ p<0.05; p<0.01; **p<0.001
² AOR = Adjusted Odds Ratio, CI = Confidence Interval

Statistically significant variables are Male(Sex), Asymptomatic (chest pain type), Yes (Angina), ST depression, flat (slope of peak exercise) and downsloping (slope of peak exercise).

Having Asymptomatic chest pain type, Angina, flat slope of peak exercise, downsloping slope of peak exercise, high ST depression and being Male are significantly associated with higher odds of having a heart attack. e.g People with Angina are 2.43 times more likely to get heart attack than those with no Angina. Males are 5.08 times more likely to get heart attack than females.

atypical angina chest pain type, non-anginal chest pain type, having ST_T abnormality sesting electrocardiographic results are associated with lower odds of heart attack. However they do not have a significant effect on heart attack.

5 Evaluating our Model

5.1 Fitting with testing data

test$prediction <-predict(glm_logistic_reg, test, type= "response")

##replacing probabilities with dummies 

test$predicted <- as.factor(ifelse(test$prediction<0.5, "No heart attack", "Has Heart attack"))

test <- test%>%
  mutate(predicted= factor(predicted, levels = c("No heart attack", "Has Heart attack")))

test1 <- test%>%
  mutate(across(c(predicted, output), fct_rev))
  
##creating a confusion matrix

table(Predicted=test1$predicted, Actual=test1$output)

##                   Actual
## Predicted          Has Heart attack No heart attack
##   Has Heart attack               68              14
##   No heart attack                 3              63

##getting percentage of misclassification

1-sum(diag(table(Predicted=test1$predicted, Actual=test1$output)))/sum(table(Predicted=test1$predicted, Actual=test1$output))

## [1] 0.1148649

##getting percentage of accuracy
sum(diag(table(Predicted=test1$predicted, Actual=test1$output)))/sum(table(Predicted=test1$predicted, Actual=test1$output))

## [1] 0.8851351

##Testing for goodness of fit. 

with(glm_logistic_reg, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))

## [1] 2.407305e-82

5.2 Confussion matrix

library(caret)
confusionMatrix(test1$predicted, test1$output)

## Confusion Matrix and Statistics
## 
##                   Reference
## Prediction         Has Heart attack No heart attack
##   Has Heart attack               68              14
##   No heart attack                 3              63
##                                           
##                Accuracy : 0.8851          
##                  95% CI : (0.8225, 0.9316)
##     No Information Rate : 0.5203          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7713          
##                                           
##  Mcnemar's Test P-Value : 0.01529         
##                                           
##             Sensitivity : 0.9577          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.8293          
##          Neg Pred Value : 0.9545          
##              Prevalence : 0.4797          
##          Detection Rate : 0.4595          
##    Detection Prevalence : 0.5541          
##       Balanced Accuracy : 0.8880          
##                                           
##        'Positive' Class : Has Heart attack
##

True positives (TP) we have 68 correctly classified instances of heart attack
True negatives(TN) we have 63 cases classified as not having heart attack
False negatives (FN) we have 3 cases our model said the patient didn’t have heart attack and they actually did
False positives (FP) we have 14 cases our model said a patient did have heart attack, but they actually didn’t
Sensitivity (Recall) The true positive rate is 0.9577 meaning from all the patients that had heart attack ; how many did we predict correctly - this equation is Sensitivity =TP / (TP + FN)
Specificity The true negative rate is 0.8182 meaning from all the patients that didn’t have heart attack - how many did we predict correctly - this equation is Specificity =TN / (TN + FP)
Positive Predictive Value (Precision) is 0.8293 meaning from all the classes we predicted as positive, how many were actually positive. equation is Positive Predictive Value=TP/(TP + FP)
Negative Predictive Value is 0.9545 meaning from all the classes we predicted as negative, how many were actually negative. equation is Negative Predictive Value=TN/(TN + FN)
Accuracy of the model is 88.5% with p value of 2.407305e-82 (less than the default level of significance (0.05)).

5.3 ROC curve

plot(performance::performance_roc(glm_logistic_reg, new_data = test))

Our model did relatively well and can be used in making predictions

Predicting heart disease