# 06.23.13

## Heart-Disease Predictor Using Logistic Regression

*Probability is the very guide of life.
- Cicero*

Given a two-column dataset, with age in the first column and the presence or absence of heart disease in the second, we build a model (in R) that predicts the probability of heart disease at a given age. A realistic model ought to use a large dataset with additional predictor variables such as blood pressure, cholesterol, diabetes and smoking. Here, however, the one and only predictor variable is age, and the sample size is just 100 subjects!

Plotting the data (see below) doesn’t really provide a clear picture of the nature of the relationship between heart-disease and age. The problem is that the response variable (presence/absence of heart disease) is binary.

Let’s create intervals of the independent variable (age) and compute the frequency of occurrence of the response variable (presence/absence of heart disease). You can get the table below here.
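As a rough sketch of the binning step described above, the snippet below groups subjects into age bands and counts cases per band. The data frame `toy` and its values are invented purely to show the mechanics, and the 10-year bands are an assumption; the column names `chd.present`, `chd.absent` and `age.group.total` follow the ones used in the code later in this post.

```r
# Hypothetical mini-sample: values are invented, not taken from the real dataset
toy <- data.frame(
  Age = c(22, 27, 33, 36, 41, 44, 48, 52, 55, 58, 63, 67),
  CHD = c(0,  0,  0,  1,  0,  1,  0,  1,  1,  0,  1,  1)
)

# Assign each subject to a 10-year age band: [20,30), [30,40), ..., [60,70)
toy$age.group <- cut(toy$Age, breaks = seq(20, 70, by = 10), right = FALSE)

# Count cases and subjects within each band
chd.present     <- tapply(toy$CHD, toy$age.group, sum)
age.group.total <- tapply(toy$CHD, toy$age.group, length)

freq <- data.frame(
  age.group       = names(chd.present),
  chd.present     = as.vector(chd.present),
  chd.absent      = as.vector(age.group.total - chd.present),
  age.group.total = as.vector(age.group.total)
)

# Observed proportion with heart disease in each band
freq$proportion <- freq$chd.present / freq$age.group.total
print(freq)
```

Unlike the raw 0/1 response, the per-band proportion can take intermediate values, which is what makes the trend with age visible.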

A short and lucid tutorial on logistic regression is available as text and as video (both are listed under References). The logistic curve is an S-shaped curve that takes the form,

y = [exp(b_{0} + b_{1}x)] / [1 + exp(b_{0} + b_{1}x)]

Clearly, the curve is *non-linear*, but the logit transform, logit(y) = log[y / (1 - y)], makes it linear.

logit(y) = b_{0} + b_{1}x
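The step from the logistic curve to the linear form can be written out as follows, taking logit(y) to be the natural log of the odds y / (1 - y):

$$
\begin{aligned}
y &= \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} \\
1 - y &= \frac{1}{1 + e^{b_0 + b_1 x}} \\
\frac{y}{1 - y} &= e^{b_0 + b_1 x} \\
\operatorname{logit}(y) = \ln\frac{y}{1 - y} &= b_0 + b_1 x
\end{aligned}
$$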

Thus, logistic regression is a linear model for the logit transform of y, where y is the probability of success at each value of x. Fitting the model estimates the regression coefficients b_{0} and b_{1}; the fit is done by maximum likelihood rather than by ordinary least squares.
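A quick numerical check of this relationship, as a sketch: the helper functions `logistic` and `logit` are defined here for illustration, and the coefficients are arbitrary values rather than fitted ones. R's built-in `plogis` and `qlogis` compute the same two transforms.

```r
# Logistic curve and logit transform, written out from the formulas above
logistic <- function(x, b0, b1) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
logit    <- function(y) log(y / (1 - y))

# Arbitrary illustrative coefficients (NOT estimates from the heart-disease data)
b0 <- -5
b1 <- 0.1

age <- seq(20, 70, by = 10)
y   <- logistic(age, b0, b1)

# The logit of y is exactly the linear predictor b0 + b1 * age
print(all.equal(logit(y), b0 + b1 * age))

# The built-in equivalents agree with the hand-written versions
print(all.equal(y, plogis(b0 + b1 * age)))
print(all.equal(logit(y), qlogis(y)))
```

With these coefficients the linear predictor is zero at age 50, so the curve passes through a probability of 0.5 there.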

The glm function in R (part of the built-in stats package) fits generalized linear models and can be used for logistic regression by setting the **family** parameter to **binomial** with the logit link, like so:

    glm.fit = glm(cbind(chd.present, chd.absent) ~ age.mean,
                  family=binomial(logit), data=frequency.coronary.data)
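The `cbind(successes, failures)` response in the call above is one of the forms `glm` accepts for grouped binomial data. An equivalent form gives the observed proportion as the response and the group size as `weights`. The sketch below checks that both forms yield the same coefficients. The table `toy.freq` is invented for illustration and is not the data from this post; only its column names mirror the real table.

```r
# Hypothetical grouped table: invented counts, column names mirror the post's table
toy.freq <- data.frame(
  age.mean    = c(25, 35, 45, 55, 65),
  chd.present = c(1, 3, 6, 9, 12),
  chd.absent  = c(14, 12, 9, 6, 3)
)
toy.freq$age.group.total <- toy.freq$chd.present + toy.freq$chd.absent

# Form 1: two-column matrix of (successes, failures)
fit.counts <- glm(cbind(chd.present, chd.absent) ~ age.mean,
                  family = binomial(logit), data = toy.freq)

# Form 2: proportion of successes, with the number of trials as weights
fit.props <- glm(chd.present / age.group.total ~ age.mean,
                 weights = age.group.total,
                 family = binomial(logit), data = toy.freq)

# Both forms describe the same likelihood, so the estimates agree
print(coef(fit.counts))
print(all.equal(coef(fit.counts), coef(fit.props)))
```

The two-column form is convenient when the table already stores counts of present and absent cases, as it does here.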

Plotting the fit shows us the close relationship between the fitted values and the observed values.
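Beyond comparing fitted and observed values at the age groups in the table, a fitted model can be queried at any age with `predict`. The sketch below uses an invented grouped table (`toy.freq`, not the data from this post) and hypothetical query ages. It also recomputes the probabilities by hand from the coefficients to show that `type = "response"` simply applies the logistic curve to b_{0} + b_{1}x.

```r
# Hypothetical grouped table: invented counts used only to demonstrate predict()
toy.freq <- data.frame(
  age.mean    = c(25, 35, 45, 55, 65),
  chd.present = c(1, 3, 6, 9, 12),
  chd.absent  = c(14, 12, 9, 6, 3)
)

fit <- glm(cbind(chd.present, chd.absent) ~ age.mean,
           family = binomial(logit), data = toy.freq)

# Ages at which we want a predicted probability (the column name must match the model)
new.ages <- data.frame(age.mean = c(30, 50, 70))

# type = "response" returns probabilities; the default returns the logit scale
p.hat <- predict(fit, newdata = new.ages, type = "response")

# Same numbers by hand: logistic curve applied to the linear predictor
b <- coef(fit)
p.manual <- plogis(b[["(Intercept)"]] + b[["age.mean"]] * new.ages$age.mean)

print(cbind(new.ages, p.hat, p.manual))
```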

Below is the R code that generated the plots.

    rm(list=ls())

    coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/AGE-CHD-Y-N.txt", header=TRUE)
    plot(CHD ~ Age, data=coronary.data, col="red")
    title(main="Scatterplot of presence/absence of \ncoronary heart disease by age \nfor 100 subjects")

    library(calibrate) #needed to label observations
    frequency.coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/frequency-table-of-age-group-by-chd.txt", header=TRUE)
    frequency.coronary.data[,"age.mean"] <- (frequency.coronary.data$age.start + frequency.coronary.data$age.end)/2
    frequency.coronary.data <- frequency.coronary.data[, c(1,2,6,3,4,5)] #reorder cols

    #With "family=" set to "binomial" with a "logit" link,
    # glm( ) produces a logistic regression
    glm.fit = glm(cbind(chd.present, chd.absent) ~ age.mean,
                  family=binomial(logit), data=frequency.coronary.data)
    summary(glm.fit)

    plot(chd.present/age.group.total ~ age.mean, data=frequency.coronary.data)
    lines(frequency.coronary.data$age.mean, glm.fit$fitted, type="l", col="red")
    textxy(frequency.coronary.data$age.mean,
           frequency.coronary.data$chd.present/frequency.coronary.data$age.group.total,
           frequency.coronary.data$age.mean, cx=0.6)
    title(main="Percentage of subjects with heart disease in each age group")


References

- http://www.youtube.com/watch?v=qSTHZvN8hzs&list=WL980F0C0E5B4CD53D#t=24m03s
- http://ww2.coastal.edu/kingw/statistics/R-tutorials/logistic.html
- Applied Logistic Regression, David W. Hosmer, Jr., Stanley Lemeshow, Rodney X. Sturdivant