Probability is the very guide of life.
Given a two-column dataset, with age in column one and the presence or absence of heart disease in column two, we build a model (in R) that predicts the probability of heart disease at a given age. A realistic model ought to be built on large datasets with additional predictor variables such as blood pressure, cholesterol, diabetes, smoking, etc. However, the one and only predictor variable we have is age, and the sample size is 100 subjects!
Plotting the data (see below) doesn’t really provide a clear picture of the nature of the relationship between heart-disease and age. The problem is that the response variable (presence/absence of heart disease) is binary.
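Let’s create intervals of the independent variable (age) and compute, within each interval, the frequency of occurrence of the response variable (presence/absence of heart disease). The table below can be downloaded; the R code at the end of this post reads it from frequency-table-of-age-group-by-chd.txt.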
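The binning step can be sketched in R as follows. This is only a minimal illustration on made-up data: the data frame toy, its values, and the interval boundaries are assumptions for the example, not the 100-subject dataset or the age groups used in this post.

```r
# Made-up example data (NOT the dataset from this post):
# Age in years, CHD = 1 if heart disease is present, 0 if absent.
toy <- data.frame(
  Age = c(22, 25, 28, 33, 36, 38, 41, 44, 47, 52, 55, 58, 61, 64, 67),
  CHD = c(0,  0,  0,  0,  1,  0,  0,  1,  1,  0,  1,  1,  1,  1,  1)
)

# Hypothetical interval boundaries: [20,30), [30,40), ..., [60,70)
breaks <- seq(20, 70, by = 10)
toy$age.group <- cut(toy$Age, breaks = breaks, right = FALSE)

# Count subjects with heart disease and total subjects per interval
chd.present     <- tapply(toy$CHD, toy$age.group, sum)
age.group.total <- tapply(toy$CHD, toy$age.group, length)

# Assemble the frequency table and the proportion with heart disease
freq <- data.frame(
  age.group       = levels(toy$age.group),
  chd.present     = as.vector(chd.present),
  chd.absent      = as.vector(age.group.total - chd.present),
  age.group.total = as.vector(age.group.total)
)
freq$proportion <- freq$chd.present / freq$age.group.total
print(freq)
```

Unlike the raw 0/1 values, the proportion column takes values between 0 and 1, so it can be plotted against age as a curve.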
Clearly, the curve of the proportion with heart disease against age is non-linear, but the logit transform makes the relationship linear.
logit(y) = b0 + b1x
Thus, logistic regression is a linear model for the logit transform of y, logit(y) = log(y / (1 - y)), where y is the probability of success (here, heart disease being present) at each value of x. Logistic regression estimates b0 and b1, the regression coefficients.
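To make the transform concrete, here is a small sketch of the logit and its inverse in R. The coefficient values b0 and b1 below are hypothetical, chosen only for illustration; they are not the values fitted in this post.

```r
# Logit (log-odds) of a probability p, and its inverse
logit     <- function(p) log(p / (1 - p))
inv.logit <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
b0  <- -5
b1  <- 0.1
age <- c(30, 50, 70)

log.odds <- b0 + b1 * age     # linear in age
p <- inv.logit(log.odds)      # non-linear in age, always between 0 and 1
print(round(p, 3))

# Round trip: applying logit() to p recovers the linear predictor
stopifnot(isTRUE(all.equal(logit(p), log.odds)))

# Base R provides the same pair as qlogis() (logit) and plogis() (inverse)
stopifnot(isTRUE(all.equal(qlogis(p), logit(p))))
stopifnot(isTRUE(all.equal(plogis(log.odds), p)))
```

The linear predictor b0 + b1x can be any real number, while the inverse logit always maps it back to a valid probability.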
The glm function in R fits generalized linear models and can be used for logistic regression by setting the family argument to binomial with the logit link, like so:
> glm.out = glm(cbind(chd.present, chd.absent) ~ age.mean,
+               family=binomial(logit), data=frequency.coronary.data)
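Once a model of this form is fitted, it can be used to predict the probability of heart disease at a given age. The sketch below assumes a small made-up frequency table (toy.freq, with invented counts) so that it runs on its own; only the column names and the glm call mirror the ones above.

```r
# Made-up grouped data (NOT the table from this post); the column names
# mirror the ones used in the glm() call above.
toy.freq <- data.frame(
  age.mean    = c(25, 35, 45, 55, 65),
  chd.present = c(1, 3, 5, 8, 9),
  chd.absent  = c(9, 7, 5, 2, 1)
)

toy.fit <- glm(cbind(chd.present, chd.absent) ~ age.mean,
               family = binomial(logit), data = toy.freq)

# Fitted coefficients: b0 (intercept) and b1 (slope on the log-odds scale)
b <- coef(toy.fit)
print(b)

# exp(b1) is the multiplicative change in the odds per extra year of age
print(exp(b["age.mean"]))

# Predicted probability of heart disease at ages 40 and 60
new.ages <- data.frame(age.mean = c(40, 60))
p.hat <- predict(toy.fit, newdata = new.ages, type = "response")
print(p.hat)

# The same numbers by hand: inverse logit of b0 + b1 * age
p.manual <- plogis(b[1] + b[2] * new.ages$age.mean)
stopifnot(isTRUE(all.equal(unname(p.hat), unname(p.manual))))
```

Passing type = "response" asks predict for probabilities; the default returns values on the log-odds (link) scale instead.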
Plotting the fit shows us the close relationship between the fitted values and the observed values.
Below is the R code that generated the plots.
rm(list=ls())

coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/AGE-CHD-Y-N.txt", header=TRUE)
plot(CHD ~ Age, data=coronary.data, col="red")
title(main="Scatterplot of presence/absence of \ncoronary heart disease by age \nfor 100 subjects")

library(calibrate) #needed to label observation
frequency.coronary.data <- read.table("http://www.shatterline.com/MachineLearning/data/frequency-table-of-age-group-by-chd.txt", header=TRUE)
frequency.coronary.data[,"age.mean"] <- (frequency.coronary.data$age.start + frequency.coronary.data$age.end)/2
frequency.coronary.data <- frequency.coronary.data[, c(1,2,6,3,4,5)] #reorder cols

#With "family=" set to "binomial" with a "logit" link,
# glm( ) produces a logistic regression
glm.fit = glm(cbind(chd.present, chd.absent) ~ age.mean,
              family=binomial(logit), data=frequency.coronary.data)
summary(glm.fit)

plot(chd.present/age.group.total ~ age.mean, data=frequency.coronary.data)
lines(frequency.coronary.data$age.mean, glm.fit$fitted, type="l", col="red")
textxy(frequency.coronary.data$age.mean,
       frequency.coronary.data$chd.present/frequency.coronary.data$age.group.total,
       frequency.coronary.data$age.mean, cx=0.6)
title(main="Percentage of subjects with heart disease in each age group")
- Applied Logistic Regression, David W. Hosmer, Jr., Stanley Lemeshow, Rodney X. Sturdivant