Note

This is a modified version of a report I created in Summer 2017 on behalf of another organization. The original data and the associated problem are non-confidential. Anything identifiable (for example, variable names) has been changed to a generic alternative.

Executive Summary

We performed a logistic lasso to create a logistic regression model for the data provided to us by Newton. Our best regression equation is the following:

\(\LARGE{\boxed{\hat{P} = \frac{e^{-0.25 - 0.19X_{73} + 0.0029X_{299} + 0.022X_{16} + 0.073X_{77} + 0.0054X_{169} + 0.010X_{157}}}{1+e^{-0.25 - 0.19X_{73} + 0.0029X_{299} + 0.022X_{16} + 0.073X_{77} + 0.0054X_{169} + 0.010X_{157}}}}}\)

If \(\hat{P} \geq 0.5\), we predict that \(Y=1\). If not, we predict \(Y=0\).

We outline our model creation process in the sections below. We were given questions to answer about our work. Short answers follow. More context for these answers is included in remainder of the report.

We estimate that our model has a misclassification rate of 0.39, which is better than the base rate of 0.50.
It is important to balance one’s dataset because a model could be tuned to whatever the dominant outcome variable is. The model would overcompensate for the more common variable, and eventually predict no better than the base rate.
We selected the most predictive independent variables through the logistic lasso optimization process in cv.glmnet().
We divided the data into build (training) and test sets to avoid over-fitting a model to a sub-optimal solution. We also used cross-validation for this reason.
We test model performance on a left-out test set. This is more appropriate and less optimistic than using mean cross-validated error, because if many models are tested, even cross-validation can select a sub-optimal model through random chance.
Our Random Forest model has a misclassification rate of 0.31. It makes better predictions than our logistic model, but takes longer to run on new data than a logistic model.
A multiple-stage model could lead to better predictions.

Exploratory Data Analysis (EDA)

As described by Newton, the outcome variable is \(Y\) and it is binary, taking either \(1\) or \(0\). The dataset has already been balanced, meaning there are equal numbers of \(Y=1\) observations as there are \(Y=0\). It is important to balance one’s dataset because a model could be tuned to whatever the dominant outcome variable is. If one outcome is much rarer than the other, a model could just predict the common outcome every time and have fairly high performance in terms of mean-squared error. Such a model would be less useful than a model created from a balanced dataset that, on paper, had worse mean-squared error performance.

table(newton$y)

## 
##   0   1 
## 236 236

Sample Splitting

We split the data into a main set and a left-out test set before doing any model selection. Later, for model selection, we do cross-validation. All future comments regarding EDA or model building solely consider the training set. Model tuning will not be done on the left-out test set. This avoids over-fitting: as the number of tested models approaches infinity, random chance will make a suboptimal model look like the best choice on a sample from the population of interest. Cross-validation alone cannot insure against overfitting if many models are tested.

Here, we sample rows that will be put into the left-out sample. We decide that 10% of rows will become the test set. Our sample is small enough that we do not want to use a larger percentage for the test set.

#random.org/integers/?num=1&min=1&max=21474836&col=5&base=10&format=html&rnd=new
set.seed(18046675)
samp.size <- .1 * nrow(newton)
y.rows <- which(newton$y == 1)
non.y.rows <- which(newton$y != 1)
#/2 for the y=0 rows to be sampled equally
y.row.samples <- sample(y.rows, size = samp.size/2, replace = FALSE)
non.y.row.samples <- sample(non.y.rows, size = samp.size/2, replace = FALSE)
row.samples <- c(y.row.samples, non.y.row.samples)

Next, we create the training and left-out samples. We also separate the \(Y\) column from the predictor columns. This is for convenience in generating summary statistics and EDA solely from the predictors and not \(Y\).

te.newton <- newton[row.samples,]
newton <- newton[-row.samples,]
y.newton <- as.factor(newton$y)
te.y.newton <- as.factor(te.newton$y)
te.newton <- te.newton[,-which(colnames(te.newton) == "y")]
newton <- newton[,-which(colnames(newton) == "y")]

We verify that the main and test sets both have equal numbers of \(Y=1\) and \(Y=0\) observations:

table(y.newton)

## y.newton
##   0   1 
## 213 213

table(te.y.newton)

## te.y.newton
##  0  1 
## 23 23

Quantiles

We are interested in the proportion of observations that are above zero for each predictor.

quants <- matrix(nrow = 3, ncol = ncol(newton))
for (i in 1:ncol(newton)){
  quants[1:3,i] <- quantile(newton[,i], probs = c(.5, .85, .86))
}
quantile.zeros <- numeric()
for (i in 1:nrow(quants)){
  quantile.zeros[i] <- 100 * length(which(quants[i,] == 0)) / ncol(quants)
}
names(quantile.zeros) <- c("Median", "85th Percentile", "86th Percentile")

We find our answer for quantiles of interest. 71% of predictors are median-zero. Below the 86th percentile, less than 50% of the predictors are individually above zero. These statistics indicate that many of the predictors are often zero.

round(quantile.zeros, 1)

##          Median 85th Percentile 86th Percentile 
##            71.3            52.1            49.4

Model Creation

Logistic Lasso

We have been asked to create a logistic regression to predict \(Y\) from the \(X_i\) predictors. We will use a logistic lasso (formerly known as logistic LASSO) to find useful predictors for predicting Y. If the lasso does well, it will reduce the number of not helpful or marginally helpful predictors in our model. This will reduce variance and improve predictive performance of the model on new data.

We also cross-validate in the model creation step. We choose 5-fold cross validation, due to the small size of our sample. \(k=10\) is another common choice for \(k\)-fold cross validation. We argue that the variance introduced by validating the model against a small subsample of size \(\frac{426}{k=10} \approx 43\) is not worth the reduction in model bias as compared to a hypothetical perfect model.

This raises a question related to our test set, which contains 46 observations. Why not choose a larger test set? Our test set is small because variance in the validation of the model is far less important than variance in the creation of the model. The validation step of cross-validation is still part of the model creation process. The test-set validation is outside of that process. Our estimate of model performance would have lower variance with a larger test set, but the quality of the model’s predictions would decrease as well. This is because there would be fewer observations to use while creating the model.

For \(\lambda\), we choose \(\lambda_{1SE}\). This is the largest \(\lambda\) value that is within 1 standard error of the \(\lambda\) that minimizes the mean cross-validated error. \(\lambda_{1SE}\) is less likely than the minimum value of \(\lambda\) to be overfitted to the sample used to create the model.

library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-10

library(doParallel)

## Loading required package: iterators

## Loading required package: parallel

registerDoParallel()
#family="binomial" for logistic regression
out.glm <- cv.glmnet(x = as.matrix(newton), y = y.newton, nfolds = 5, 
                     family="binomial", parallel = TRUE)
#for readability, we explicitly define s = "lambda.1se" even though this is the default
preds <- predict(out.glm, newx = as.matrix(te.newton), type="response", s = "lambda.1se")
obs.vs.expect <- cbind(te.y.newton, ifelse(preds >= .5, 1, 0))
confusion.mat <- table(obs.vs.expect[,1], obs.vs.expect[,2])

We extract the coefficients from our model and determine which coefficients are not equal to zero:

coefs.glm <- coef(out.glm, s = "lambda.1se")
nonzero.coefs <- which(coefs.glm != 0)

We verify that all other coefficients are 0 at our specified lambda value:

table(coefs.glm[-nonzero.coefs,])

## 
##   0 
## 437

Our best regression equation is of this form:

\(\LARGE{\boxed{\hat{P} = \frac{e^{a + bX}}{1+e^{a + bX}}}}\)

\(\hat{P}\) is the predicted probability that \(Y=1\).
\(e\) is the base of the natural log, around 2.71.
\(a\) is the intercept, around -0.25.
\(b\) is the slope corresponding to each predictor. Most \(b\) values are 0, but, for example, \(b_{73}\) = -0.19.
\(X\) is each predictor, from \(X_{1}\) through \(X_{443}\).

Substituting in our calculated values, and simplifying to remove zeros (most \(b\) values), we get the following equation, rounded to two significant figures.

If \(\hat{P} \geq 0.5\), we predict that \(Y=1\). If not, we predict \(Y=0\).

We rounded to two significant figures above. However, the following R code provides values for each non-zero \(b\), as well as the intercept \(a\), to a higher number of significant digits.

coefs.glm[nonzero.coefs,]

##  (Intercept)          x73         x299          x16          x77 
## -0.252138815 -0.194817135  0.002854973  0.021978266  0.073051477 
##         x169         x157 
##  0.005416745  0.010194576

The above code indicates just 6 predictors and 1 intercept, from an initial pool of 443 possible predictors. Clearly, the lasso reduced the number of potential predictors in the model. It is not surprising that many predictors were removed in the lasso process: as described in the EDA section, many predictor observations were predominantly zero. Since many predictors were almost entirely the same observed value, they could not be used very well to distinguish between \(Y=0\) and \(Y=1\).

Removing these unneeded predictors reduces variance and thus error when predicting with new data. The 6 predictors used in the lasso model are the most predictive independent variables for a logistic model. We selected them through the logistic lasso optimization process in cv.glmnet().

There are multiple ways for a logistic regression package in R to describe the relationship between predictors and the response variable \(Y\). The package we used, glmnet, even has multiple modes for its functions. It is wise to verify that the equation we describe is how the package glmnet is operating. To do this, we manually calculate \(\hat{P}\) from the formula above, and compare it to the output of predict().

orig.data.preds <- predict(out.glm, newx = as.matrix(newton),
                           type="response", s = "lambda.1se")

#notation follows this image: http://faculty.cas.usf.edu/mbrannick/regression/gifs/lo3.gif
test_logistic_formula <- function(row){
  vectorized.products <- coefs.glm[nonzero.coefs,][2:7] * 
    newton[row,nonzero.coefs-1]
  bx <- sum(vectorized.products)
  a <- coefs.glm[nonzero.coefs,][1]
  p.test <- exp(a + bx)/(1 + exp(a + bx))
  c(p.test, orig.data.preds[row])
}

preds.through.two.methods <- matrix(ncol=2, nrow=nrow(newton))
for (i in 1:nrow(newton)){
  preds.through.two.methods[i,1:2] <- test_logistic_formula(row = i)[1:2]
}

We confirm that the above regression equation matches the internal formula used by glmnet:

all.equal(target = preds.through.two.methods[,1], current = preds.through.two.methods[,2])

## [1] TRUE

We generate a confusion matrix for the logistic model.

val_from_confusion_mat <- function(mat){
  #columns of mat: expected. rows of mat: observed
  true.negative <- mat[1,1]
  false.positive <- mat[1,2]
  false.negative <- mat[2,1]
  true.positive <- mat[2,2]
  total <- sum(mat)
  misclass.rate <- (false.positive + false.negative) / total
  purity <- true.positive / (true.positive + false.positive)
  completeness <- true.positive / (true.positive + false.negative)
  out.vec <- c(misclass.rate, completeness, purity)
  names(out.vec) <- c("Misclassification Rate", "Completeness", "Purity")
  return(out.vec)
}

The base rate is 50%. This makes sense because of how the data were collected.

base.rate <- length(which(y.newton != 0))/length(y.newton)
base.rate

## [1] 0.5

We calculate some values of interest from the matrix, using the left-out test set that was not used in the cross-validation process. This is more appropriate and less optimistic than using mean cross-validated error, because if many models are tested, even cross-validation can select a sub-optimal model through random chance. A sub-optimal model would have a better mean cross-validated error than other candidate models. However, when tested on new data, it would have worse performance than suggested by its cross-validated error.

We have no information on whether \(Y=0\) or \(Y=1\) is more important to predict, so we assume they are equally important. This makes misclassification rate a logical choice for measuring model performance.

logistic.conf <- val_from_confusion_mat(confusion.mat)
logistic.conf

## Misclassification Rate           Completeness                 Purity 
##              0.3913043              0.7826087              0.5806452

Potential Gains From a More Complex Model

We briefly consider a Random Forest model. Random Forest is more complicated than logistic regression or logistic lasso. A Random Forest outperforms our best logistic model by 5.8 percentage points.

library("randomForest")

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

rf.newton <- randomForest(x=newton, y=y.newton, xtest = te.newton, ytest = te.y.newton)
rf.conf <- val_from_confusion_mat(rf.newton$confusion)
rf.conf

## Misclassification Rate           Completeness                 Purity 
##              0.3328125              0.6572770              0.6698565

perc.point.diff = 100 * (logistic.conf[1] - rf.conf[1])
names(perc.point.diff) = "Percentage Point Difference, Logistic-RF"
perc.point.diff

## Percentage Point Difference, Logistic-RF 
##                                 5.849185

This suggests that the Random Forest could implicitly include interactions that the logistic lasso does not consider. Compared to logistic lasso, the Random Forest model has a superior misclassification rate, but identifies a smaller percentage of all possible \(Y=1\) cases. Company goals could determine the appropriateness of one model over the other. Random Forest has a longer run time, and is less well-suited for real-time or big-data contexts.

Conclusion and Future Work

Our approach has room for improvement. It has superior performance to simply guessing \(Y=1\) at the base rate. (That is, predicting \(Y=1\) for 50% of any new observations.) Within the limitations of this sample (a small sample with unlabeled predictors), the most likely source of predictive improvement is with other model-building methods.

With more time, we would examine the Random Forest and try to determine how it achieves superior predictive performance to the logistic model. This would help us choose interaction terms that might get the logistic model closer to the Random Forest model’s predictive ability.

A multiple-stage classifier could be worth investigating. A first stage could try to remove cases that are obviously \(Y=0\) or \(Y=1\) from the prediction process. This would allow a second stage to be more finely-tuned for difficult observations in which the difference between \(Y=0\) and \(Y=1\) is harder to determine.

Even without improving the model’s predictive ability, the model could be made more useful. This depends on the model’s intended application. If \(Y=1\) is very costly for a business, the model could be adjusted to have multiple tiers. The tiers could be low-risk, medium-risk, and high-risk, or something similar. The tiers would correspond to where a predicted \(\hat{P}\) lies in the interval \([0, 1]\) for each new set of observed predictors \(X_1 ... X_{443}\).

If there is an intervention that could be applied to the medium-risk pool that reduces the predicted likelihood of \(Y=1\), then a business could use the intervention on the medium-risk pool, use a different treatment for the high-risk pool, and a different intervention or perhaps no intervention on the low-risk pool.

Statistics Report - Newton

Andy Manka