Supervised Learning with Prostate Cancer

Final for Supervised Learning: Classification

Rohan Lewis

2021.07.13

I. Introduction

Prostate health is a serious and widespread concern for men in the US. This project explores predicting several response variables from a sample prostate cancer dataset.

II. Data

For my final project, I chose the prostate cancer data from the datasets accompanying "The Elements of Statistical Learning."

1. Packages
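The original package list is not reproduced here; below is a minimal sketch of the imports the rest of this analysis assumes (all later code sketches reuse them):

```python
# Assumed imports for the analysis (the original package list is not shown).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, validation_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
```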

2. Read Data

Details can be found here.
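As a sketch, the data could be loaded directly from the book's website; the URL, the tab-separated format, the column renaming to the names used in this report (e.g. mapping gleason and pgg45 to Gleason1 and Gleason2), and dropping the book's train/test indicator are my assumptions:

```python
# Hypothetical loading step: the ESL prostate dataset is tab-separated with an index column.
url = "https://hastie.su.domains/ElemStatLearn/datasets/prostate.data"
prostate = pd.read_csv(url, sep="\t", index_col=0)

# Assumed renaming to match the variable names used in this report.
prostate = prostate.rename(columns={
    "lcavol": "logCV", "lweight": "logPW", "age": "Age",
    "lbph": "logBPH", "svi": "SVI", "lcp": "logCP",
    "gleason": "Gleason1", "pgg45": "Gleason2", "lpsa": "logPSA",
})
prostate = prostate.drop(columns=["train"])  # drop the book's original train/test flag
```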

3. Objective

The objective of this study is to predict three outcomes: Gleason score, prostate-specific antigen (PSA) level, and seminal vesicle invasion (SVI). Each is treated as a classification question in the models below.

III. Exploratory Data Analysis

There are no missing values and several variables have already been log transformed.

1. Distributions and Histograms
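The histogram figures are not reproduced in this text; a minimal sketch of how they could be generated:

```python
# Sketch: histogram of every variable on one grid (figures not reproduced here).
prostate.hist(figsize=(10, 8), bins=15)
plt.tight_layout()
plt.show()
```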

2. Skew

Age is left skewed.

logCV, logPSA, and logPW have been appropriately log transformed and are close to normally distributed.

logBPH, logCP, and Gleason2 all have over 40% of their observations in the lowest-value bin.

Gleason1 and SVI are both categorical and are unbalanced.

3. Correlation

Gleason1 and Gleason2 have the highest correlation at around 0.75.

logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.

No variables will be removed, as multicollinearity is not a concern for the models used here.
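As a sketch, the correlations behind these observations can be read off the data frame's correlation matrix:

```python
# Pearson correlation matrix of all predictors and outcomes.
corr = prostate.corr()

# Highest off-diagonal correlation, e.g. Gleason1 vs. Gleason2 (~0.75).
print(corr.loc["Gleason1", "Gleason2"])
```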

IV. Classification Models

1. Outcome Variables

When logPSA is set as the outcome variable, it is binned into clinical risk categories based on the PSA level in ng/mL.

The table below summarizes the corresponding labels created:

| Prostate-Specific Antigen (ng/mL) | log(PSA) Output | Classification Label |
|---|---|---|
| $< 4$ | $< 1.3863$ | Low Risk |
| $4 - 10$ | $1.3863 - 2.3026$ | Borderline Risk |
| $> 10$ | $> 2.3026$ | High Risk |
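As a sketch, these cutoffs can be applied to logPSA with pandas; the binning call itself is my assumption, while the thresholds are $\ln 4 \approx 1.3863$ and $\ln 10 \approx 2.3026$ from the table above:

```python
# Bin logPSA at log(4) and log(10) into the three risk labels from the table above.
psa_labels = pd.cut(
    prostate["logPSA"],
    bins=[-np.inf, np.log(4), np.log(10), np.inf],
    labels=["Low Risk", "Borderline Risk", "High Risk"],
)
```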

2. Train Test Split

The data will be split into training (70%) and test (30%) sets. Then, for each of the three outcome variables, that variable is set as the outcome and the appropriate remaining variables are kept as predictors.
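A minimal sketch of this split, using the binned PSA labels as one example outcome (the random seed and exact predictor set are assumptions):

```python
# Sketch of the 70/30 split, using the binned PSA labels as one example outcome.
X = prostate.drop(columns=["logPSA"])  # predictors (outcome-specific choice assumed)
y = psa_labels                         # outcome: Low / Borderline / High Risk

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42  # random_state is an assumption
)
```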

3. Models

I am using three supervised learning algorithms: a decision tree ensemble (random forest), k-nearest neighbors, and a support vector machine with a polynomial kernel of degree three.

Each model will be run through a validation curve to optimize its respective hyperparameter.

Data passed to KNN will always be scaled beforehand.

In addition, each model will be run for all three training sets.
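A generic sketch of this tuning step, assuming scikit-learn's validation_curve; the estimator and parameter range are swapped in for each algorithm below:

```python
# Generic tuning pattern: score a model over a range of one hyperparameter.
def tune(model, param_name, param_range, X, y):
    train_scores, valid_scores = validation_curve(
        model, X, y,
        param_name=param_name, param_range=param_range, cv=5,
    )
    # Pick the value with the best mean cross-validated score.
    best = param_range[np.argmax(valid_scores.mean(axis=1))]
    return best
```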

4. Decision Tree (Random Forest)

i. Setup

A random forest of 30 decision trees will be evaluated at maximum depths from 2 to 10 to fit a final model.
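As a sketch, using the `tune` helper above (the 30 trees and the 2 to 10 depth range come from the text; the random seed is an assumption):

```python
# 30-tree random forest; tune maximum depth over 2..10.
rf = RandomForestClassifier(n_estimators=30, random_state=42)
best_depth = tune(rf, "max_depth", list(range(2, 11)), X_train, y_train)
rf_final = RandomForestClassifier(n_estimators=30, max_depth=best_depth,
                                  random_state=42).fit(X_train, y_train)
```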

ii. Gleason Score

A maximum depth of 2 will be used.

iii. Prostate-Specific Antigen

A maximum depth of 6 will be used.

iv. Seminal Vesicle Invasion

A maximum depth of 3 will be used.

5. K-Nearest Neighbors

i. Setup

Numbers of nearest neighbors $N$ from 1 to 44 will be evaluated to fit a final model.
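A sketch of the KNN setup, with scaling applied inside a pipeline as noted earlier (the pipeline structure and names are assumptions):

```python
from sklearn.pipeline import make_pipeline

# Scale features before KNN, then tune the number of neighbors over 1..44.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
best_k = tune(knn, "kneighborsclassifier__n_neighbors",
              list(range(1, 45)), X_train, y_train)
knn_final = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=best_k)
).fit(X_train, y_train)
```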

ii. Gleason Score

$N = 7$ will be used.

iii. Prostate-Specific Antigen

$N = 15$ will be used.

iv. Seminal Vesicle Invasion

$N = 3$ will be used.

6. Support Vector Machine (Polynomial of degree 3)

i. Setup

A support vector classifier with a polynomial kernel of degree three will be used. Hyperparameter $C$ values from $10^{-5}$ to $10^{5}$ will be evaluated to fit a final model.
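A sketch of the SVM setup (the resolution of the log-spaced $C$ grid is an assumption):

```python
# Degree-3 polynomial SVM; tune C over 10^-5 .. 10^5 on a log-spaced grid.
svm = SVC(kernel="poly", degree=3)
best_c = tune(svm, "C", np.logspace(-5, 5, 11), X_train, y_train)
svm_final = SVC(kernel="poly", degree=3, C=best_c).fit(X_train, y_train)
```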

ii. Gleason Score

$C = 10^{2}$ will be used.

iii. Prostate-Specific Antigen

$C = 10^{2}$ will be used.

iv. Seminal Vesicle Invasion

$C = 10^{2}$ will be used.

V. Results

1. Function
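The function itself is not reproduced here; a minimal sketch of what such an evaluation helper might do, assuming it predicts on the test set and reports a confusion matrix:

```python
# Hypothetical evaluation helper: predictions plus confusion matrix for a fitted model.
def evaluate(model, X_test, y_test):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    accuracy = (y_pred == y_test).mean()
    return cm, accuracy
```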

VI. Confusion Matrices and Predictions