Supervised Learning with Prostate Cancer

Final for Supervised Learning: Classification

Rohan Lewis

2021.07.13

Complete code can be found here. Click graphs and plots for full size.

I. Introduction

According to the American Cancer Society,

Other than skin cancer, prostate cancer is the most common cancer in American men. The American Cancer Society’s estimates for prostate cancer in the United States for 2021 are:

In addition:

II. Main Objectives

The objective of this study is to predict three outcomes: Gleason score, prostate-specific antigen (PSA) level, and seminal vesicle invasion. Here are some questions that will be answered.

III. Data

The data come from a classic statistical dataset of prostate cancer patients. The prostate data contain 97 observations and 9 metrics. They are as follows:

The original study can be found here.

Here are some Key Statistics for Prostate Cancer from the American Cancer Society.

IV. Exploratory Data Analysis

There were no missing values and several variables had already been log transformed.

1. Distributions and Histograms


Age is left-skewed.

logCV, logPSA, and logPW have been appropriately log transformed and are close to normal distribution.

logBPH, logCP, and Gleason2 all have over 40% of their observations in the lowest-value bin.

Gleason1 and SVI are both categorical and are unbalanced.

2. Correlations


Gleason1 and Gleason2 have the highest correlation at around 0.75.

logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.

No variables will be removed, as multicollinearity is not an issue.

V. Outcome Variables and Classification Algorithms

When set as the outcome variable,

The chart below summarizes the corresponding labels created:

| Prostate-Specific Antigen (ng/mL) | log(PSA) | Classification Label |
|-----------------------------------|-------------------|----------------------|
| $< 4$ | $< 1.3863$ | Low Risk |
| $4 - 10$ | $1.3863 - 2.3026$ | Borderline Risk |
| $> 10$ | $> 2.3026$ | High Risk |
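The mapping in the table above can be expressed as a small helper. The function name `psa_to_label` is hypothetical, and the use of the natural log is inferred from the thresholds shown ($\ln 4 \approx 1.3863$, $\ln 10 \approx 2.3026$):

```python
import numpy as np

def psa_to_label(psa_ng_ml):
    """Map a raw PSA value (ng/mL) to the risk label used in this study."""
    log_psa = np.log(psa_ng_ml)      # natural log, matching ln(4) ~ 1.3863
    if log_psa < np.log(4):          # log(PSA) < 1.3863
        return "Low Risk"
    elif log_psa <= np.log(10):      # 1.3863 - 2.3026
        return "Borderline Risk"
    return "High Risk"               # log(PSA) > 2.3026
```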

The data will be split into training (70%) and test (30%) sets. Then, for each of the 3 outcome variables, that variable will be set as the outcome and the appropriate remaining variables will be kept as predictors.
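A sketch of the 70/30 split, assuming scikit-learn's `train_test_split` with stratification; the arrays here are random placeholders standing in for the real 97-row prostate table:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the prostate data: X holds the 8 predictors,
# y the chosen outcome (here a 3-class label).
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = rng.integers(0, 3, size=97)

# 70/30 split; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
```

With 97 observations, this yields 67 training and 30 test rows, matching the training-set size mentioned in the conclusion.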

I am using three supervised learning algorithms:

Each model will be tuned with a validation curve to optimize its respective hyperparameter.

Data fed to KNN will always be scaled beforehand.

In addition, each model will be run for all three training sets.
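One way this tuning might look with scikit-learn's `validation_curve`, sketched on placeholder data with the 3-fold cross-validation used in this study (here tuning the forest's `max_depth`; the depth grid is an assumption):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.ensemble import RandomForestClassifier

# Placeholder training data standing in for one outcome's training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(67, 8))
y = rng.integers(0, 3, size=67)

depths = [1, 2, 3, 4, 5]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name="max_depth", param_range=depths, cv=3)

# Pick the depth with the best mean validation accuracy.
best = depths[int(np.argmax(val_scores.mean(axis=1)))]
```

Plotting the mean train and validation scores against the parameter grid produces the curves shown in the subsections below.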

1. Decision Tree (Random Forest)

The training and validation accuracy curves for each outcome variable are shown below, along with the maximum depth chosen from the curves.

a. Gleason Score

A maximum depth of 2 will be used.

b. Prostate Specific Antigen

A maximum depth of 2 will be used.

c. Seminal Vesicle Invasion

A maximum depth of 3 will be used.
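With the depths chosen above, fitting the three forests might look like the sketch below; the data and labels are random placeholders, not the real training sets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(67, 8))   # placeholder training predictors

# Depths from the validation curves: Gleason 2, PSA 2, SVI 3.
chosen_depths = {"gleason": 2, "psa": 2, "svi": 3}
models = {}
for name, depth in chosen_depths.items():
    y = rng.integers(0, 3, size=67)  # placeholder outcome labels
    models[name] = RandomForestClassifier(
        max_depth=depth, random_state=42).fit(X, y)
```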

2. K-Nearest Neighbors

The training and validation accuracy curves for each outcome variable are shown below, along with the number of neighbors chosen from the curves.

a. Gleason Score

$N = 7$ will be used.

b. Prostate Specific Antigen

$N = 15$ will be used.

c. Seminal Vesicle Invasion

$N = 3$ will be used.
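A sketch of the KNN model with the scaling requirement from above built into a pipeline, so the scaler is always fit on training data only; the data are placeholders, and `n_neighbors=3` matches the Seminal Vesicle Invasion choice:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(67, 8))
y = rng.integers(0, 2, size=67)  # e.g. the binary SVI outcome

# Scaling inside the pipeline guarantees KNN always sees scaled data.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X, y)
```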

3. Support Vector Machine (Polynomial of Degree 3)

The training and validation accuracy curves for each outcome variable are shown below, along with the optimized hyperparameter $C$ chosen from the curves.

a. Gleason Score

$C = 10^{2}$ will be used.

b. Prostate Specific Antigen

$C = 10^{2}$ will be used.

c. Seminal Vesicle Invasion

$C = 10^{2}$ will be used.
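A sketch of the degree-3 polynomial SVM with the tuned penalty $C = 10^{2}$, again on placeholder data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(67, 8))
y = rng.integers(0, 3, size=67)  # placeholder 3-class outcome

# Degree-3 polynomial kernel with the tuned penalty C = 10^2.
svm = SVC(kernel="poly", degree=3, C=100).fit(X, y)
```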

VI. Results

For each of the three outcomes, the second row shows the Accuracy, Precision, Recall, and F1 Score for the train and test sets from the three ML models. The third row shows the confusion matrices for the corresponding outcomes, train/test sets, and ML models.
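These metrics can be computed with scikit-learn. The labels below are placeholders for one model/outcome pair, and weighted averaging is one plausible choice for the multiclass outcomes (the report does not state which averaging was used):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Placeholder true and predicted labels for a 3-class outcome.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 1, 1])

acc  = accuracy_score(y_true, y_pred)                       # 6/8 = 0.75
prec = precision_score(y_true, y_pred, average="weighted")
rec  = recall_score(y_true, y_pred, average="weighted")
f1   = f1_score(y_true, y_pred, average="weighted")
cm   = confusion_matrix(y_true, y_pred)                     # 3x3 matrix
```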

(Results grid: Gleason Score, Prostate-Specific Antigen, Seminal Vesicle Invasion.)

VII. Final Model Selection

I chose Support Vector Machine for Gleason Score, Random Forest for Prostate-Specific Antigen, and K-Nearest Neighbors for Seminal Vesicle Invasion.

Gleason Score and Seminal Vesicle Invasion had very good accuracy. Prostate-Specific Antigen was reasonable, but not as high as the others.

VIII. Conclusion, Next Steps, and Shortcomings

The dataset contained only 97 observations and 8 predictors for each outcome, and only 67 observations were used to train the algorithms. More observations would provide more diversity and more predictive power. I am extremely curious whether ethnicity is an important predictor for these outcomes. I imagine diet, exercise, and smoking habits would also be interesting to look at.

Besides a few hours of reading some websites, I have very little subject knowledge of prostate health. I am certain there are many interactions, as well as causal relationships, that I am unaware of. Having this subject knowledge would be useful.

Several variables were already log transformed. Would keeping their original values be more useful in prediction?

After doing a 70-30 split, I was limited to a 3-fold cross validation, as Gleason Score had only a few observations with values of $8$ or $9$.

It is important to note I converted Prostate-Specific Antigen to a categorical scale of 3 labels. With a larger data set, perhaps a regression model could be utilized.

A larger dataset, as mentioned in the beginning of this section, would be useful in having more observations in the lesser filled bins of several predictors. This could ultimately pinpoint a better hyperparameter value and thus more appropriate models for prediction.