Complete code can be found here. Click graphs and plots for full size.
According to the American Cancer Society,
Other than skin cancer, prostate cancer is the most common cancer in American men. The American Cancer Society’s estimates for prostate cancer in the United States for 2021 are:
In addition:
The objective of this study is to predict three outcomes: Gleason Score, Prostate-Specific Antigen risk level, and Seminal Vesicle Invasion. Here are some questions that will be answered.
The data comes from a published statistical study. The prostate dataset contains 97 observations and 9 metrics, as follows:
The original study can be found here.
Here are some Key Statistics for Prostate Cancer from the American Cancer Society.
There were no missing values and several variables had already been log transformed.
Age is left-skewed.
logCV, logPSA, and logPW have been appropriately log transformed and are close to a normal distribution.
logBPH, logCP, and Gleason2 all have over 40% of their observations in the lowest-value bin.
Gleason1 and SVI are both categorical and are unbalanced.
Gleason1 and Gleason2 have the highest correlation at around 0.75.
logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.
No variables will be removed, as multicollinearity is not an issue.
When set as the outcome variable, Prostate-Specific Antigen will be converted from a continuous measurement into three risk categories. The chart below summarizes the corresponding labels:
| Prostate-Specific Antigen (ng/mL) | log(PSA) | Output Classification Label |
|---|---|---|
| $< 4$ | $< 1.3863$ | Low Risk |
| $4 - 10$ | $1.3863 - 2.3026$ | Borderline Risk |
| $> 10$ | $> 2.3026$ | High Risk |
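The conversion from log(PSA) to risk labels can be sketched as follows. This is a minimal illustration, assuming the data lives in a pandas Series (the values below are hypothetical, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical log(PSA) values standing in for the dataset's logPSA column.
lpsa = pd.Series([0.5, 1.5, 2.0, 2.5, 3.0])

# Thresholds from the table above: ln(4) ~= 1.3863, ln(10) ~= 2.3026.
bins = [-np.inf, np.log(4), np.log(10), np.inf]
labels = ["Low Risk", "Borderline Risk", "High Risk"]

# Bin each log(PSA) value into its risk category.
psa_label = pd.cut(lpsa, bins=bins, labels=labels)
```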
The data will be split into training (70%) and test (30%) sets. Then for each of the 3 outcome variables, the appropriate variables will be kept as predictors and set as the outcome.
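A 70/30 split on 97 observations can be sketched with scikit-learn. The data below is synthetic; stratifying on the outcome is an assumption added here so that rare classes appear in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the 97-row prostate data: 8 predictors, 1 binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = rng.integers(0, 2, size=97)

# 70/30 split; stratify keeps class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```

Note that with 97 observations this yields 67 training rows, matching the count mentioned in the discussion below.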
I am using three supervised learning algorithms:
Each model will be run in a validation curve simulation to optimize its respective parameter/hyperparameter.
Data fed into KNN will always be scaled beforehand.
In addition, each model will be run for all three training sets.
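A validation curve run can be sketched with scikit-learn's `validation_curve`, shown here for KNN with scaling built into a pipeline. The synthetic data, candidate neighbor values, and fold count are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set of similar size.
X, y = make_classification(n_samples=97, n_features=8, random_state=0)

# KNN always sees scaled data, so scaling lives inside the pipeline.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

neighbors = np.array([1, 3, 5, 7, 9, 11, 13, 15])
train_scores, valid_scores = validation_curve(
    model, X, y,
    param_name="kneighborsclassifier__n_neighbors",
    param_range=neighbors,
    cv=3,
)

# Pick the value with the best mean validation accuracy.
best_n = neighbors[valid_scores.mean(axis=1).argmax()]
```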
The training and validation accuracy curves for each outcome variable are shown below, along with the maximum depth chosen based on the curves.
A maximum depth of 2 will be used.
A maximum depth of 2 will be used.
A maximum depth of 3 will be used.
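Fitting the depth-limited model can be sketched as below. The data is synthetic and the choice of `RandomForestClassifier` reflects the model selection stated later in this post; other settings are scikit-learn defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 70% training split (67 rows, 8 predictors).
X_train, y_train = make_classification(n_samples=67, n_features=8, random_state=0)

# max_depth read off the validation curves: 2, 2, and 3 for the three
# outcomes respectively; shown here with max_depth=2.
rf = RandomForestClassifier(max_depth=2, random_state=42)
rf.fit(X_train, y_train)
```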
The training and validation accuracy curves for each outcome variable are shown below, along with the number of neighbors chosen based on the curves.
$N = 7$ will be used.
$N = 15$ will be used.
$N = 3$ will be used.
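The tuned KNN model, with the scaling step noted earlier baked into a pipeline, might look like this sketch (synthetic data; $N = 7$ shown as a representative choice):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 70% training split.
X_train, y_train = make_classification(n_samples=67, n_features=8, random_state=0)

# n_neighbors from the curves: 7, 15, and 3 for the three outcomes;
# scaling always precedes KNN, so it is part of the pipeline.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
knn.fit(X_train, y_train)
```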
The training and validation accuracy curves for each outcome variable are shown below, along with the optimized hyperparameter $C$ chosen based on the curves.
$C = 10^{2}$ will be used.
$C = 10^{2}$ will be used.
$C = 10^{2}$ will be used.
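The SVM with the selected $C$ can be sketched as follows. The data is synthetic, and the kernel is left at scikit-learn's default since the text does not specify one:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the 70% training split.
X_train, y_train = make_classification(n_samples=67, n_features=8, random_state=0)

# C = 10**2 was selected for all three outcomes.
svm = SVC(C=100)
svm.fit(X_train, y_train)
```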
For each of the three outcomes, the second row has the Accuracy, Precision, Recall, and F1 Score for the train and test sets from the three ML models. The third row has the confusion matrices for the corresponding outcomes, train/test sets, and ML models.
| Gleason Score | Prostate-Specific Antigen | Seminal Vesicle Invasion |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
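The metrics in these tables can be computed with scikit-learn; a minimal sketch on hypothetical binary predictions (for the 3-class PSA outcome, the precision/recall/F1 calls would also need an `average=` argument, whose value is not stated in the text):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions for one outcome/model pair.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
```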
I chose Support Vector Machine for Gleason Score, Random Forest for Prostate-Specific Antigen, and K-Nearest Neighbors for Seminal Vesicle Invasion.
Gleason Score and Seminal Vesicle Invasion had very good accuracy. Prostate-Specific Antigen accuracy was reasonable, but not as high as the others.
The dataset was only 97 observations with 8 predictors for each outcome, and only 67 observations were used to train the algorithms. More observations would provide more diversity and more predictive power. I am extremely curious whether ethnicity is an important predictor for these outcomes. I imagine diet, exercise, and smoking habits would also be interesting to look at.
Besides a few hours of reading some websites, I have very little subject knowledge of prostate health. I am certain there are many interactions, as well as causal relationships, that I am unaware of. Having this subject knowledge would be useful.
Several variables were already log transformed. Would keeping their original values be more useful in prediction?
After doing a 70-30 split, I was limited to 3-fold cross-validation, as Gleason Score had only a few observations with values of $8$ or $9$.
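The fold-count constraint can be illustrated with a stratified split: with only three observations of the rarest class, three folds is the most that still puts that class in every fold. The class counts below are invented for illustration, not taken from the dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy outcome mimicking a Gleason-like variable with rare high grades.
y = np.array([6] * 30 + [7] * 30 + [8] * 4 + [9] * 3)
X = np.zeros((len(y), 1))

# With only 3 observations of class 9, 3 folds is the largest n_splits
# that still places one in every validation fold.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
n_folds = 0
for train_idx, valid_idx in skf.split(X, y):
    n_folds += 1
    # Every validation fold contains at least one observation of class 9.
    assert (y[valid_idx] == 9).any()
```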
It is important to note I converted Prostate-Specific Antigen to a categorical scale of 3 labels. With a larger data set, perhaps a regression model could be utilized.
A larger dataset, as mentioned at the beginning of this section, would help fill the sparsely populated bins of several predictors. This could ultimately pinpoint better hyperparameter values and thus more appropriate models for prediction.