Complete code can be found here. Click graphs and plots for full size.
According to the American Cancer Society,
Other than skin cancer, prostate cancer is the most common cancer in American men. The American Cancer Society’s estimates for prostate cancer in the United States for 2021 are:
In addition:
The objective of this study is to predict three outcomes: Gleason Score, Prostate-Specific Antigen risk level, and Seminal Vesicle Invasion. Here are some questions that will be answered.
The data comes from a published statistical study. The prostate dataset contains 97 observations and 9 metrics, as follows:
The original study can be found here.
Here are some Key Statistics for Prostate Cancer from the American Cancer Society.
There were no missing values and several variables had already been log transformed.
Age is left-skewed.
logCV, logPSA, and logPW have been appropriately log transformed and are close to a normal distribution.
logBPH, logCP, and Gleason2 all have over 40% of their observations in the lowest-value bin.
Gleason1 and SVI are both categorical and are unbalanced.
Gleason1 and Gleason2 have the highest correlation at around 0.75.
logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.
No variables will be removed, as multicollinearity is not an issue.
When set as the outcome variable, Prostate-Specific Antigen will be converted from a continuous measurement into three risk categories. The chart below summarizes the corresponding labels:
| Prostate-Specific Antigen (ng/mL) | log(PSA) | Output Classification Label |
|---|---|---|
| $< 4$ | $< 1.3863$ | Low Risk |
| $4 - 10$ | $1.3863 - 2.3026$ | Borderline Risk |
| $> 10$ | $> 2.3026$ | High Risk |
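The conversion from log(PSA) to risk labels can be sketched as follows. This is a minimal illustration, assuming the data lives in a pandas Series (the values below are hypothetical, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical log(PSA) values standing in for the dataset's logPSA column.
lpsa = pd.Series([0.5, 1.5, 2.0, 2.5, 3.0])

# Thresholds from the table above: ln(4) ~= 1.3863, ln(10) ~= 2.3026.
bins = [-np.inf, np.log(4), np.log(10), np.inf]
labels = ["Low Risk", "Borderline Risk", "High Risk"]

# Bin each log(PSA) value into its risk category.
psa_label = pd.cut(lpsa, bins=bins, labels=labels)
```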
The data will be split into training (70%) and test (30%) sets. Then for each of the 3 outcome variables, the appropriate variables will be kept as predictors and set as the outcome.
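A 70/30 split on 97 observations can be sketched with scikit-learn. The data below is synthetic; stratifying on the outcome is an assumption added here so that rare classes appear in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the 97-row prostate data: 8 predictors, 1 binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(97, 8))
y = rng.integers(0, 2, size=97)

# 70/30 split; stratify keeps class proportions similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```

Note that with 97 observations this yields 67 training rows, matching the count mentioned in the discussion below.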
I am using three supervised learning algorithms:
Each model will be run in a validation curve simulation to optimize its respective parameter/hyperparameter.
Data fed into KNN will always be scaled beforehand.
In addition, each model will be run for all three training sets.
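A validation curve run can be sketched with scikit-learn's `validation_curve`, shown here for KNN with scaling built into a pipeline. The synthetic data, candidate neighbor values, and fold count are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a training set of similar size.
X, y = make_classification(n_samples=97, n_features=8, random_state=0)

# KNN always sees scaled data, so scaling lives inside the pipeline.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

neighbors = np.array([1, 3, 5, 7, 9, 11, 13, 15])
train_scores, valid_scores = validation_curve(
    model, X, y,
    param_name="kneighborsclassifier__n_neighbors",
    param_range=neighbors,
    cv=3,
)

# Pick the value with the best mean validation accuracy.
best_n = neighbors[valid_scores.mean(axis=1).argmax()]
```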
The training and validation accuracy curves for each outcome variable are shown below, along with the maximum depth chosen based on the curves.
A maximum depth of 2 will be used.
A maximum depth of 2 will be used.
A maximum depth of 3 will be used.
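Fitting the depth-limited model can be sketched as below. The data is synthetic and the choice of `RandomForestClassifier` reflects the model selection stated later in this post; other settings are scikit-learn defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 70% training split (67 rows, 8 predictors).
X_train, y_train = make_classification(n_samples=67, n_features=8, random_state=0)

# max_depth read off the validation curves: 2, 2, and 3 for the three
# outcomes respectively; shown here with max_depth=2.
rf = RandomForestClassifier(max_depth=2, random_state=42)
rf.fit(X_train, y_train)
```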
The training and validation accuracy curves for each outcome variable are shown below, along with the number of neighbors chosen based on the curves.
$N = 7$ will be used.
$N = 15$ will be used.
$N = 3$ will be used.
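The tuned KNN model, with the scaling step noted earlier baked into a pipeline, might look like this sketch (synthetic data; $N = 7$ shown as a representative choice):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 70% training split.
X_train, y_train = make_classification(n_samples=67, n_features=8, random_state=0)

# n_neighbors from the curves: 7, 15, and 3 for the three outcomes;
# scaling always precedes KNN, so it is part of the pipeline.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7))
knn.fit(X_train, y_train)
```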
The training and validation accuracy curves for each outcome variable are shown below, along with the optimized hyperparameter $C$ chosen based on the curves.
$C = 10^{2}$ will be used.
$C = 10^{2}$ will be used.
$C = 10^{2}$ will be used.
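The SVM with the selected $C$ can be sketched as follows. The data is synthetic, and the kernel is left at scikit-learn's default since the text does not specify one:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the 70% training split.
X_train, y_train = make_classification(n_samples=67, n_features=8, random_state=0)

# C = 10**2 was selected for all three outcomes.
svm = SVC(C=100)
svm.fit(X_train, y_train)
```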
For each of the three outcomes, the second row has the Accuracy, Precision, Recall, and F1 Score for the train and test sets from the three ML models. The third row has the confusion matrices for the corresponding outcomes, train/test sets, and ML models.
| Gleason Score | Prostate-Specific Antigen | Seminal Vesicle Invasion |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
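The metrics in these tables can be computed with scikit-learn; a minimal sketch on hypothetical binary predictions (for the 3-class PSA outcome, the precision/recall/F1 calls would also need an `average=` argument, whose value is not stated in the text):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and predictions for one outcome/model pair.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
```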
I chose Support Vector Machine for Gleason Score, Random Forest for Prostate-Specific Antigen, and K-Nearest Neighbors for Seminal Vesicle Invasion.
Gleason Score and Seminal Vesicle Invasion had very good accuracy. Prostate-Specific Antigen accuracy was reasonable, but not as high as the others.
The dataset was only 97 observations with 8 predictors for each outcome, and only 67 observations were used to train the algorithms. More observations would provide more diversity and more predictive power. I am extremely curious whether ethnicity is an important predictor for these outcomes. I imagine diet, exercise, and smoking habits would also be interesting to look at.
Besides a few hours of reading some websites, I have very little subject knowledge of prostate health. I am certain there are many interactions, as well as causal relationships, that I am unaware of. Having this subject knowledge would be useful.
Several variables were already log transformed. Would keeping their original values be more useful in prediction?
After doing a 70-30 split, I was limited to 3-fold cross-validation, as Gleason Score had only a few observations with values of $8$ or $9$.
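The fold-count constraint can be illustrated with a stratified split: with only three observations of the rarest class, three folds is the most that still puts that class in every fold. The class counts below are invented for illustration, not taken from the dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy outcome mimicking a Gleason-like variable with rare high grades.
y = np.array([6] * 30 + [7] * 30 + [8] * 4 + [9] * 3)
X = np.zeros((len(y), 1))

# With only 3 observations of class 9, 3 folds is the largest n_splits
# that still places one in every validation fold.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
n_folds = 0
for train_idx, valid_idx in skf.split(X, y):
    n_folds += 1
    # Every validation fold contains at least one observation of class 9.
    assert (y[valid_idx] == 9).any()
```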
It is important to note I converted Prostate-Specific Antigen to a categorical scale of 3 labels. With a larger data set, perhaps a regression model could be utilized.
A larger dataset, as mentioned at the beginning of this section, would help fill the sparsely populated bins of several predictors. This could ultimately pinpoint better hyperparameter values and thus more appropriate models for prediction.