Complete code can be found here. Click graphs and plots for full size.
According to the American Cancer Society:
Other than skin cancer, prostate cancer is the most common cancer in American men. The American Cancer Society’s estimates for prostate cancer in the United States for 2021 are:
In addition:
The objective of this study is to predict three prostate cancer outcomes: Gleason score, prostate-specific antigen (PSA) risk level, and seminal vesicle invasion. Here are some questions that will be answered.
The data come from a published clinical study of prostate cancer patients. The prostate dataset contains 97 observations and 9 metrics. They are as follows:
The original study can be found here.
Here are some Key Statistics for Prostate Cancer from the American Cancer Society.
There were no missing values and several variables had already been log transformed.
`Age` is left skewed. `logCV`, `logPSA`, and `logPW` have been appropriately log transformed and are close to normally distributed. `logBPH`, `logCP`, and `Gleason2` all have over 40% of their observations in the lowest-value bin. `Gleason1` and `SVI` are both categorical and are unbalanced.
`Gleason1` and `Gleason2` have the highest correlation, at around 0.75. `logCV` and `logPSA`, `logCP` and `logCV`, and `logCP` and `SVI` also have high correlations.
No variables will be removed, as multicollinearity is not an issue.
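As an illustration of this correlation check, here is a minimal pandas sketch. It assumes the data have already been loaded into a DataFrame named `df` with the variable names used above; that name and the 0.7 cutoff are assumptions for this example, not details from the original analysis.

```python
import numpy as np

# Minimal sketch of the correlation check. Assumes the prostate data are
# already loaded into a DataFrame `df` with the columns described above
# (Age, logCV, logPSA, logPW, logBPH, logCP, Gleason1, Gleason2, SVI).
corr = df.corr()

# Keep only the upper triangle (each pair once, no diagonal) and list the
# pairs whose absolute correlation exceeds an illustrative 0.7 cutoff,
# e.g. Gleason1/Gleason2 at roughly 0.75.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.7].sort_values(ascending=False))
```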
When set as the outcome variable, Prostate-Specific Antigen (`logPSA`) will be converted into a categorical variable with three risk labels. The chart below summarizes the corresponding labels created:
Prostate-Specific Antigen (ng/mL) | log(PSA) | Output Classification Label |
---|---|---|
$< 4$ | $< 1.3863$ | Low Risk |
$4 - 10$ | $1.3863 - 2.3026$ | Borderline Risk |
$> 10$ | $> 2.3026$ | High Risk |
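Below is a minimal sketch of this binning in pandas, continuing with the `df` DataFrame assumed above. The column names `logPSA` and `PSA_label` are illustrative assumptions; the names in the project's actual code may differ.

```python
import numpy as np
import pandas as pd

# Sketch of the PSA risk labels above; `logPSA` / `PSA_label` are assumed names.
# log(4) ≈ 1.3863 and log(10) ≈ 2.3026 are the cut points on the log scale.
bins = [-np.inf, np.log(4), np.log(10), np.inf]
labels = ["Low Risk", "Borderline Risk", "High Risk"]
df["PSA_label"] = pd.cut(df["logPSA"], bins=bins, labels=labels)
print(df["PSA_label"].value_counts())
```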
The data will be split into training (70%) and test (30%) sets. Then, for each of the three outcome variables, that variable will be set as the outcome and the remaining variables kept as predictors.
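A sketch of the split for one of the outcomes (the PSA risk label from the binning sketch above) is shown below; the same pattern repeats with Gleason score and SVI as the outcome. The stratified split and the `random_state` value are assumptions added here for reproducibility, not details from the original write-up.

```python
from sklearn.model_selection import train_test_split

# 70/30 split with the PSA risk label as the outcome; the remaining
# variables serve as predictors. Stratification and random_state are
# assumptions, not stated in the original analysis.
X = df.drop(columns=["logPSA", "PSA_label"])
y = df["PSA_label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)   # roughly 67 train and 30 test observations
```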
I am using three supervised learning algorithms: Random Forest, K-Nearest Neighbors, and Support Vector Machine.
Each model will be run through a validation-curve simulation to optimize its respective parameter or hyperparameter.
Data fed to KNN will always be scaled beforehand.
In addition, each model will be run for all three training sets.
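Here is a minimal sketch of the validation-curve tuning, using the KNN model as the example since it also shows the scaling step; the Random Forest (`max_depth`) and SVM (`C`) searches follow the same pattern with their own parameter grids. The search range shown is an assumption for illustration, and the training data are those from the split sketch above.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Validation curve for KNN over the number of neighbors. The scaler sits
# inside the pipeline, so data reaching KNN is always scaled first.
knn = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
neighbors = np.arange(1, 21)   # illustrative search range

train_scores, valid_scores = validation_curve(
    knn, X_train, y_train,
    param_name="knn__n_neighbors", param_range=neighbors,
    cv=3, scoring="accuracy",
)
best_n = neighbors[valid_scores.mean(axis=1).argmax()]
print("Best number of neighbors:", best_n)
```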
The training and validation accuracy curves for each outcome variable are shown below, along with the maximum depth chosen from the curves.
A maximum depth of 2 will be used.
A maximum depth of 2 will be used.
A maximum depth of 3 will be used.
The training and validation accuracy curves for each outcome variable are shown below, along with the number of neighbors chosen from the curves.
$N = 7$ will be used.
$N = 15$ will be used.
$N = 3$ will be used.
The training and validation accuracy curves for each outcome variable are shown below, along with the optimized hyperparameter $C$ chosen from the curves.
$C = 10^{2}$ will be used.
$C = 10^{2}$ will be used.
$C = 10^{2}$ will be used.
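As an illustration, the tuned models for a single outcome could be assembled as below. The hyperparameter values shown are examples drawn from the choices listed above, and the `random_state` is an assumption added for repeatability.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The three tuned models for one outcome, using example values from the
# curves above (max_depth=2, n_neighbors=7, C=100); the other outcomes use
# their own values as listed. random_state is an assumption.
models = {
    "Random Forest": RandomForestClassifier(max_depth=2, random_state=42),
    "KNN": Pipeline([("scale", StandardScaler()),
                     ("knn", KNeighborsClassifier(n_neighbors=7))]),
    "SVM": SVC(C=100),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "train acc:", model.score(X_train, y_train),
          "test acc:", model.score(X_test, y_test))
```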
For each of the three outcomes, the second row shows the Accuracy, Precision, Recall, and F1 Score for the train and test sets from the three ML models. The third row shows the confusion matrices for the corresponding outcomes, train/test sets, and ML models.
Gleason Score | Prostate-Specific Antigen | Seminal Vesicle Invasion |
---|---|---|
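Continuing the sketch above (so `models`, `X_test`, and `y_test` are assumed to exist), the reported metrics for one model on one outcome could be computed as follows. The weighted averaging for the multiclass precision, recall, and F1 is an assumption, since the write-up does not state which averaging was used.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Test-set metrics for one fitted model; repeat for the train set and for
# every model/outcome combination. "weighted" averaging is an assumption.
y_pred = models["SVM"].predict(X_test)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="weighted"
)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision)
print("Recall   :", recall)
print("F1 Score :", f1)
print(confusion_matrix(y_test, y_pred))
```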
I chose Support Vector Machine for Gleason Score, Random Forest for Prostate-Specific Antigen, and K-Nearest Neighbors for Seminal Vesicle Invasion.
Gleason Score and Seminal Vesicle Invasion had very good accuracy. Prostate-Specific Antigen was reasonable, but not as high as the others.
The dataset had only 97 observations and 8 predictors for each outcome, and only 67 of those observations were used to train the algorithms. More observations would provide more diversity and more predictive power. I am extremely curious whether ethnicity is an important predictor for these outcomes. I imagine diet, exercise, and smoking habits would also be interesting to look at.
Besides a few hours of reading some websites, I have very little subject knowledge of prostate health. I am certain there are many interactions, as well as causal relationships, that I am unaware of. Having this subject knowledge would be useful.
Several variables were already log transformed. Would keeping their original values be more useful in prediction?
After doing a 70-30 split, I was limited to 3-fold cross validation, as Gleason Score had only a few observations with values of $8$ or $9$.
It is important to note that I converted Prostate-Specific Antigen to a categorical scale of 3 labels. With a larger data set, perhaps a regression model could be utilized.
A larger dataset, as mentioned at the beginning of this section, would be useful for adding observations to the less-populated bins of several predictors. This could ultimately pinpoint better hyperparameter values and thus more appropriate models for prediction.