Complete code can be found here. Click graphs and plots for full size.
According to the American Cancer Society,
Other than skin cancer, prostate cancer is the most common cancer in American men. The American Cancer Society’s estimates for prostate cancer in the United States for 2021 are:
In addition:
The objective of this study is to cluster groups of men based on metrics related to prostate and reproductive health.
Similarities within clusters and differences between clusters will be discussed.
The data was chosen from a published statistical study. The prostate data contains 97 observations and 9 metrics. They are as follows:
The original study can be found here.
Here are some Key Statistics for Prostate Cancer from the American Cancer Society.
Clustering groups of men together that have similar characteristics, differentiating them from other clusters, may prove to be useful in various treatments in various stages of illnesses related to men's health.
There were no missing values and several variables had already been log transformed.
Age is left skewed. logCV, logPSA, and logPW have been appropriately log transformed and are close to a normal distribution. logBPH, logCP, and Gleason2 all have over 40% of their observations in the lowest-value bin. Gleason1 and SVI are both categorical and unbalanced.
Gleason1 and Gleason2 have the highest correlation, at around 0.75. logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.
No variables will be removed, as multicollinearity is not an issue.
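The correlation check above can be sketched as follows. This is a minimal example on synthetic stand-in data; the column names mirror the variables discussed here and are assumptions, not the study's exact headers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 97-row prostate data; column names are
# assumed from the variables discussed above, not the study's headers.
rng = np.random.default_rng(0)
cols = ["Age", "logCV", "logPSA", "logPW", "logBPH", "logCP",
        "Gleason1", "Gleason2", "SVI"]
df = pd.DataFrame(rng.normal(size=(97, 9)), columns=cols)

# Pairwise Pearson correlations; pairs above |0.7| (e.g. Gleason1 and
# Gleason2 in the real data) are the ones worth flagging.
corr = df[cols].corr()
high = [(a, b, round(corr.loc[a, b], 2))
        for i, a in enumerate(cols) for b in cols[i + 1:]
        if abs(corr.loc[a, b]) > 0.7]
```

On this random stand-in `high` will typically be empty; on the real data it would surface the Gleason1/Gleason2 pair noted above.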
The data will be split into training (70%) and validation (30%) sets.
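A 70/30 split like the one described can be done with scikit-learn's `train_test_split`. The data frame here is a synthetic stand-in with assumed column names; the `random_state` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 97-row dataset (assumed column names).
rng = np.random.default_rng(0)
cols = ["Age", "logCV", "logPSA", "logPW", "logBPH", "logCP",
        "Gleason1", "Gleason2", "SVI"]
df = pd.DataFrame(rng.normal(size=(97, 9)), columns=cols)

# 70% training / 30% validation split.
train, valid = train_test_split(df, test_size=0.30, random_state=42)
print(len(train), len(valid))  # 67 30
```

With 97 rows this yields 67 training and 30 validation observations, matching the counts reported later in the writeup.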
I am using three unsupervised learning algorithms: K-Means, Hierarchical Agglomerative Clustering, and MeanShift.
For each of the three models, the cluster results of the train set will be represented as the color in a grid of pair-plots for all nine variables.
The distortion and inertia curves imply n = 3 is an appropriate number of clusters.
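The inertia side of that elbow check can be computed directly from K-Means, as sketched below on synthetic stand-in data (distortion is the same idea, averaged per point):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the 67-row training set.
rng = np.random.default_rng(1)
X = rng.normal(size=(67, 9))

# Inertia (within-cluster sum of squares) for k = 1..8; the "elbow"
# where the curve flattens suggests the cluster count (n = 3 here).
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(1, 9)]
```

Plotting `inertias` against k and looking for the bend is the usual elbow-method reading.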
Pair-plots are shown below.
Cluster 2 has noticeable, tight peaks for logBPH, logPW, Gleason2, Gleason1, and SVI. Most of the variables have much overlap among the clusters.
Below is a graphic of how the observations in the training set clustered together.
The colors of the above dendrogram match the clusters in the pair-plots shown below.
Again, most of the variables have much overlap among the clusters.
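The dendrogram-then-cluster step can be sketched with scipy and scikit-learn. Ward linkage is an assumption here; the writeup does not state which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the 67-row training set.
rng = np.random.default_rng(2)
X = rng.normal(size=(67, 9))

# Linkage matrix; scipy.cluster.hierarchy.dendrogram(Z) draws the tree
# from which the cluster count is read off.
Z = linkage(X, method="ward")

# Cut the tree into the 3 clusters chosen from the dendrogram.
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
```

The resulting `labels` are what gets mapped to colors in the pair-plot grid.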
Pair-plots are shown below.
MeanShift resulted in only two clusters. Again, most of the variables have much overlap among the clusters.
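Unlike the other two models, MeanShift picks the number of clusters itself from a bandwidth parameter, which is why it can land on two. A minimal sketch on synthetic stand-in data, using the common `estimate_bandwidth` starting point (the study's actual bandwidth setting is not stated):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in for the 67-row training set.
rng = np.random.default_rng(3)
X = rng.normal(size=(67, 9))

# Bandwidth controls kernel width, and hence how many modes (clusters)
# MeanShift finds; on the real data the result was two clusters.
bw = estimate_bandwidth(X, quantile=0.3)
ms = MeanShift(bandwidth=bw).fit(X)
n_clusters = len(np.unique(ms.labels_))
```

A larger bandwidth merges modes (fewer clusters); a smaller one splits them.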
For all three models, the clusters seem to be most noticeably correlated with Gleason2. This is clear not only from the scatter plots of variable pairs, but also from the distinct separations in the distribution plot of Gleason2 along the diagonal.
Although there were a few peaks for the other variables, the overall distributions were generally highly overlapping. The scatter plots also had no emerging patterns.
I chose Hierarchical Agglomerative Clustering as my final model because I could set the number of clusters from the dendrogram and because it is recommended for unevenly sized clusters.
The validation pair-plots are shown below.
Again, the clusters are noticeably correlated with Gleason2. Perhaps Gleason2 is a good 'average' metric of the rest, which would explain why it accounts for more of the variance of the other variables among the clusters.
The dataset contained only 97 observations and 9 variables, and only 67 observations were used to fit the clustering algorithms. More observations would bring more diversity, and perhaps more distinct clusters would emerge.
Besides a few hours of reading some websites, I have very little subject knowledge of prostate health. I am certain there are many interactions, and perhaps other variables I am unaware of, that would be useful in clustering.
Several variables were already log transformed. Would keeping their original values be more useful in clustering?
I did not fully test different affinities, linkages, or cluster numbers for the Hierarchical Agglomerative Clustering Algorithm. It is possible that some other combination of these parameters/hyperparameters would have more accurate clusters for diagnosis and treatment. A more in-depth look at the algorithm with subject knowledge would be very useful for better clustering.
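A fuller sweep of those hyperparameters could look like the hypothetical sketch below, which scores each linkage/cluster-count combination by silhouette on synthetic stand-in data. Silhouette is only a proxy here, since no ground-truth diagnosis labels exist; the distance metric (`affinity`, renamed `metric` in newer scikit-learn versions) could be swept the same way for the non-Ward linkages.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 67-row training set.
rng = np.random.default_rng(5)
X = rng.normal(size=(67, 9))

# Hypothetical sweep over linkage and cluster count, keeping the
# combination with the best silhouette score.
best = None
for link in ["ward", "complete", "average", "single"]:
    for k in (2, 3, 4):
        labels = AgglomerativeClustering(n_clusters=k,
                                         linkage=link).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, link, k)
```

With subject knowledge, a clinically meaningful validity measure would be a better selection criterion than silhouette alone.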