Clustering with Prostate Cancer

Final for Unsupervised Learning

Rohan Lewis

2021.03.24

Complete code can be found here.

I. Introduction

According to the American Cancer Society,

Other than skin cancer, prostate cancer is the most common cancer in American men. The American Cancer Society's estimates for prostate cancer in the United States for 2021 are:

- About 248,530 new cases of prostate cancer
- About 34,130 deaths from prostate cancer

In addition:

- About 1 in 8 men will be diagnosed with prostate cancer during his lifetime.
- Prostate cancer is the second leading cause of cancer death in American men, behind only lung cancer.

II. Main Objectives

The objective of this study is to cluster groups of men based on metrics related to prostate and reproductive health.

Similarities within clusters and differences between clusters will be discussed.

III. Data

The data chosen come from a classic dataset in the statistics literature (the Stamey prostate data). The prostate data contain 97 observations and 9 metrics. They are as follows:

- Age: age of the patient in years
- logCV: log of cancer volume
- logPW: log of prostate weight
- logBPH: log of the amount of benign prostatic hyperplasia
- logCP: log of capsular penetration
- logPSA: log of prostate-specific antigen (PSA) level
- Gleason1: Gleason score
- Gleason2: percent of Gleason grades 4 or 5
- SVI: seminal vesicle invasion

The original study can be found here.

Here are some Key Statistics for Prostate Cancer from the American Cancer Society.

Clustering men with similar characteristics into groups, differentiated from other clusters, may prove useful in tailoring treatment at various stages of illnesses related to men's health.

IV. Exploratory Data Analysis

There were no missing values and several variables had already been log transformed.
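A quick check along these lines confirms the absence of missing values. This is a minimal sketch: prostate.csv is a hypothetical file name, and the DataFrame name prostate is an assumption reused in the sketches below.

```python
import pandas as pd

# Load the prostate data; "prostate.csv" is a hypothetical file name.
prostate = pd.read_csv("prostate.csv")

# Count missing values per column -- every count should be zero.
print(prostate.isna().sum())
```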

1. Distributions and Histograms


Age is left-skewed.

logCV, logPSA, and logPW have already been appropriately log-transformed and are approximately normally distributed.

logBPH, logCP, and Gleason2 each have over 40% of their observations in the lowest-value bin.

Gleason1 and SVI are both categorical and unbalanced.
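The histograms behind these observations can be generated roughly as follows, assuming the prostate DataFrame from the sketch above.

```python
import matplotlib.pyplot as plt

# One histogram per variable, arranged in a 3x3 grid.
fig, axes = plt.subplots(3, 3, figsize=(12, 10))
for ax, col in zip(axes.ravel(), prostate.columns):
    ax.hist(prostate[col], bins=20)
    ax.set_title(col)
fig.tight_layout()
plt.show()
```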

2. Correlations


Gleason1 and Gleason2 have the highest correlation at around 0.75.

logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.

No variables will be removed; unlike in regression, multicollinearity is not an issue for clustering.
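A correlation heatmap of this kind can be produced with seaborn; this is a sketch, again assuming the prostate DataFrame.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations among the nine variables.
corr = prostate.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```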

V. Clustering Models

The data will be split into training (70%) and validation (30%) sets.
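A sketch of the split; the random_state value is an assumption, chosen only for reproducibility.

```python
from sklearn.model_selection import train_test_split

# 70% training / 30% validation: 97 observations become 67 and 30.
train, valid = train_test_split(prostate, test_size=0.30, random_state=42)
print(len(train), len(valid))
```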

I am using three unsupervised learning algorithms:

- KMeans Clustering
- Hierarchical Agglomerative Clustering
- MeanShift Clustering

For each of the three models, the cluster assignments on the training set will be represented as colors in a grid of pair-plots for all nine variables.

1. KMeans Clustering

The distortion and inertia imply that n = 3 is an appropriate number of clusters.
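The elbow can be located with a loop along these lines; the range of k and the random_state are assumptions.

```python
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Inertia (within-cluster sum of squares) and distortion (mean distance
# to the nearest centroid) for k = 1..9; the elbow appears at k = 3.
X = train.to_numpy()
for k in range(1, 10):
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    distortion = cdist(X, km.cluster_centers_).min(axis=1).mean()
    print(k, round(km.inertia_, 1), round(distortion, 3))
```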



Pair-plots are shown below.
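Cluster-colored pair-plots of this kind can be drawn with seaborn; a sketch, refitting KMeans with the chosen three clusters.

```python
import seaborn as sns
from sklearn.cluster import KMeans

# Fit the 3-cluster model and color every pair-plot by cluster label.
km = KMeans(n_clusters=3, random_state=42).fit(train)
sns.pairplot(train.assign(cluster=km.labels_), hue="cluster")
```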



Cluster 2 has noticeable, tight peaks for logBPH, logPW, Gleason2, Gleason1, and SVI. Most of the variables have much overlap among the clusters.

2. Hierarchical Agglomerative Clustering

Below is a graphic of how the observations in the training set clustered together.
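The dendrogram can be built with scipy; this sketch assumes Ward linkage, which is also scikit-learn's default for agglomerative clustering.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Ward linkage on the training set; the dendrogram shows where to cut
# the hierarchy to obtain the clusters below.
Z = linkage(train, method="ward")
dendrogram(Z)
plt.show()
```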



The colors of the above dendrogram match the clusters in the pair-plots shown below.



Again, most of the variables have much overlap among the clusters.

3. MeanShift Clustering

Pair-plots are shown below.



MeanShift resulted in only two clusters. Again, most of the variables have much overlap among the clusters.
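A sketch of the MeanShift fit; estimating the bandwidth from the data with the default quantile is an assumption.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# MeanShift picks the number of clusters itself via the bandwidth;
# here it settles on two.
bw = estimate_bandwidth(train.to_numpy(), quantile=0.3)
ms = MeanShift(bandwidth=bw).fit(train.to_numpy())
print(len(np.unique(ms.labels_)))
```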

VI. Key Findings

For all three models, the clusters seem to be most noticeably correlated with Gleason2. This is clear not only from the scatter plots of variable pairs, but also from the distinct separations in the distribution plot of Gleason2 along the diagonal.

Although there were a few peaks for the other variables, the overall distributions were generally highly overlapping, and the scatter plots showed no emerging patterns.

VII. Final Model Selection

I chose Hierarchical Agglomerative Clustering as my final model, because the number of clusters can be chosen directly from the dendrogram and because the method is recommended for uneven cluster sizes.
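A sketch of the final model: three clusters with Ward linkage is an assumption matching the dendrogram cut above. AgglomerativeClustering has no predict() method, so the validation set is clustered with fit_predict directly.

```python
from sklearn.cluster import AgglomerativeClustering

# Cut the hierarchy into three clusters using Ward linkage.
hac = AgglomerativeClustering(n_clusters=3, linkage="ward")
valid_labels = hac.fit_predict(valid)
```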

The validation pair-plots are shown below.



Again, the clusters are noticeably correlated with Gleason2. Perhaps Gleason2 serves as a good 'average' of the other metrics, which would explain why it separates the clusters more clearly than the other variables do.

VIII. Conclusion, Next Steps, and Shortcomings

The dataset was only 97 observations and 9 variables. Only 67 observations were used in the clustering algorithms. More observations would provide more diversity, and perhaps more distinct clusters would emerge.

Besides a few hours of reading some websites, I have very little subject knowledge of prostate health. I am certain there are many interactions, and perhaps other variables I am unaware of, that would be useful in clustering.

Several variables were already log transformed. Would keeping their original values be more useful in clustering?

I did not fully test different affinities, linkages, or cluster numbers for the Hierarchical Agglomerative Clustering Algorithm. It is possible that some other combination of these parameters/hyperparameters would have more accurate clusters for diagnosis and treatment. A more in-depth look at the algorithm with subject knowledge would be very useful for better clustering.