Clustering with Prostate Cancer

Final for Unsupervised Learning

Rohan Lewis

2021.03.24

I. Introduction

Prostate health is a serious and widespread concern for men in the US. This project explores several clinical variables from a sample prostate cancer dataset.

II. Data

For my final project, I chose the prostate cancer data from the datasets for "The Elements of Statistical Learning."

1. Packages

2. Read Data

Details can be found here.

3. Objective

The objective of this study is to cluster groups of men based on metrics related to prostate and reproductive health.

Similarities within clusters and differences between clusters will be discussed.

III. Exploratory Data Analysis

There are no missing values and several variables have already been log transformed.
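A missing-value check of this kind can be sketched with pandas. The frame below is a synthetic stand-in with made-up values, not the prostate data itself; only the column names are borrowed from this report.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the prostate data: all values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(40, 80, size=20),
    "logPSA": rng.normal(2.5, 1.0, size=20),
    "SVI": rng.integers(0, 2, size=20),
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = df.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```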

1. Distributions and Histograms

2. Skew

Age is left skewed.

logCV, logPSA, and logPW have been appropriately log transformed and are close to normally distributed.

logBPH, logCP, and Gleason2 all have over 40% of their observations in the lowest-value bin.

Gleason1 and SVI are both categorical and are unbalanced.
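The skew observations above could be quantified roughly as follows. This is a minimal sketch on synthetic columns built to mimic the three shapes described (a left-skewed variable, a near-normal one, and one with a spike at its minimum); the numbers are made up and are not the study's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic columns (made-up values) that imitate the shapes described above.
df = pd.DataFrame({
    # Long tail toward low values, so the skewness is negative (left skew).
    "Age": 80 - rng.gamma(shape=2.0, scale=5.0, size=n),
    # Roughly symmetric, so the skewness is close to zero.
    "logPSA": rng.normal(2.5, 1.0, size=n),
    # 45% of observations sit exactly at the minimum value.
    "logBPH": np.concatenate([
        np.full(225, -1.39),
        -1.39 + rng.gamma(shape=2.0, scale=0.8, size=n - 225),
    ]),
})

# Sample skewness per column: negative = left skew, positive = right skew.
skewness = df.skew()

# Share of observations equal to each column's minimum value.
share_at_min = df.eq(df.min(), axis="columns").mean()

print(skewness.round(2))
print(share_at_min.round(2))
```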

3. Correlation

Gleason1 and Gleason2 have the highest correlation at around 0.75.

logCV and logPSA, logCP and logCV, and logCP and SVI also have high correlations.

No variables will be removed, as multicollinearity is not an issue.
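A ranking of pairwise correlations like the one summarized above can be sketched with pandas. The data here is synthetic: two columns are constructed to be correlated and one is independent, so the values shown are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Synthetic data (made-up values): logCV and logPSA share a common component.
shared = rng.normal(size=n)
df = pd.DataFrame({
    "logCV": shared + rng.normal(scale=0.5, size=n),
    "logPSA": shared + rng.normal(scale=0.5, size=n),
    "Age": rng.normal(size=n),
})

# Pearson correlation matrix.
corr = df.corr()

# Keep only the upper triangle so each pair appears once, then rank.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]

print(pairs.round(2))
print("Most correlated pair:", top_pair)
```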

IV. Clustering Models

The data will be split into training (70%) and validation (30%) sets.
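A 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The frame is a synthetic placeholder, and the `random_state` value is an arbitrary choice for reproducibility, not one taken from this report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic placeholder frame (made-up values) with 100 rows.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["logCV", "logPSA", "Age"])

# Hold out 30% of the rows for validation; random_state is an arbitrary seed.
train, valid = train_test_split(df, test_size=0.30, random_state=42)

print(train.shape, valid.shape)
```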

I am using three unsupervised learning algorithms: KMeans, Hierarchical Agglomerative Clustering, and MeanShift.

For each of the three models, the cluster results of the train set will be represented as the color in a grid of pair-plots for all nine variables.

1. KMeans Clustering
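As a minimal sketch of this step, KMeans can be fit on the training set and then used to assign validation rows to the learned clusters. The two-dimensional blobs are synthetic, and the choice of three clusters is an assumption made for illustration, not a value from this report.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Synthetic data (made-up values): three well-separated 2-D blobs.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X_train = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
X_valid = np.vstack([c + rng.normal(size=(10, 2)) for c in centers])

# n_clusters=3 is an assumption for this sketch.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

train_labels = kmeans.labels_            # cluster of each training row
valid_labels = kmeans.predict(X_valid)   # nearest learned centroid for new rows

print(np.bincount(train_labels))
print(np.bincount(valid_labels))
```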

2. Hierarchical Agglomerative Clustering
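Agglomerative clustering can be sketched in a similar way with scikit-learn's `AgglomerativeClustering`. The data is synthetic, and both the cluster count and the Ward linkage are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(5)

# Synthetic data (made-up values): three well-separated 2-D blobs.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X_train = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])

# n_clusters=3 and linkage="ward" are assumptions for this sketch.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
train_labels = agg.fit_predict(X_train)

print(np.bincount(train_labels))
```

Note that `AgglomerativeClustering` has no `predict` method, so validation rows cannot be assigned to the fitted clusters directly the way they can with KMeans.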

3. MeanShift Clustering
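MeanShift does not take a cluster count; it takes a bandwidth instead. The sketch below estimates the bandwidth from the training data with `estimate_bandwidth`, using synthetic blobs and a quantile of 0.3 as an illustrative assumption rather than a setting from this report.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(6)

# Synthetic data (made-up values): three well-separated 2-D blobs.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [-6.0, 6.0]])
X_train = np.vstack([c + rng.normal(size=(30, 2)) for c in centers])
X_valid = np.vstack([c + rng.normal(size=(10, 2)) for c in centers])

# The quantile is an assumption; larger values give a wider kernel
# and therefore fewer clusters.
bandwidth = estimate_bandwidth(X_train, quantile=0.3)

ms = MeanShift(bandwidth=bandwidth).fit(X_train)
train_labels = ms.labels_
n_clusters = len(ms.cluster_centers_)   # discovered, not specified up front
valid_labels = ms.predict(X_valid)      # nearest discovered center

print("Estimated bandwidth:", round(float(bandwidth), 2))
print("Clusters found:", n_clusters)
```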

V. Final Model Selection

Hierarchical Agglomerative Clustering was chosen as the final model.