The data set used for prediction has 19,622 observations of 160 variables. It contains measurements of movement taken at several positions on the body. Six participants were asked to perform dumbbell bicep curls.
I created a model and used it to determine the class of execution for 20 observations.
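The five classes of execution are labelled A through E: A is the exercise performed exactly according to the specification, B is throwing the elbows to the front, C is lifting the dumbbell only halfway, D is lowering the dumbbell only halfway, and E is throwing the hips to the front.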
For more information, see the Human Activity Recognition project page.
Many of the variables consist mostly of NAs. The 67 variables that contain NAs (19,216 missing values each, about 98% of the observations) were eliminated, and the 93 variables with no NAs were retained.
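The filtering step can be illustrated on a small scale. The following is a minimal sketch on a made-up three-column data frame (the values and the `toy` and `max_roll` names are hypothetical, not taken from the real data). It uses `colSums(is.na(...))` as a compact way to count the NAs per column.

```r
# Toy data frame standing in for the real measurements (hypothetical values).
toy <- data.frame(
  roll_belt  = c(1.41, 1.42, 1.48),
  max_roll   = c(NA, NA, 3.2),      # mostly missing, like the eliminated columns
  pitch_belt = c(8.07, 8.06, 8.05)
)

# Count the NAs in each column.
na_count <- colSums(is.na(toy))

# Keep only the columns that have no NAs at all.
complete_cols <- names(na_count)[na_count == 0]
toy_clean <- toy[, complete_cols, drop = FALSE]

print(na_count)
print(names(toy_clean))
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives the filter.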
As someone who has been training regularly for many years, I am familiar with exercise form and biomechanics.
The most important and distinctive predictor is the participant. People have different proportions and perform exercises slightly differently from one another, but the way one individual performs over time is characteristic of that individual.
Referring to the five classes of execution, it seems to me that stability is extremely important. The additional predictors I chose are thus the pitch, roll, and yaw of the dumbbell, forearm, arm, and belt. Any movement (and lack of movement) detected in any of the three dimensions for these four regions should accurately predict form and be unique for an individual.
Fortunately, none of these predictors were eliminated by removing NAs, which led me to believe I should have a reasonable model with them.
90% of the data set was used to create the model. The remaining 10% was held out to estimate the out-of-sample error.
I chose 10-fold cross-validation with a random forest to derive a suitable model.
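To make the resampling scheme concrete, here is a minimal base R sketch of how 10-fold cross-validation partitions the training rows. It is an illustration only: the seed and variable names are arbitrary, the folds are assigned by plain random shuffling (caret's own fold construction may differ, for example by balancing classes), and no model is fitted. The row count of 17,662 assumes the 1,960 held-out observations shown in the Appendix confusion matrix.

```r
set.seed(1)          # arbitrary seed, only so the sketch is reproducible
n_rows <- 17662      # 19,622 observations minus the 1,960 held out for testing
k <- 10              # number of folds

# Give every row a fold id from 1 to k, shuffled so the folds are random.
fold_id <- sample(rep(seq_len(k), length.out = n_rows))

# Each fold should hold roughly one tenth of the rows.
fold_sizes <- table(fold_id)
print(fold_sizes)

for (i in seq_len(k)) {
  holdout_idx <- which(fold_id == i)   # rows scored in this round
  fit_idx     <- which(fold_id != i)   # rows used for fitting in this round
  # A model would be fitted on fit_idx and evaluated on holdout_idx here.
  stopifnot(length(holdout_idx) + length(fit_idx) == n_rows)
}
```

Every row is held out exactly once, so the ten evaluation rounds together cover the whole training set.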
The confusion matrix is displayed in the Appendix; the estimated out-of-sample error rate is approximately 0.36% (7 misclassifications out of 1,960 held-out observations).
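The error rate can be checked by hand from the confusion matrix. The sketch below re-enters the counts printed in the Appendix as a plain matrix (so it runs without the fitted model) and recomputes the overall error. It also reports the share of each actual class that was predicted correctly, which is an added illustration rather than part of the original analysis.

```r
classes <- c("A", "B", "C", "D", "E")

# Rows are predicted classes, columns are actual classes
# (counts copied from the confusion matrix in the Appendix).
cm <- matrix(
  c(558,   1,   0,   0,   0,
      0, 377,   1,   0,   0,
      0,   1, 341,   2,   1,
      0,   0,   0, 319,   1,
      0,   0,   0,   0, 358),
  nrow = 5, byrow = TRUE,
  dimnames = list(pred = classes, actual = classes)
)

n_total <- sum(cm)                    # all held-out observations
n_wrong <- n_total - sum(diag(cm))    # off-diagonal cells are misclassifications
error_rate <- n_wrong / n_total

# Share of each actual class that was predicted correctly.
per_class_correct <- diag(cm) / colSums(cm)

print(c(total = n_total, wrong = n_wrong))
print(round(100 * error_rate, 2))     # error rate as a percentage
print(round(per_class_correct, 4))
```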
The predicted classes for the 20 observations are displayed in the Appendix.
dim(movement_data)
[1] 19622 160
print(unique(movement_data$user_name), max.levels = 0)
[1] carlitos pedro adelmo charles eurico jeremy
na_count <- sapply(movement_data, function(x) sum(is.na(x)))
na_df <- data.frame(na_count)
to_retain <- subset(na_df, na_count == 0)
new_movement_data <- movement_data[, row.names(to_retain)]
table(na_df$na_count)
    0 19216
   93    67
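# Select the pitch, roll, and yaw columns by name prefix.
pry_names <- grep("^pitch_|^roll_|^yaw_", names(new_movement_data), value = TRUE)
predictors <- c("user_name", pry_names, "classe")
new_movement_data <- new_movement_data[, predictors]
# The prediction set has no outcome column, so drop "classe" by name.
prediction_data <- prediction_data[, setdiff(predictors, "classe")]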
predictors
[1] "user_name" "roll_belt" "pitch_belt" "yaw_belt"
[5] "roll_arm" "pitch_arm" "yaw_arm" "roll_dumbbell"
[9] "pitch_dumbbell" "yaw_dumbbell" "roll_forearm" "pitch_forearm"
[13] "yaw_forearm" "classe"
library(caret)  # provides createDataPartition(), trainControl(), and train()
set.seed(12321)
in_train <- createDataPartition(y = new_movement_data$classe, p = 0.9, list = FALSE)
training <- new_movement_data[in_train,]
testing <- new_movement_data[-in_train,]
set.seed(32123)
train.control <- trainControl(method = "cv", number = 10)
model <- train(classe ~ ., data = training, method = "rf", trControl = train.control)
pred <- predict(model, testing)
confusion_matrix <- table(pred, testing$classe)
confusion_matrix
pred   A   B   C   D   E
   A 558   1   0   0   0
   B   0 377   1   0   0
   C   0   1 341   2   1
   D   0   0   0 319   1
   E   0   0   0   0 358
(sum(confusion_matrix) - sum(diag(confusion_matrix))) / sum(confusion_matrix)
[1] 0.003571429
print(predict(model, prediction_data), max.levels = 0)
[1] B A B A A E D B A A B C B A E E A B B B