The data set used to build the prediction model has 19,622 observations of 160 variables. It contains measurements of movement taken at several positions on the body. Six participants were asked to perform dumbbell biceps curls.
I created a model and used it to determine the class of execution for 20 observations.
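The five classes are as follows:
A: exactly according to the specification
B: throwing the elbows to the front
C: lifting the dumbbell only halfway
D: lowering the dumbbell only halfway
E: throwing the hips to the front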
For more information, see the Human Activity Recognition project page.
Many of the variables consist mostly of NAs. Of the 160 variables, 67 contain 19,216 NAs each (about 98% of the observations) and were eliminated. The remaining 93 contain no NAs and were retained.
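The filtering step can be illustrated on a small scale. The sketch below uses a made-up three-column data frame (the values and the name `toy` are assumptions for illustration, not the real measurements) and `colSums(is.na(...))`, which gives the same per-column counts as the `sapply()` call in the Appendix.

```r
# Toy data frame standing in for the real measurements (hypothetical values).
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45),
  max_roll_belt = c(NA, NA, NA, 2.1),  # mostly missing, like the dropped variables
  classe        = c("A", "A", "B", "B")
)

# Count the NAs in each column.
na_count <- colSums(is.na(toy))

# Keep only the columns that have no missing values.
toy_complete <- toy[, na_count == 0, drop = FALSE]
names(toy_complete)
```

Here only `roll_belt` and `classe` survive, because `max_roll_belt` has three missing values.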
As someone who has been training regularly for many years, I am familiar with exercise form and biomechanics.
The most important and unique predictor is the participant. People have different proportions and perform exercises slightly differently from one another. However, the way an individual performs an exercise over time is characteristic of that individual.
Referring to the five classes of execution, it seems to me that stability is extremely important. The additional predictors I chose are thus the pitch, roll, and yaw of the dumbbell, forearm, arm, and belt. Any movement (and lack of movement) detected in any of the three dimensions for these four regions should accurately predict form and be unique for an individual.
Fortunately, none of these predictors were eliminated by removing NAs, which led me to believe I should have a reasonable model with these predictors.
90% of the data set was used to create the model. The remaining 10% was held out to estimate the out-of-sample error.
I chose 10-fold cross-validation with a random forest to derive a suitable model.
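To make the resampling scheme concrete, here is a minimal base R sketch of how 10 folds divide the training rows. It assumes 17,662 training rows (19,622 observations minus the 1,960 in the held-out set) and uses plain random fold assignment with an arbitrary seed. caret creates its own folds internally, so this is not the exact partition used for the model.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

n_train <- 17662  # assumed: 19,622 observations minus 1,960 held out for testing
k <- 10

# Assign every training row to one of k folds, as evenly as possible.
fold_id <- sample(rep(seq_len(k), length.out = n_train))

# In each round, one fold is held out and the model is fit on the other k - 1.
sizes <- t(sapply(seq_len(k), function(i) {
  c(fold = i, fit = sum(fold_id != i), held_out = sum(fold_id == i))
}))
sizes
```

Each row of `sizes` describes one round: roughly 90% of the training rows fit the model and the remaining fold scores it, and the ten scores are then averaged.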
The confusion matrix is displayed in the Appendix, and the out-of-sample error rate is approximately 0.36% (7 of the 1,960 held-out observations were misclassified).
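The error rate can be checked directly from the confusion matrix. The sketch below retypes the counts from the Appendix into a matrix (the names `cm` and `recall` are introduced here for illustration) and also computes the share of each actual class that was predicted correctly, which the report itself does not state.

```r
# Confusion matrix copied from the Appendix (rows = predicted, columns = actual).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(558,   1,   0,   0,   0,
      0, 377,   1,   0,   0,
      0,   1, 341,   2,   1,
      0,   0,   0, 319,   1,
      0,   0,   0,   0, 358),
  nrow = 5, byrow = TRUE,
  dimnames = list(pred = classes, actual = classes)
)

n_total <- sum(cm)                   # 1960 held-out observations
n_wrong <- n_total - sum(diag(cm))   # 7 misclassified
error_rate <- n_wrong / n_total      # 0.003571429, i.e. about 0.36%

# Share of each actual class that was predicted correctly (per-class recall).
recall <- diag(cm) / colSums(cm)
round(recall, 4)
```

Every class is recovered more than 99% of the time, so the small overall error is not concentrated in a single class.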
The predicted classes for the 20 observations are displayed at the end of the Appendix.
library(caret)  # provides createDataPartition(), trainControl(), and train()

# Dimensions of the data set and the six participants
dim(movement_data)
[1] 19622   160
print(unique(movement_data$user_name), max.levels = 0)
[1] carlitos pedro    adelmo   charles  eurico   jeremy  

# Count the NAs in each variable and keep only the variables with none
na_count <- sapply(movement_data, function(x) sum(is.na(x)))
na_df <- data.frame(na_count)
to_retain <- subset(na_df, na_count == 0)
new_movement_data <- movement_data[, row.names(to_retain)]
table(na_df$na_count)

    0 19216 
   93    67 

# Select the participant, the pitch/roll/yaw measurements, and the outcome
pry_names <- names(new_movement_data)[grep("^pitch_|^roll_|^yaw_", names(new_movement_data))]
predictors <- c("user_name", pry_names, "classe")
new_movement_data <- new_movement_data[, predictors]
# The 20 observations to predict have no classe column, so drop it by name
prediction_data <- prediction_data[, setdiff(predictors, "classe")]
predictors
 [1] "user_name"      "roll_belt"      "pitch_belt"     "yaw_belt"      
 [5] "roll_arm"       "pitch_arm"      "yaw_arm"        "roll_dumbbell" 
 [9] "pitch_dumbbell" "yaw_dumbbell"   "roll_forearm"   "pitch_forearm" 
[13] "yaw_forearm"    "classe"        

# Split the data 90/10 into training and testing sets
set.seed(12321)
in_train <- createDataPartition(y = new_movement_data$classe, p = 0.9, list = FALSE)
training <- new_movement_data[in_train, ]
testing <- new_movement_data[-in_train, ]

# Fit a random forest with 10-fold cross-validation
set.seed(32123)
train.control <- trainControl(method = "cv", number = 10)
model <- train(classe ~ ., data = training, method = "rf", trControl = train.control)

# Confusion matrix on the held-out 10% (rows = predicted, columns = actual)
pred <- predict(model, testing)
confusion_matrix <- table(pred, testing$classe)
confusion_matrix
    
pred   A   B   C   D   E
   A 558   1   0   0   0
   B   0 377   1   0   0
   C   0   1 341   2   1
   D   0   0   0 319   1
   E   0   0   0   0 358

# Out-of-sample error rate: misclassified observations / all test observations
(sum(confusion_matrix) - sum(diag(confusion_matrix))) / sum(confusion_matrix)
[1] 0.003571429

# Predicted classes for the 20 observations
print(predict(model, prediction_data), max.levels = 0)
 [1] B A B A A E D B A A B C B A E E A B B B