This vignette details why the stranded_data dataset was created, how to load it, and gives examples of its use with the caret machine learning library.
The dataset contains the following columns, shown here by loading it and inspecting it with glimpse():
library(NHSRdatasets)
library(dplyr)
library(ggplot2)
library(caret)
library(rsample)
library(varhandle)
data("stranded_data")
glimpse(stranded_data)
#> Rows: 768
#> Columns: 9
#> $ stranded.label <chr> "Not Stranded", "Not Stranded", "Not Stranded~
#> $ age <int> 50, 31, 32, 69, 33, 75, 26, 64, 53, 63, 30, 7~
#> $ care.home.referral <int> 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, ~
#> $ medicallysafe <int> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, ~
#> $ hcop <int> 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, ~
#> $ mental_health_care <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ~
#> $ periods_of_previous_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 4, ~
#> $ admit_date <chr> "29/12/2020", "11/12/2020", "19/01/2021", "07~
#> $ frailty_index <chr> "No index item", "No index item", "No index i~
prop.table(table(stranded_data$stranded.label))
#>
#> Not Stranded Stranded
#> 0.5963542 0.4036458
This is good: it shows a reasonably even split between the Not Stranded and Stranded labels (roughly 60% to 40%). Please refer to the webinar on Advanced Modelling to see how you can deal with classification imbalance using techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples), to name a few.
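For a more heavily imbalanced label, one simple option (simpler than SMOTE or ROSE, and swapped in here only for illustration) is random up-sampling of the minority class. The sketch below assumes caret is installed and uses a small made-up data frame, not stranded_data:

```r
library(caret)

# Hypothetical imbalanced data: 90 "Not Stranded" rows and 10 "Stranded" rows
toy <- data.frame(age = c(rep(40, 90), rep(80, 10)))
label <- factor(c(rep("Not Stranded", 90), rep("Stranded", 10)))

# upSample() resamples minority-class rows with replacement
# until every class has as many rows as the largest class
balanced <- caret::upSample(x = toy, y = label, yname = "stranded.label")

table(balanced$stranded.label)
```

Any resampling of this kind should be applied to the training data only, so that the test set still reflects the real class balance.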
The next step will be to decide which features need to be engineered for our machine learning model. We will drop the admit_date and recode the frailty index, and perhaps allocate the age into age bands.
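Age banding is not carried out in the code that follows, but a minimal sketch of how it could be done with base R's cut() is shown here. The band edges and labels are assumptions chosen for illustration, and the ages are made-up values rather than stranded_data$age:

```r
# Hypothetical ages; in the vignette this would be stranded_data$age
age <- c(26, 31, 50, 64, 69, 75)

# Band edges and labels are illustrative assumptions
age_band <- cut(
  age,
  breaks = c(0, 39, 64, Inf),
  labels = c("Under 40", "40-64", "65+"),
  right = TRUE  # intervals are (lower, upper], so 64 falls in "40-64"
)

table(age_band)
```

The result is a factor, so it could be dummy encoded in the same way as the frailty index.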
stranded_data <- stranded_data %>%
  dplyr::mutate(stranded.label = factor(stranded.label)) %>%
  dplyr::select(everything(), -c(admit_date))
Next, I will select the categorical variables and make these into dummy variables, i.e. a numerical encoding of a categorical variable:
cats <- select_if(stranded_data, is.character)
# Convert the frailty index column to dummy encoding, with "frail_ind" as the column prefix
cat_dummy <- varhandle::to.dummy(cats$frailty_index, "frail_ind")
cat_dummy <- cat_dummy %>%
  as.data.frame() %>%
  dplyr::select(-frail_ind.No_index_item) # Drop "No index item" so it acts as the reference level
# Drop the frailty index from the stranded data frame and bind on the new encoded categorical variables
stranded_data <- stranded_data %>%
  dplyr::select(-frailty_index) %>%
  bind_cols(cat_dummy) %>% na.omit(.)
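The same kind of encoding can be sketched without varhandle, using base R's model.matrix(). This is an alternative shown for illustration only, not the method used in this vignette, and the frailty values below are made up to mirror the levels seen in the output:

```r
# Hypothetical frailty values mirroring the levels in the dataset
frailty_index <- c("No index item", "Mobility problems",
                   "Activity Limitation", "Fall patient history",
                   "No index item")

# Make "No index item" the first (reference) level so it is the one left out
frailty <- relevel(factor(frailty_index), ref = "No index item")

# model.matrix() returns an intercept plus one 0/1 column per non-reference level;
# [, -1] removes the intercept column
dummies <- model.matrix(~ frailty)[, -1, drop = FALSE]

colnames(dummies)
```

Rows whose value is the reference level have zeros in every dummy column, which is why one level can be dropped without losing information.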
The data is now ready for modelling. The next step is to create a simple hold-out train/test split:
split <- rsample::initial_split(stranded_data, prop = 3/4)
train <- rsample::training(split)
test <- rsample::testing(split)
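The split is random, so the rows that land in each set change from run to run unless a seed is set first. A minimal sketch of a reproducible, stratified variant follows, assuming rsample is installed. The toy data frame and the use of the strata argument are illustrative additions, not part of the original workflow:

```r
library(rsample)

set.seed(123)  # fixes the random split so it can be reproduced

# Hypothetical data with a 60/40 class balance
toy <- data.frame(
  age = sample(20:90, 200, replace = TRUE),
  stranded.label = factor(rep(c("Not Stranded", "Stranded"), times = c(120, 80)))
)

# strata keeps the class proportions similar in the training and test sets
split <- rsample::initial_split(toy, prop = 3/4, strata = stranded.label)
train <- rsample::training(split)
test  <- rsample::testing(split)

prop.table(table(train$stranded.label))
prop.table(table(test$stranded.label))
```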
The next step is to fit a stranded classification model with caret:
set.seed(123)
glm_class_mod <- caret::train(factor(stranded.label) ~ ., data = train,
                              method = "glm")
print(glm_class_mod)
#> Generalized Linear Model
#>
#> 525 samples
#> 9 predictor
#> 2 classes: 'Not Stranded', 'Stranded'
#>
#> No pre-processing
#> Resampling: Bootstrapped (25 reps)
#> Summary of sample sizes: 525, 525, 525, 525, 525, 525, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.7769949 0.43233
This is a very basic model and could be improved by model choice, hyperparameter selection, different resampling strategies, etc.
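As one example of a different resampling strategy, the default bootstrap can be replaced with repeated k-fold cross-validation through caret::trainControl(). The sketch below assumes caret is installed and uses simulated data in place of the training set; the choice of 10 folds and 3 repeats is an assumption, not a recommendation from this vignette:

```r
library(caret)

set.seed(123)

# Simulated two-class data standing in for the training set
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$label <- factor(ifelse(toy$x1 + rnorm(n) > 0, "Stranded", "Not_Stranded"))

# 10-fold cross-validation, repeated 3 times (30 resamples in total)
ctrl <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 3)

cv_mod <- caret::train(label ~ ., data = toy, method = "glm", trControl = ctrl)

# Accuracy and Kappa averaged over the resamples
cv_mod$results[, c("Accuracy", "Kappa")]
```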
Next, we will use the test dataset to see how our model will perform in the wild:
preds <- predict(glm_class_mod, newdata = test) # Predict class
pred_prob <- predict(glm_class_mod, newdata = test, type = "prob") # Predict probabilities

# Join predictions on to the actual test data frame, ready for the confusion matrix
predicted <- data.frame(preds, pred_prob)
test <- test %>%
  bind_cols(predicted) %>%
  dplyr::rename(pred_class = preds)

glimpse(test)
#> Rows: 174
#> Columns: 13
#> $ stranded.label <fct> Not Stranded, Not Stranded, Not Strande~
#> $ age <int> 26, 70, 75, 67, 68, 43, 76, 62, 22, 78,~
#> $ care.home.referral <int> 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, ~
#> $ medicallysafe <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, ~
#> $ hcop <int> 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, ~
#> $ mental_health_care <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, ~
#> $ periods_of_previous_care <int> 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, ~
#> $ frail_ind.Activity_Limitation <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ~
#> $ frail_ind.Fall_patient_history <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ~
#> $ frail_ind.Mobility_problems <dbl> 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, ~
#> $ pred_class <fct> Not Stranded, Not Stranded, Not Strande~
#> $ Not.Stranded <dbl> 7.366174e-01, 7.610988e-01, 7.264844e-0~
#> $ Stranded <dbl> 0.2633826, 0.2389012, 0.2735156, 0.2232~
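Because the class probabilities are kept alongside the predicted class, the cut-off for calling a patient stranded can be varied rather than fixed at 0.5. A minimal base R sketch follows, with made-up probabilities and an assumed threshold of 0.4:

```r
# Hypothetical predicted probabilities for the "Stranded" class
p_stranded <- c(0.26, 0.24, 0.27, 0.62, 0.48)

# Assumed cut-off for illustration; lowering it flags more patients as stranded
threshold <- 0.4

pred_class_custom <- factor(
  ifelse(p_stranded >= threshold, "Stranded", "Not Stranded"),
  levels = c("Not Stranded", "Stranded")
)

table(pred_class_custom)
```

A lower threshold trades more false alarms for fewer missed stranded patients, so the right value depends on the cost of each kind of error.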
The final step is to evaluate the model:
caret::confusionMatrix(test$stranded.label, test$pred_class, positive="Stranded")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Not Stranded Stranded
#> Not Stranded 110 3
#> Stranded 35 26
#>
#> Accuracy : 0.7816
#> 95% CI : (0.7128, 0.8406)
#> No Information Rate : 0.8333
#> P-Value [Acc > NIR] : 0.9699
#>
#> Kappa : 0.4545
#>
#> Mcnemar's Test P-Value : 4.934e-07
#>
#> Sensitivity : 0.8966
#> Specificity : 0.7586
#> Pos Pred Value : 0.4262
#> Neg Pred Value : 0.9735
#> Prevalence : 0.1667
#> Detection Rate : 0.1494
#> Detection Prevalence : 0.3506
#> Balanced Accuracy : 0.8276
#>
#> 'Positive' Class : Stranded
#>
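The model reaches an overall accuracy of about 0.78 on the held-out test set. Note that caret::confusionMatrix() takes the predictions as its first argument and the reference labels as its second, whereas the call above passes them the other way round, so the reported Sensitivity and Pos Pred Value should be read as swapped, as should Specificity and Neg Pred Value. The model could be improved by better predictors, a bigger sample and class imbalance techniques.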
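To make the reading of the table concrete, the headline metrics can be recomputed by hand from the four counts printed above, treating pred_class as the prediction and stranded.label as the truth. This is a minimal base R sketch in which the counts are copied from the output and "Stranded" is the positive class:

```r
# Counts copied from the confusion matrix printed above
tp <- 26   # actually Stranded, predicted Stranded
fn <- 35   # actually Stranded, predicted Not Stranded
fp <- 3    # actually Not Stranded, predicted Stranded
tn <- 110  # actually Not Stranded, predicted Not Stranded

metrics <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),
  sensitivity = tp / (tp + fn),  # share of stranded patients the model finds
  specificity = tn / (tn + fp),  # share of not stranded patients correctly cleared
  precision   = tp / (tp + fp)   # share of "Stranded" predictions that are right
)

round(metrics, 4)
```

Read this way, the model identifies 26 of the 61 stranded patients in the test set, while 26 of its 29 "Stranded" predictions are correct.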
This dataset can be used for a number of classification problems and could serve as the NHS's equivalent of the iris dataset, although it is only suited to binary classification problems.