Machine Learning and Wine
Who says wine can’t be paired with Machine Learning?
Summary:
Ever wondered what would happen if we combined machine learning and wine together? In this fun little project, we attempt to use support vector machines using R to classify the quality of various red and white Portuguese “Vinho Verde” wines on a scale of 1-10 (1 being the worst to 10 being the best) using only the physiochemical properties of the wines. The full project involved using both support vector machines and neural networks to compare the differences in performance. To view more the project in it’s entirety, click on the github badge above. However, for the purposes of this post, we will focus primarily on how to deploy support vector machines.
Data:
The Wine Dataset can be found on the UCI Machine Learning Repository and includes 6,497 observations with 12 variables in total. In order to train the models, the data is split into three datasets using random sampling methods: Training, Validation, and Testing.
Step 1: Load Packages and Read in the Data
Let’s begin by the loading the necessary packages for this project.
library(ggplot2)
library(tidyverse)
library(kernlab)
library(e1071)
library(caret)
As a quick summary of each package, ggplot2 is used for visualization, tidyverse is a data cleaning tool, kernlab package contains multiple kernel based machine learning models including the support vector model used in this project, and the caret package contains miscellaneous functions for training and plotting classification and regression models.
Next, we read in the data.
white_wine <- read.csv("winequality-white.csv", sep = ";")
red_wine <- read.csv("winequality-red.csv", sep = ";")
Step 2: Inspect the data
In order to gain a better understanding of the data we are working with, it is important to explore the data. What type of data are we working with? Are there any missing values? How many observations are we working with? These questions provide a starting point when beginning a machine learning project.
The str function provides a quick and easy way to get a sense of the data.
str(white_wine)
    'data.frame':    4898 obs. of  12 variables:
    $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
    $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
    $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
    $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
    $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
    $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
    $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
    $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
    $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
    $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
    $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
    $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
str(red_wine)
    'data.frame':    1599 obs. of  12 variables:
    $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
    $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
    $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
    $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
    $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
    $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
    $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
    $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
    $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
    $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
    $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
    $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
Combine both red and white wine datasets.
wine_data <- rbind(white_wine, red_wine)
The summary function also provides quick insight into numerical data.
summary(wine_data)
    fixed.acidity    volatile.acidity  citric.acid     residual.sugar
    Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600
    1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800
    Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000
    Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443
    3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100
    Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800
    chlorides       free.sulfur.dioxide total.sulfur.dioxide
    Min.   :0.00900   Min.   :  1.00      Min.   :  6.0
    1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0
    Median :0.04700   Median : 29.00      Median :118.0
    Mean   :0.05603   Mean   : 30.53      Mean   :115.7
    3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0
    Max.   :0.61100   Max.   :289.00      Max.   :440.0
    density             pH          sulphates         alcohol
    Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00
    1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50
    Median :0.9949   Median :3.210   Median :0.5100   Median :10.30
    Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49
    3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30
    Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90
    quality
    Min.   :3.000
    1st Qu.:5.000
    Median :6.000
    Mean   :5.818
    3rd Qu.:6.000
    Max.   :9.000
Check for NA values using the anyNA function.
anyNA(wine_data)
    [1] FALSE
There are no missing values in the data.
Plot each variable against “quality” in a matrix to visualize the data. The code below was written using the pipe function %>% from (Dplyr)[https://dplyr.tidyverse.org] and is especially useful for writing clean looking code.
wine_data %>%
  gather(-quality, key = "variables", value = "value") %>%
  ggplot(aes(x = value, y = quality, color = variables)) +
  geom_point(alpha = 1/4) +
  facet_wrap(~ variables, scales = "free") +
  scale_fill_brewer(palette = "Set3",
                    name = "variables")
 
      
    
  
  
    Visualize the data to see the distribution of the various wine qualities.
wine_data$quality %>% table() %>%
  as.data.frame() %>%
  ggplot(aes(x = ., y = Freq)) +
  geom_bar(stat = "identity")
 
      
    
  
  
    Get the table counts for the number of observations for each quality of wine.
wine_data$quality %>% table()
    3    4    5    6    7    8    9
    30  216 2138 2836 1079  193    5
Based on the table above, wine qualities of three and nine only make up 0.54% of all the observations. Because of this, it may be better to filter out these observations as the model will not have sufficient data to train on in order to accurately classify them. Therefore, the project will focus on classifying wine qualities ranging from four to eight.
Filter out any observations with wine qualities of three or nine.
wine_data <- wine_data %>%
  filter(quality > 3 & quality < 9)
# Check to make sure observations were removed
unique(wine_data$quality)
    [1] 6 5 7 8 4
Support Vector Machines assume the data is within a standard range of 0 to 1. Therefore, SVMs require data to be normalized prior to training the model. The custom function below normalizes data.
# Function to normalize the data
normalize <- function(x) {
    return((x - min(x)) / (max(x) - min(x)))
}
# Remove rownames from the dataframe
row.names(wine_data) <- c()
wine_data_norm <- data.frame(lapply(wine_data, normalize))
# Check data to make sure variables were normalized
str(wine_data_norm)
    'data.frame':    6462 obs. of  12 variables:
    $ fixed.acidity       : num  0.264 0.207 0.355 0.281 0.281 ...
    $ volatile.acidity    : num  0.152 0.176 0.16 0.12 0.12 0.16 0.192 0.152 0.176 0.112 ...
    $ citric.acid         : num  0.217 0.205 0.241 0.193 0.193 ...
    $ residual.sugar      : num  0.3083 0.0153 0.0966 0.1212 0.1212 ...
    $ chlorides           : num  0.0598 0.0664 0.0681 0.0814 0.0814 ...
    $ free.sulfur.dioxide : num  0.32 0.0945 0.2109 0.3345 0.3345 ...
    $ total.sulfur.dioxide: num  0.485 0.373 0.269 0.533 0.533 ...
    $ density             : num  0.268 0.133 0.154 0.164 0.164 ...
    $ pH                  : num  0.217 0.45 0.419 0.364 0.364 ...
    $ sulphates           : num  0.129 0.152 0.124 0.101 0.101 ...
    $ alcohol             : num  0.116 0.217 0.304 0.275 0.275 ...
    $ quality             : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Confirm the data has been normalized and that there are still only 5 qualities of wine
unique(wine_data_norm$quality)
    [1] 0.50 0.25 0.75 1.00 0.00
Split the data into training, validation, and test datasets using random sampling.
# Set seed for duplication purposes
set.seed(123)
# Randomly sample and split the data
ss <- sample(1:3,
             size=nrow(wine_data_norm),
             replace=TRUE,
             prob=c(0.6,0.2,0.2))
train <- wine_data[ss==1,]
validation <- wine_data[ss==2,]
test <- wine_data[ss==3,]
# Datasets with normalized observations
train_norm <- wine_data_norm[ss==1,]
validation_norm <- wine_data_norm[ss==2,]
test_norm <- wine_data_norm[ss==3,]
Step 3: Train the SVM model
Start with a simple linear SVM.
# Begin training with a simple linear SVM
quality_classifier <- ksvm(quality ~.,
                           data = train_norm,
                           kernel = "vanilladot")
    ##  Setting default kernel parameters
print(quality_classifier)
    Support Vector Machine object of class "ksvm"
    SV type: eps-svr  (regression)
    parameter : epsilon = 0.1  cost C = 1
    Linear (vanilla) kernel function.
    Number of Support Vectors : 3518
    Objective Function Value : -2172.043
    Training error : 0.698757
In order to convert the normalized quality values back to it’s original scale (1-10), it will be useful to create a function that can be called to convert each predicted value to back to its respected quality value.
round_predictions <- function(x) {
  if(x >= 0 & x <= 0.125) {
    x = 4
  } else if (x > 0.125 & x <= 0.375) {
    x = 5
  } else if (x > 0.375 & x <= 0.625) {
    x = 6
  } else if (x > 0.625 & x <= 0.875) {
    x = 7
  } else if (x > 0.875 & x <= 1) {
    x = 8
  } else if (x < 0) { # Sometimes the model may output negative values, in this case set it equal to a quality of 4
    x = 4
  } else if (x > 1.0) { # Sometimes the model may output values greater than 1, in this case set it equal to a quality of 8
    x = 8
  }
}
Using the trained simple linear SVM model, apply the model to the normalized validation dataset to check the accuracy of the model.
# Normalized validation set predictions
pred_validation <- predict(quality_classifier, validation_norm)
# Convert the predictions back to integers and store the predicted values
pred_validation <- sapply(pred_validation, round_predictions)
# Accuracy of the model
mean(pred_validation == validation$quality)
    [1] 0.5399061
The simple linear SVM results in 54% accuracy.
Create a confusion matrix to visualize the classification accuracy.
validation_results <- data.frame(cbind(validation$quality, pred_validation))
# Change the column names of the table
names(validation_results) <- c("actual_quality", "predicted_quality")
# Remove rownames
rownames(validation_results) <- c()
# Convert predicted and actual results from numerical to factors
validation_results$actual_quality <- factor(validation_results$actual_quality, levels = c("4", "5", "6", "7", "8"))
validation_results$predicted_quality <- factor(validation_results$predicted_quality, levels = c("4", "5", "6", "7", "8"))
confusion_matrix <- as.data.frame(table(validation_results$actual_quality, validation_results$predicted_quality))
confusion_matrix <- validation_results %>%
  table() %>%
  as.data.frame()
ggplot(data = confusion_matrix,
       mapping = aes(x = actual_quality,
                     y = predicted_quality)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#ff7f50",
                      high = "#003767",
                      trans = "log")
 
      
    
  
  
    Step 4: Improve upon the model by implementing different kernels and fine-tuning the parameters
Because the data consists of multiple variables, a simple linear SVM may not result in the most accurate model. To try and improve the model, an RBF kernel with default parameters can be attempted next.
# RBF kernel SVM with default parameters
RBF_classifier <- svm(quality ~ .,
                           data = train_norm,
                           method = "C-classification",
                           kernel = "radial")
# Print out the parameters of cost and gamma from the updated algorithm
print(RBF_classifier)
    Call:
    svm(formula = quality ~ ., data = train_norm, method = "C-classification",
         kernel = "radial")
    Parameters:
        SVM-Type:  eps-regression
      SVM-Kernel:  radial
            cost:  1
           gamma:  0.09090909
         epsilon:  0.1
     Number of Support Vectors:  3329
Apply the new SVM model to the normalized validation dataset to check the accuracy of the model.
# Normalized validation dataset predictions
pred_validation <- predict(RBF_classifier, validation_norm)
# Convert the predictions back to integers and store the predicted values
pred_validation <- sapply(pred_validation, round_predictions)
# Accuracy of the model
mean(pred_validation == validation$quality)
     [1] 0.5821596
The RBF SVM model with default parameters results in 58.2% accuracy, an increase of 4.2% from the simple linear SVM.
Create a confusion matrix to visualize the classification accuracy.
validation_results <- data.frame(cbind(validation$quality, pred_validation))
# Change the column names of the table
names(validation_results) <- c("actual_quality", "predicted_quality")
# Remove rownames
rownames(validation_results) <- c()
# Convert predicted and actual results from numerical to factors
validation_results$actual_quality <- factor(validation_results$actual_quality, levels = c("4", "5", "6", "7", "8"))
validation_results$predicted_quality <- factor(validation_results$predicted_quality, levels = c("4", "5", "6", "7", "8"))
confusion_matrix <- as.data.frame(table(validation_results$actual_quality, validation_results$predicted_quality))
confusion_matrix <- validation_results %>%
  table() %>%
  as.data.frame()
ggplot(data = confusion_matrix,
       mapping = aes(x = actual_quality,
                     y = predicted_quality)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#ff7f50",
                      high = "#003767",
                      trans = "log")
 
      
    
  
  
    Lets try to tune and improve upon the RBF Kernel model by searching for the optimal parameters by using the tune.svm function.
# Obtain the column number of the quality variable
typeColNum <- grep("quality", names(wine_data))
# Find the optimal RBF parameters
RBF_tune <- tune.svm(x = train_norm[, -typeColNum],
                     y = train_norm[, typeColNum],
                     gamma = c(0.5, 1),
                     cost = c(1, 2),
                     epsilon = c(0.1, 0.2),
                     kernel = "radial")
# Print out the parameters of cost and gamma from the updated algorithm
RBF_tune$best.parameters$gamma
     [1] 1
RBF_tune$best.parameters$cost
     [1] 2
RBF_tune$best.parameters$epsilon
     [1] 0.2
Using the updated parameters for cost and gamma obtained from the tune.svm function, train the SVM model on the train dataset.
tuned_RBF_classifier <- svm(quality ~ .,
                            data = train_norm,
                            method = "C-classification",
                            kernel = "radial",
                            cost = RBF_tune$best.parameters$cost,
                            gamma = RBF_tune$best.parameters$gamma,
                            epsilon = RBF_tune$best.parameters$epsilon)
Apply the SVM model to the validation dataset to check the accuracy of the model
# Validation dataset predictions
pred_validation <- predict(tuned_RBF_classifier, validation_norm)
# Convert the predictions back to integers and store the predicted values
pred_validation <- sapply(pred_validation, round_predictions)
# Accuracy of the model
mean(pred_validation == validation$quality)
     [1] 0.6353678
The RBF SVM model with optimized parameters results in 61.2% accuracy, an increase of 3% from the default RBF SVM model.
Create a confusion matrix to visualize the classification accuracy.
validation_results <- data.frame(cbind(validation$quality, pred_validation))
# Change the column names of the table
names(validation_results) <- c("actual_quality", "predicted_quality")
# Remove rownames
rownames(validation_results) <- c()
# Convert predicted and actual results from numerical to factors
validation_results$actual_quality <- factor(validation_results$actual_quality, levels = c("4", "5", "6", "7", "8"))
validation_results$predicted_quality <- factor(validation_results$predicted_quality, levels = c("4", "5", "6", "7", "8"))
confusion_matrix <- as.data.frame(table(validation_results$actual_quality, validation_results$predicted_quality))
confusion_matrix <- validation_results %>%
  table() %>%
  as.data.frame()
ggplot(data = confusion_matrix,
       mapping = aes(x = actual_quality,
                     y = predicted_quality)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#ff7f50",
                      high = "#003767",
                      trans = "log")
 
      
    
  
  
    Let’s experiment with a different variation of the SVM model by using the polynomial kernel and optimizing the parameters using the tune.svm function. It is best to only optimize three parameters at once as increasing the parameters may result in extended or indefinite run times. Once a few parameters have been optimized, attempt to optimize the rest.
# Obtain the best parameters for polynomial type kernel for the SVM model
polynomial_tune <- tune.svm(x = train_norm[, -typeColNum],
                            y = train_norm[, typeColNum],
                            cost = c(0.1, 1),
                            coef0 = c(0.1, 1, 2),
                            degree = c(2, 3),
                            epsilon = 0.1,
                            gamma = 0.1,
                            kernel = "polynomial")
# Using the updated parameters for coef0 and degree obtained from the tune.svm function, train the SVM model on the train dataset
tuned_polynomial_classifier <- svm(quality ~ .,
                                  data = train_norm,
                                  method = "C-classification",
                                  kernel = "polynomial",
                                  cost = polynomial_tune$best.parameters$cost,
                                  coef0 = polynomial_tune$best.parameters$coef0,
                                  degree = polynomial_tune$best.parameters$degree,
                                  epsilon = polynomial_tune$best.parameters$epsilon,
                                  gamma = polynomial_tune$best.parameters$gamma)
# Validation set predictions
pred_validation <-predict(tuned_polynomial_classifier, validation_norm)
# Convert the predictions back to integers and store the predicted values
pred_validation <- sapply(pred_validation, round_predictions)
# Accuracy of the model
mean(pred_validation == validation$quality)
    ## [1] 0.556338
The polynomial SVM model with optimized parameters results in 55.6% accuracy, a decrease of 4.6% from the optimized RBF SVM model.
Create a confusion matrix to visualize the classification accuracy.
validation_results <- data.frame(cbind(validation$quality, pred_validation))
# Change the column names of the table
names(validation_results) <- c("actual_quality", "predicted_quality")
# Remove rownames
rownames(validation_results) <- c()
# Convert predicted and actual results from numerical to factors
validation_results$actual_quality <- factor(validation_results$actual_quality, levels = c("4", "5", "6", "7", "8"))
validation_results$predicted_quality <- factor(validation_results$predicted_quality, levels = c("4", "5", "6", "7", "8"))
confusion_matrix <- as.data.frame(table(validation_results$actual_quality, validation_results$predicted_quality))
confusion_matrix <- validation_results %>%
  table() %>%
  as.data.frame()
ggplot(data = confusion_matrix,
       mapping = aes(x = actual_quality,
                     y = predicted_quality)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#ff7f50",
                      high = "#003767",
                      trans = "log")
 
      
    
  
  
    To experiment further with one last SVM model, we’ll use the Sigmoid Kernel SVM, and fine tune the parameters as well.
# Obtain the best parameters for polynomial type kernel for the SVM model
sigmoid_tune <- tune.svm(x = train_norm[, -typeColNum],
                         y = train_norm[, typeColNum],
                         gamma = c(0.01, 0.1, 1),
                         coef0 = c(1, 5),
                         cost = 3,
                         epsilon = 0.2,
                         kernel = "sigmoid")
# Using the updated parameters for cost, gamma, coef0, and degree obtained from the tune.svm function, train the SVM model on the train dataset
tuned_sigmoid_classifier <- svm(quality ~ .,
                                data = train_norm,
                                method = "C-classification",
                                kernel = "sigmoid",
                                gamma = sigmoid_tune$best.parameters$gamma,
                                coef0 = sigmoid_tune$best.parameters$coef0,
                                cost = sigmoid_tune$best.parameters$cost,
                                epsilon = sigmoid_tune$best.parameters$epsilon)
# Validation set predictions
pred_validation <- predict(tuned_sigmoid_classifier, validation_norm)
# Store the predicted values
pred_validation <- sapply(pred_validation, round_predictions)
# Accuracy of the model
mean(pred_validation == validation$quality)
     [1] 0.456964
The sigmoid SVM model with optimized parameters results in 45.7% accuracy, a substantial decrease of 15.5% from the optimized RBF SVM model.
Create a confusion matrix to visualize the classification accuracy.
validation_results <- data.frame(cbind(validation$quality, pred_validation))
# Change the column names of the table
names(validation_results) <- c("actual_quality", "predicted_quality")
# Remove rownames
rownames(validation_results) <- c()
# Convert predicted and actual results from numerical to factors
validation_results$actual_quality <- factor(validation_results$actual_quality, levels = c("4", "5", "6", "7", "8"))
validation_results$predicted_quality <- factor(validation_results$predicted_quality, levels = c("4", "5", "6", "7", "8"))
confusion_matrix <- as.data.frame(table(validation_results$actual_quality, validation_results$predicted_quality))
confusion_matrix <- validation_results %>%
  table() %>%
  as.data.frame()
ggplot(data = confusion_matrix,
       mapping = aes(x = actual_quality,
                     y = predicted_quality)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#ff7f50",
                      high = "#003767",
                      trans = "log")
 
      
    
  
  
    Out of the three optimized models, the optimized RBF model resulted in the highest accuracy at 61.2%. However, with the accuracy maxing out just above 60%, it is clear that SVM models are either not the best classification method for this project or the data is simply too “noisy” and “non-informative”.
Step 5: Apply the best performing model to the test dataset
Although neither of the models performed well, we will finish the project by applying the best model discovered. Using the optimized parameters for cost, gamma, and epsilon obtained from the tune.svm function, we will apply the RBF Kernel model to the test dataset.
# Testdataset predictions
pred_test <- predict(tuned_RBF_classifier, test_norm)
# Convert the predictions back to integers and store the predicted values
pred_test <- sapply(pred_test, round_predictions)
# Accuracy of the model
mean(pred_test == test$quality)
     [1] 0.6109375
The final result is 61.1% accuracy
Create a confusion matrix with statistics
test_results <- data.frame(cbind(test$quality, pred_test))
names(test_results) <- c("actual_quality", "predicted_quality")
rownames(test_results) <- c()
test_results$actual_quality <- factor(test_results$actual_quality, levels = c("4", "5", "6", "7", "8"))
test_results$predicted_quality <- factor(test_results$predicted_quality, levels = c("4", "5", "6", "7", "8"))
str(test_results)
     'data.frame':    1280 obs. of  2 variables:
      $ actual_quality   : Factor w/ 5 levels "4","5","6","7",..: 3 2 4 5 3 3 3 3 2 4 ...
      $ predicted_quality: Factor w/ 5 levels "4","5","6","7",..: 3 3 3 4 2 3 3 3 2 3 ...
Let’s create a confusion Matrix and compare actualy quality vs predicted quality.
confusionMatrix(test_results$actual_quality, test_results$predicted_quality)
     Confusion Matrix and Statistics
               Reference
     Prediction   4   5   6   7   8
              4   3  20  26   0   0
              5   0 229 194   5   0
              6   0  80 437  27   0
              7   0   5 108 106   1
              8   0   0  20  12   7
    Overall Statistics
                    Accuracy : 0.6109
                      95% CI : (0.5836, 0.6378)
         No Information Rate : 0.6133
         P-Value [Acc > NIR] : 0.5804
                       Kappa : 0.3841
      Mcnemar's Test P-Value : NA
     Statistics by Class:
                          Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
     Sensitivity          1.000000   0.6856   0.5567  0.70667 0.875000
     Specificity          0.963978   0.7896   0.7838  0.89912 0.974843
     Pos Pred Value       0.061224   0.5350   0.8033  0.48182 0.179487
     Neg Pred Value       1.000000   0.8768   0.5272  0.95849 0.999194
     Prevalence           0.002344   0.2609   0.6133  0.11719 0.006250
     Detection Rate       0.002344   0.1789   0.3414  0.08281 0.005469
     Detection Prevalence 0.038281   0.3344   0.4250  0.17188 0.030469
     Balanced Accuracy    0.981989   0.7376   0.6703  0.80289 0.924921
Create a confusion matrix to visualize the classification accuracy.
confusion_matrix <- as.data.frame(table(test_results$actual_quality, test_results$predicted_quality))
confusion_matrix <- test_results %>%
  table() %>%
  as.data.frame()
ggplot(data = confusion_matrix,
       mapping = aes(x = actual_quality,
                     y = predicted_quality)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "#ff7f50",
                      high = "#003767",
                      trans = "log")
 
      
    
  
  
    Conclusion:
Upon examining the visualization of the confusion matrix, it is clear that model excelled only at classifying wines with a quality of 6 with an accuracy rate of 75.4% despite having removed quality three and nine wines. The model struggles to classify all other qualities of wine with all others having an accuracy rate below 63%. Based on prior experience working with classification datasets, the problem appears to stem with the data either being too noisy, non-informative, or a combination of both. It is likely that the physiochemical properties of the various wines provide no insight into the quality of the wine. Another problem that arises with the dataset is there is no clear scientific methodology for classifying and rating the qualities of the wines. This information is crucial as there is no way to verify the reasons behind each rating and whether the reasons remain consistent with each wine. Moving forward with this project would require information on how the rating system works and possibly introducing further data such as the what grapes were used, the age of the wine, and how the wine was distilled and stored.
Regardless of the results, this was a fun project that demonstrates how widely applicable data science and machine learning is. Careful not to get too drunk on data and information though!
