Yongxin Cui, Student, Business Analytic and Project Management Program, University of Connecticut School of Business
Tianhao Zhao, Student, Business Analytic and Project Management Program, University of Connecticut School of Business
Xinyin Miao, Student, Business Analytic and Project Management Program, University of Connecticut School of Business
1. Executive Summary
Discerning wine quality is especially important to wine merchants who sell to the general public, since they usually do not have the resources to stock a huge inventory of wines and need to be sure their inventory includes high-quality wines that satisfy the more and more discerning palate of the general public. Our team decided to use the data mining tools available in JMP to examine whether a wine’s quality, a subjective rating by the public, could be predicted from 11 descriptive variables (consisting of wine chemistry measurements or calculations commonly made from those measurements). Our goal was a parsimonious model, one with the smallest number of variables, that would help a winery to produce higher quality wine. What is more, our team would analyze the difference between red wine and white wine. Our preliminary conclusion is that for red wine four variables (volatile acidity, total sulfur dioxide, sulfates, and alcohol) may be sufficient predictors of high quality, while for white wine five variables (fixed acidity, volatile acidity, residual sugar, sulfates, and alcohol) may be sufficient predictors of high quality.
In this section, SEMMA method is used to explore the red and white wine datasets.
Both the red wine and white wine dataset have 12 variables. The red wine dataset has 1599 rows and the white wine dataset has 4898 rows. For this project, it was not necessary to utilize smaller samples of these datasets since both are small enough to process with the computer resources available and large enough to contain significant information.
For each wine, the dataset contains 11 predictors related to the physical and chemical properties of the wine and one subjective rating of wine quality. The measurement variables are all continuous. The quality rating variable is ordinal and ranges from 1 (lowest quality) to 10 (highest quality). The predictor variables are all continuous variables describing either the physical or chemical properties of the wine. Initially, density was supposed to have a strong relationship with some of the other variables, such as sulfur dioxide, sulfates, and other chemical components might influence the density of the wine. Moreover, the report of the distribution function in JMP indicates that each predictor variable was determined to be almost normal, except for residual sugar, which is highly left-skewed. If necessary residual sugar will be transformed in the analysis. The widely spread distributions of residual sugar and chlorides might also represent or hide outliers.
(1) Correct inconsistencies and obvious errors
There is no inconsistencies and obvious errors in the two datasets.
(2) Change the data type
Ideally, this data would determine the target, "quality level," as a continuous variable. But this could have led to problems. This quality level must lie between 1 and 10 to correspond to the subjective quality rating. If the quality level were continuous it could lead to unrealistic results (quality levels greater than 10 or less than 0). Making the quality level an ordinal that corresponded to the subjective ratings wouldn't make things better since in both cases a cut-off between high- and low-quality wines would still have to be determined. A single binary ordinal quality level would probably be better for merchants if most merchants would want to stock a broad range of better-quality wines. Since the subjective quality ratings for both red and white wines had a median value of 6, a quality level was chosen to be high if the subjective rating score was equal to or greater than 6 and low if the subjective rating score was below 6. The left one is red wine quality level, and the right one is a white wine quality level.
(3) Detect and Deal with outliers
The outliers or extreme values in the dataset might influence the analysis of the rest of the data. The result of Explore Outliers in JMP indicates that the two datasets include some extreme values, most of which are concentrated in the residual sugar, chlorides and sulfates. Also, the Robust Analysis function further locates “extreme” rows. In the red wine, the concentration of chlorides in rows 152 and 259 are 0.61 and 0.611, which are much higher than the values in other rows. Wine research has shown that the chloride can harm the health of people and many countries set upper limits on the amounts of chlorides. For example, in Australia the maximum level of chloride allowed is 607 mg/L, which is set as the chloride threshold value in this project. Also, there are some other legal limits. The list is as below.
(4) Handling Missing Value
Missing value occurs when no data is stored in observations. Function Explorer Missing Values shows that there are no missing values in red and white wine datasets.
(5) Data Reduction
Some variables might contain similar information, and including them could introduce unnecessary complexity into the model and perhaps even make the interpretation more difficult or confusing. Before making the model, it is necessary to exclude any variable which contains the same information as another variable. The Multivariate Function in JMP identifies relationships between dependent variables. In this project, two variables are supposed to contain potential overlap information if correlation value between them is greater than 0.6. Then, professional knowledge is needed to confirm the assumption.
Principal Components Analysis
Another method to reduce the overlap information between variables is Principal Components Analysis, which can remove the overlap of information between variables, reduce model’s redundancy and allow a smaller number of variables to have the ability to represent a huge amount of information. The PCA function in JMP creates principal components containing the most of information of dataset with fewer variables. In general, principal components whose eigenvalue is greater 1 should be chose. However, the first seven principal components in the red and white are supposed to be used since they contain approximately 90 percent of the information.
e. Supervised Learning- Classification Model
In this project, logistic regression, decision tree, bootstrap tree and forest, and neural networks were used to make model.
(1) Logistic Regression
Logistic regression can be used in the model where the outcome variables are ordinal or nominal, and this method is widely used, particularly where a structure model is intended to explain or to predict. Since in this wine analysis the goal is to see whether the wine is high quality or low quality, logistic regression is one of the best methods to use. The Fit Model Function in JMP could give a report of the model. Then, variables whose p-value is greater than 0.05 will be removed from the model if the misclassification rate is not influenced. Also, the Stepwise Function in the fit model could be used to find the best variable choice with max R-Square validation.
(2) Decision Tree (Partition)
For the Decision Tree Method, variables that are not transformed are input into the Partition Function, since any monotone transformation of the variables will give the same tree. The go button in the Partition function can be used to find the best splits automatically. As a result, the best split for red wine is 8 times and for white wine is 14 times. However, based on the split history, the value of R-square can be similar when we reduce the number of splits to 6 and 8 respectively. This method can reduce the complexity of the model without affecting its accuracy.
(3) Bootstrap Tree and Bootstrap Forest
Like the Decision Tree Method, JMP can report the result of the booted tree and booted forest model when variables are input in the function. Importantly, in order to prevent the forest model from becoming too complex, we reduce the max split time from 2000 to 25.
(4) Neural networks
The advantage of neural networks is their potential to deliver high predictive performance. Their structure supports capturing very complex relationships between predictors and a response, which is often not possible with other predictive models. But neural network input variables must be pre-selected since the neural network cannot select variables by itself. The common approach to finding the best performing neural network is to start with a very simple default neural model, with one layer and three nodes, then build a more complex model with two layers, several nodes, and different activation functions, and continue this process until satisfactory predictions are obtained.
f. Results & Access
The comparing matrix indicates that the Entropy R-Square and Generalized R-Square of the Boosted Forest are the highest one, and RMSE and misclassification rate are the lowest. Also, the accuracy of Booted Forest is also the highest one. However, logistic regression is chosen as the best model, even though it's each parameter is not the best. The parameter is close to that of other models. Furthermore, the logistic model is simple, and predictors are easy to be interpreted. As a result, the logistic regression models of red and white are supposed to be the best model.
3. Conclusions and Recommendations
In the red wine, Alcohol, Residual Sugar, Sulfates, and Total Sulfur Dioxide have a Positive Relationship with high quality, while Fixed Acidity and Volatile Acidity have a negative relationship with high quality. In white wine, Alcohol, and Sulphate have a positive relationship with high-quality, while Volatile Acidity, Citric Acid, and Total Sulfur Dioxide have a low relationship with high quality. The model formulas are as below. The left one the red wine while the right one is white wine. (Note: The formula is for low quality)
Acids are one of 4 fundamental traits in wine and give wine tart and sour taste. Acidity adds a sense of vitality and refreshment to the wine and also make the flavor of the wine more prominent. Acid has an appetizing effect, so you want to have another cup. In Low levels, it adds to the complexity of flavor but in High levels it causes the wine to spoil.
TOTOL SULFUR DIOXIDE AND SULPHATES
Sulfites preserve wine and slow chemical reactions, which cause the wine to go bad. Stop bacteria and yeasts from growing, keeping the fresh taste of the wine. However, the high sulfur dioxide content will produce an unpleasant smell like a rotten egg, which may cause health problems after drinking.
Residual Sugar (or RS) is from natural grape sugars leftover in a wine after the alcoholic fermentation finishes. It can improve one of 4 fundamental traits in wine, the sweetness.
Alcohol is also one of 4 fundamental traits. It can improve lymphatic and it often manifests as a burning sensation in the base of the tongue and in the throat.
Recommendations for manufacture
In White Wine, it is good to manage Volatile Acidity level, decrease Fixed Acidity, add fewer residual sugars, increase both sulfur dioxide and sulfate, and Increase alcohol Concentration within Proper Limit. In red wine, it is good to manage Volatile Acidity level, decrease Citric acidity in red wine, keep or even reduce total sulfur dioxide, add more sulfate and Increase alcohol Concentration within Proper Limit.
Recommendations for consumers
Color, Alcohol, Smell are a good signal for the quality of the wine. Choose the wine with light color. The darker the color, the greater the level of acidity. But the acidity has a negative relationship with quality level. Smell the cork before buy it. The rotten egg smell means the wine is unhealthy especially for red wine. Choose the wine with higher alcohol concentration if you can accept.
Coli, Marina Sonegheti, et al. “Chloride Concentration in Red Wines: Influence of Terroir and Grape Type.” Food Science and Technology (Campinas), Vol. 35, no. 1, 2015, pp. 95-99., doi:10.1590/1678-457x.6493.
Almeida, P. J., et al. “Free Sulfur Dioxide in Beer as the Difference between Total Sulfur Dioxide and Acetaldehyde: A Voltammetric Approach.” Journal of the American Society of Brewing Chemists, Vol. 61, no. 4, 2003, pp. 191-195., doi:10:1094/asbcj-61-0191.
... View more