This report explores physicochemical properties of red and white wines and tries to assess which factors influence wine quality the most. Additionally, relationships between the different parameters will be investigated.

Data Set

The two data sets used in this analysis were created by Cortez et al. [1] and are publicly available for research purposes. They contain physicochemical properties of red and white Vinho Verde wines together with their sensory quality as assessed by wine experts. For easier handling, both sets were combined into a single data frame.
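
A minimal sketch of how the two sets could be combined, assuming each has already been read into its own data frame. The names `red`, `white` and `wines` and the few stand-in rows are illustrative only; the `color` column mirrors the character column of that name in the summary output further down.

```r
# Tiny stand-in data frames; in practice these would come from reading
# the two published files (file names are not shown in this report).
red   <- data.frame(alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white <- data.frame(alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

# Tag each set with its color before stacking so the origin is not lost.
red$color   <- "red"
white$color <- "white"

# rbind() requires identical column names in both data frames.
wines <- rbind(red, white)

print(table(wines$color))
```

Keeping the color as an ordinary column makes it easy to facet or group by wine type later on.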

The combined set contains 1599 red and 4898 white wine samples, described by the following attributes:

Attribute             Unit     Description
Fixed Acidity         g/L      Concentration of non-volatile tartaric acid in the wine.
Volatile Acidity      g/L      Concentration of volatile acetic acid in the wine.
Citric Acid           g/L      Concentration of citric acid in the wine.
Residual Sugar        g/L      Concentration of sugar remaining in the wine after fermentation.
Chlorides             g/L      Concentration of sodium chloride in the wine.
Free Sulfur Dioxide   mg/L     Concentration of free, gaseous sulfur dioxide in the wine.
Total Sulfur Dioxide  mg/L     Total concentration of sulfur dioxide in the wine.
Density               g/cm3    Density of the wine.
pH                    1 - 14   Acidity of the wine.
Sulphates             g/L      Concentration of potassium sulfate in the wine.
Alcohol               vol%     Alcohol content of the wine.
Quality               1 - 10   Wine quality score as assessed by experts.

Summary Statistics of the Data Set

By printing out the summary statistics for the data set we can have a first look at the different value ranges for all attributes.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality         color          
##  Min.   :3.000   Length:6497       
##  1st Qu.:5.000   Class :character  
##  Median :6.000   Mode  :character  
##  Mean   :5.818                     
##  3rd Qu.:6.000                     
##  Max.   :9.000

One can see that most attributes have pronounced outliers, as their maximum values are much larger than their third quartiles.
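
One common way to make "much larger" concrete is Tukey's rule, which flags values above Q3 + 1.5 × IQR. The rule is an assumption added here for illustration, not a method this report states it used; the quartiles and maxima are copied from the summary output above.

```r
# Quartiles and maxima copied from the summary output above.
stats <- data.frame(
  attribute = c("residual.sugar", "chlorides", "free.sulfur.dioxide"),
  q1  = c(1.8, 0.038, 17),
  q3  = c(8.1, 0.065, 41),
  max = c(65.8, 0.611, 289)
)

# Upper fence of Tukey's rule: Q3 + 1.5 * (Q3 - Q1).
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)

# TRUE where the maximum lies beyond the fence, i.e. at least one
# observation would be flagged as an outlier under this rule.
stats$max_is_outlier <- stats$max > stats$upper_fence

print(stats)
```

For residual sugar, for example, the fence is 8.1 + 1.5 × 6.3 = 17.55 g/L, far below the maximum of 65.8 g/L.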

Univariate Plots

In this section univariate plots will be created to understand the individual variables in the data set.

Wine Quality

Wine quality shows a rather symmetrical distribution. Most wines have a quality score of 6. No wine achieved the highest score of 10 and the worst wines got a rating of 3.

Acidity

Fixed and volatile acidity show positively skewed distributions. Citric acid forms an edge peak distribution because a group of wines seems to have citric acid concentrations close to 0. The pH histogram appears more symmetrical. Only a few outliers are present for the four parameters.

Looking at the acidity parameters in boxplots gives a similar picture. One can see the long positive tails of the fixed and volatile acid concentrations and narrower distributions for citric acid and pH. This observation is also confirmed by the mean values that are displayed as “x” in the plots. They are closer to the respective medians for citric acid and pH than for fixed and volatile acidity.

Sulfur Content

Free sulfur dioxide concentration is narrowly centered around 30 mg/L. Total sulfur dioxide concentration shows signs of bimodality with peaks around 20 and 120 mg/L.

When plotting the ratio between free and total sulfur dioxide in the wines, one can see that typically about 25 % of the total sulfur dioxide occurs in its free form. The distribution is positively skewed, with a few wines that have considerably higher ratios. Also striking is the peak that occurs at exactly 0.5.

Most wines have a sulphate concentration around 0.5 g/L. Two small outlier groups around 1.6 and 1.9 g/L can be seen in the boxplot.

Residual Sugar

Generally, the wines in the data set appear to have low residual sugar concentrations. The positive skew pulls the mean value (5.4 g/L) above the median (3.0 g/L). An extreme outlier can be found around 65 g/L residual sugar.

Alcohol

The alcohol content of the wines in the data set ranges between 8 and 15 vol%. The median lies around 10 vol%. The distribution is rather wide and shows positive skewing.

Density

The density parameter shows a very narrow distribution with low variation. One can see a few outliers around 1.01 and 1.04 g/cm3 but most wines have a density between 0.99 and 1.00 g/cm3.

Chlorides

The histogram showing the chloride concentration in the data set has two distinct main peaks. The most frequent chloride concentrations can be found around 0.04 g/L. The second peak appears at about 0.08 g/L. The distribution has a very long tail in the positive direction with outliers up to 0.6 g/L.

Univariate Analysis

After investigating all features individually it is time to note down some first findings and insights.

Main observations:

Most distributions encountered during the exploration of the parameters looked unremarkable. In general, they were positively skewed with a narrow main peak. Total sulfur dioxide and chloride concentration showed bimodal distributions. Density showed the least variation, so I consider it a less interesting feature to explore further.

I expect density to correlate negatively with alcohol content, since ethanol has a lower density than water. I also expect alcohol content to correlate negatively with residual sugar concentration, as sugar is converted to alcohol during fermentation. pH is a measure of the acidity of a solution, so I predict it to correlate negatively (the more acid, the lower the pH) with fixed and volatile acidity as well as citric acid concentration.

Based on these thoughts, I think the main features of interest for predicting wine quality are the different acidity parameters, the variables describing sulfur content, alcohol or sugar content, and the chloride concentration.

I also added two new variables to the data frame during univariate exploration. The first one is the total fixed acidity, which is the sum of fixed acidity (tartaric acid concentration) and citric acid concentration. The second new parameter is the ratio between free and total sulfur dioxide concentration in the wine. These will be investigated further in the next sections.
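
The two derived variables could be computed as in the sketch below. The column names `total.fixed.acidity` and `so2.ratio` are placeholders, since the names actually used are not shown in this report, and the three rows are made-up values for illustration.

```r
# Made-up example rows using the column names from the summary output.
wines <- data.frame(
  fixed.acidity        = c(7.4, 6.3, 8.1),
  citric.acid          = c(0.00, 0.34, 0.40),
  free.sulfur.dioxide  = c(11, 14, 30),
  total.sulfur.dioxide = c(34, 132, 97)
)

# Total fixed acidity: tartaric acid plus citric acid, both in g/L.
wines$total.fixed.acidity <- wines$fixed.acidity + wines$citric.acid

# Share of the total sulfur dioxide that is present in free form.
wines$so2.ratio <- wines$free.sulfur.dioxide / wines$total.sulfur.dioxide

print(wines)
```

Because free sulfur dioxide is part of the total, the ratio should always fall between 0 and 1, which makes it a convenient sanity check on the data.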

Bivariate Plots

In this section we are going to create bivariate plots to investigate the relationships between two different parameters.

Correlation Matrix

I compared all the variables in the correlation matrix above. I was interested in the variables that influenced the wine quality as well as relationships between the features themselves. These were my main findings:

  • Density and quality have a correlation coefficient of -0.3.
  • Alcohol content correlates positively with wine quality.
  • Quality correlates only weakly with chloride concentration and with the sulfur dioxide ratio.
  • Wine quality also correlates slightly negatively with volatile acidity.
  • There seems to be a connection between sulfur dioxide and residual sugar in wines.
  • Alcohol content and residual sugar concentration show the expected influence on wine density.
  • There appears to be a relationship between sulphates and chlorides.
  • Fixed acidity correlates with total fixed acidity, as the former is part of the latter.
  • The same holds true for free sulfur dioxide, total sulfur dioxide and the sulfur dioxide ratio.
  • Color shows relationships with density, residual sugar, total sulfur dioxide and volatile acidity.

I am going to take a closer look at some of these findings in the next sections.
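
A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below runs on simulated stand-in data (the relationships and coefficients in it are invented), and it recodes color as a 0/1 indicator so that it can be included, which is one possible choice rather than necessarily the one made for the matrix discussed here.

```r
set.seed(42)
n <- 200

# Simulated stand-in data; the coefficients are invented for illustration.
alcohol <- runif(n, 8, 14)
density <- 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
quality <- round(3 + 0.3 * alcohol + rnorm(n))
color   <- sample(c("red", "white"), n, replace = TRUE)

wines <- data.frame(alcohol, density, quality, color)

# cor() needs numeric input, so recode color as an indicator variable.
wines$is.red <- as.numeric(wines$color == "red")
numeric_cols <- wines[sapply(wines, is.numeric)]

# Pearson correlation coefficients, rounded for readability.
corr <- round(cor(numeric_cols), 2)
print(corr)
```

Pearson coefficients only capture linear relationships, so weak values in such a matrix do not rule out non-linear dependencies.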

Parameters that Influence Wine Quality

Density vs. Quality

The plot above shows the wine densities grouped by their quality ratings. Outliers were omitted from the display but are included in the calculations. The blue line connects the median values of the different quality groups, while a linear trendline was added in red. We can observe a negative trend between density and quality, but because density varies widely within each quality group, I do not expect density to be a good predictor of wine quality.
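
The two lines in such a plot correspond to two simple calculations: a median per quality group and a least-squares fit over all observations. A sketch on simulated data follows (the values are invented; only the column names follow the summary output).

```r
set.seed(1)
n <- 300

# Simulated stand-in data with a weak negative density-quality trend.
quality <- sample(3:9, n, replace = TRUE)
density <- 0.998 - 0.0008 * quality + rnorm(n, sd = 0.003)
wines   <- data.frame(quality, density)

# Median density per quality group (the points joined by the blue line).
medians <- aggregate(density ~ quality, data = wines, FUN = median)

# Least-squares trend over all observations (the red line).
trend <- lm(density ~ quality, data = wines)

print(medians)
print(coef(trend))
```

Comparing the slope of the trend with the spread inside each group gives a rough feel for how useful the variable is as a predictor.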

We get a similar picture when we visualize the relationship between quality and density using boxplots. Wines with a lower density tend to have better quality but density varies in similar windows within all quality groups.

Alcohol vs. Quality

The boxplot shows that wines of higher quality tend to have a higher alcohol content. But the relationship does not seem very strong, as the boxes are wide and overlap between the categories.

Chlorides vs. Quality

Wines with lower chloride concentrations tend to be of better quality but the effect seems very weak. The boxes are wide and one can see a lot of outliers for the mid quality wines.

Volatile Acidity vs. Quality

Again we can only see a very weak negative correlation in the visualization of acetic acid concentration versus wine quality.

Total Sulfur Dioxide vs. Residual Sugar

For a large portion of wines, total sulfur dioxide concentration does not seem to be affected by the residual sugar. We can see that wines span the whole range of sulfur dioxide concentrations at low residual sugar levels. But it looks like wines with high residual amounts of sugar also often have high total sulfur dioxide concentrations.

Influence of Alcohol and Sugar on Wine Density

Alcohol content as well as residual sugar concentration show the expected influence on wine density. Higher alcohol fraction lowers wine density while more residual sugar increases it. Sugar has a higher density than water and thus increases the density of the mixture while alcohol does the opposite. It looks like there are two different regimes and trends in the plot of density versus sugar. One includes wines with low sugar concentrations but a wide variety in density while the other shows a somewhat linear relationship for higher sugar contents. This might be attributed to a third parameter and we will have to investigate this finding further in a later section.

Variations Based on Color

Color and Density

In the density distribution plot above one can see that white and red wines vary in density. Red wines show a rather narrow distribution with a median value around 0.997 g/cm3. White wines on the other hand show a much wider variation in density but their median lies below the one of red wines around 0.993 g/cm3.

Color and Residual Sugar

Residual sugar concentration is lower for red wines than for white ones and its distribution is narrower. This finding matches my expectation, as white wines in general taste sweeter.

Color and Total Sulfur Dioxide

The histogram above shows the relationship between wine color and total sulfur dioxide concentration. The solid lines denote median values while the dashed ones represent the first and third quartiles. Most red wines have a low total concentration of sulfur dioxide, while white wines show a symmetrical distribution around their higher median of 130 mg/L.

Color and Fixed Sulfur Dioxide

Looking at the difference between total and free sulfur dioxide, that is, the fixed (bound) sulfur dioxide, shows that this fraction accounts for most of the difference between the wine colors.

Color and Volatile Acidity

The distribution of volatile acidity shows its main peak at a lower concentration for white than for red wines. While the curve for white wines is narrow with a positive skew, the distribution curve for red wines is broad and shows signs of bimodality.

Bivariate Analysis

In general, I expected more factors to correlate with wine quality. I am especially surprised that individual ingredients such as the different acid and sulfur compounds do not correlate more strongly with the quality ratings.

In general I would not consider a correlation with an absolute value below 0.3 meaningful, but nevertheless I identified a few parameters close to this level that seem to influence wine quality at least a bit.

These are: density, alcohol content, chloride concentration and volatile acidity. Of these parameters, alcohol fraction and density showed the strongest correlation with wine quality which is surprising to me as I expected chemical composition to be more important.

Since I have not found chemical composition to significantly influence wine quality in the bivariate part of my exploration, I am going to look at more complicated ingredient combinations in the next section of the report.

As the density of a mixture depends on the densities of its individual components, I want to further investigate wine density based on alcohol content, residual sugar and fixed acidity, as these are the main components of the wines by mass and volume. Also, density and residual sugar concentration showed two different trend regimes. I am going to inspect this behaviour further and expect wine color to be the driving factor behind this observation.

I also found interesting relationships between wine color and sulfur dioxide content as well as between color and volatile acidity. Since volatile acidity also affects wine quality, we have to check how color affects that relationship and whether quality is affected by the same parameters for red and white wines.

Multivariate Plots

Color and Wine Quality

First of all, I want to check whether all the parameters I have identified as influencing wine quality do so independently of wine color.

Density vs. Quality

For both wine groups quality correlates negatively with density, but the effect appears to be a little stronger for white wines.

Alcohol vs. Quality

Also for alcohol content we see the same trend for both wine colors.

Chlorides vs. Quality

White wines in general have lower chloride concentrations than red wines. Both groups show a slightly negative trend between chloride concentration and quality rating, but the effect is very weak and there are a lot of outliers among the medium quality wines.

Volatile Acidity vs. Quality

Acetic acid concentration strongly affects the quality of red wines but seems to have no significant correlation with the score of white ones.

Ingredients and Wine Quality

In the following sections I am going to look at ingredient combinations and see if any correlation with wine quality can be found.

Chlorides and Sulphates

In the graphs above one can see that red wines tend to have higher chloride and sulphate concentrations than white ones, but there is no visible relationship between the chloride/sulphate ratio and wine quality.

Sulfur Dioxide Ratio and Volatile Acidity Ratio

High quality red wines have low volatile acidity ratios but a wide variety of free sulfur dioxide contents. White wines are centered around a volatile acidity ratio of about 0.04 and have a slightly lower variation in free sulfur dioxide ratio than red wines. Because of the narrow clustering of the data points it is difficult to see any influence of these two factors on quality for white wines.

In the plot for the red wines we can see a line of data points at a free sulfur dioxide ratio of 0.5. We already observed this behaviour in the univariate analysis of this ratio. This might be attributed to the method used for measuring sulfur dioxide concentrations.

Volatile Acidity and Alcohol

For both wine groups good quality goes hand in hand with low volatile acidity concentrations and high alcohol contents. As already seen during the bivariate analysis, white wines in general have lower volatile acidity contents.

Main Factors Influencing Wine Quality

Even though I have not found any new factors influencing wine quality, I want to visually summarize the three parameters that influence quality the most in this section.

Alcohol and Volatile Acidity

The plot above shows the positive correlation between alcohol and quality for red and white wines. A slight negative correlation between quality and volatile acidity is visible in the red wine plot but hard to see for white wines, as they tend to have lower acetic acid contents.

Alcohol and Chlorides

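library(ggplot2)

# Drop the top 1 % of chloride values so outliers do not wash out the color scale.
ggplot(subset(wines, chlorides < quantile(chlorides, 0.99)),
       aes(x = alcohol, y = quality, color = chlorides)) +
  # Jitter the points because quality only takes integer values.
  geom_jitter(size = 1, alpha = 0.8) +
  scale_colour_gradient(low = '#c9f42d', high = '#080118') +
  # One panel per wine color.
  facet_wrap(~color) +
  # Zoom to the central 98 % of alcohol values without dropping data.
  coord_cartesian(xlim = c(quantile(wines$alcohol, 0.01),
                           quantile(wines$alcohol, 0.99))) +
  labs(x = 'Alcohol Content [vol%]', y = 'Quality')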

Here one can come to very similar conclusions as in the previous section. Higher alcohol content goes along with higher wine quality, while higher chloride concentration goes along with lower quality. Again, white wines contain lower chloride concentrations and the effect is mainly visible in the red wine plot.

Investigating Wine Density

The density of a mixture is determined by the densities of its individual components. As alcohol is, after water, the main component by volume, I expect it to influence density the most. The goal of this section is to build a simple model that predicts wine density based on alcohol, sugar and fixed acidity content.

Density by Alcohol and Residual Sugar

Since residual sugar concentration contains an extreme outlier above 60 g/L, I excluded it from the visualisation to get a more informative color gradient. One can clearly see the expected behaviour in this plot: alcohol drives density down while sugar increases it. The linear relationship between density and alcohol is also clearly visible in the plots above. As red wines in general have lower sugar levels than white ones, alcohol is the main factor that influences their density. This is also the reason for the two different trends we have seen earlier when we plotted wine density versus residual sugar concentration. In red wines, density is controlled by ingredients other than sugar.

This plot displays the behaviour mentioned above about the two different density regimes.

Density by Alcohol and Fixed Acidity

Wines with high fixed acidity concentrations have higher densities, and white wines in general have lower tartaric acid concentrations. This observation supports the hypothesis that red wine density is dominated by alcohol and fixed acid content while white wine density depends mainly on alcohol and the amount of sugar.

Predicting Density

We identified alcohol, residual sugar and fixed acidity as the main contributors to wine density. Let us put them in a linear model and see how it performs.

## 
## Calls:
## m1: lm(formula = I(density) ~ I(alcohol), data = wines)
## m2: lm(formula = I(density) ~ I(alcohol) + residual.sugar, data = wines)
## m3: lm(formula = I(density) ~ I(alcohol) + residual.sugar + fixed.acidity, 
##     data = wines)
## 
## =========================================================
##                       m1           m2           m3       
## ---------------------------------------------------------
##   (Intercept)      1.01281***   1.00828***   0.99844***  
##                   (0.00024)    (0.00024)    (0.00021)    
##   I(alcohol)      -0.00173***  -0.00141***  -0.00123***  
##                   (0.00002)    (0.00002)    (0.00002)    
##   residual.sugar                0.00022***   0.00027***  
##                                (0.00001)    (0.00000)    
##   fixed.acidity                              0.00106***  
##                                             (0.00001)    
## ---------------------------------------------------------
##   R-squared            0.472        0.579        0.784   
##   adj. R-squared       0.472        0.579        0.783   
##   sigma                0.002        0.002        0.001   
##   F                 5797.273     4464.270     7836.772   
##   p                    0.000        0.000        0.000   
##   Log-likelihood   30598.875    31336.328    33498.633   
##   Deviance             0.031        0.025        0.013   
##   AIC             -61191.750   -62664.655   -66987.265   
##   BIC             -61171.413   -62637.539   -66953.370   
##   N                 6497         6497         6497       
## =========================================================
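
A comparison like the one above could be produced by fitting nested models with `lm()`. The sketch below uses simulated data with invented coefficients, so its numbers will not match the table; only the column names and the order in which predictors are added follow the models shown.

```r
set.seed(7)
n <- 500

# Simulated stand-in data; coefficients are invented for illustration.
wines <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = rexp(n, rate = 0.2),
  fixed.acidity  = rnorm(n, mean = 7, sd = 1.2)
)
wines$density <- 1.0 - 0.0012 * wines$alcohol +
  0.0003 * wines$residual.sugar +
  0.0010 * wines$fixed.acidity +
  rnorm(n, sd = 0.001)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(density ~ alcohol, data = wines)
m2 <- update(m1, . ~ . + residual.sugar)
m3 <- update(m2, . ~ . + fixed.acidity)

# R-squared can only stay equal or grow as predictors are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
print(round(r2, 3))
```

Because plain R-squared never decreases when a predictor is added, the adjusted R-squared, AIC and BIC rows of the table are the fairer basis for comparing the models.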

Using all three parameters in the model we can account for about 78 % of the variance in the density of wines.
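
As a quick plausibility check, the m3 coefficients from the table can be applied by hand to a hypothetical wine whose alcohol, residual sugar and fixed acidity equal the data set medians from the summary statistics (10.3 vol%, 3.0 g/L and 7.0 g/L). The function name `predict_density` is just an illustrative helper.

```r
# Coefficients of model m3, copied from the regression table above.
m3_coef <- c(intercept      = 0.99844,
             alcohol        = -0.00123,
             residual.sugar = 0.00027,
             fixed.acidity  = 0.00106)

# Illustrative helper: evaluates the fitted linear equation directly.
predict_density <- function(alcohol, residual.sugar, fixed.acidity) {
  m3_coef[["intercept"]] +
    m3_coef[["alcohol"]] * alcohol +
    m3_coef[["residual.sugar"]] * residual.sugar +
    m3_coef[["fixed.acidity"]] * fixed.acidity
}

# Median values taken from the summary statistics.
pred <- predict_density(alcohol = 10.3, residual.sugar = 3.0,
                        fixed.acidity = 7.0)
print(round(pred, 4))
```

This evaluates to roughly 0.9940 g/cm3, close to the median density of 0.9949 g/cm3 reported in the summary statistics.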

By adding the next two most common wine ingredients (sulphates and citric acid) to the model, the explained variance can be increased to about 85 %.

## 
## Calls:
## m4: lm(formula = I(density) ~ I(alcohol) + residual.sugar + fixed.acidity + 
##     sulphates, data = wines)
## m5: lm(formula = I(density) ~ I(alcohol) + residual.sugar + fixed.acidity + 
##     sulphates + citric.acid, data = wines)
## 
## ============================================
##                       m4           m5       
## --------------------------------------------
##   (Intercept)      0.99672***   0.99634***  
##                   (0.00019)    (0.00019)    
##   I(alcohol)      -0.00121***  -0.00117***  
##                   (0.00001)    (0.00001)    
##   residual.sugar   0.00029***   0.00031***  
##                   (0.00000)    (0.00000)    
##   fixed.acidity    0.00092***   0.00103***  
##                   (0.00001)    (0.00001)    
##   sulphates        0.00455***   0.00451***  
##                   (0.00011)    (0.00010)    
##   citric.acid                  -0.00283***  
##                                (0.00011)    
## --------------------------------------------
##   R-squared            0.829        0.845   
##   adj. R-squared       0.829        0.845   
##   sigma                0.001        0.001   
##   F                 7849.220     7061.258   
##   p                    0.000        0.000   
##   Log-likelihood   34257.170    34576.588   
##   Deviance             0.010        0.009   
##   AIC             -68502.339   -69139.176   
##   BIC             -68461.664   -69091.722   
##   N                 6497         6497       
## ============================================

Multivariate Analysis

By further investigating the insights gained during the bivariate analysis I could deepen my understanding of the data set.

First of all, I was able to confirm that density, alcohol, chlorides and volatile acidity are related to wine quality for both colors. The effects of chlorides and volatile acidity were more visible for red wines, while the density trend appeared slightly stronger for white wines.

Next I looked at ingredient combinations and their influence on wine quality. The different combinations clearly showed variations between red and white wine compositions. For example, red wines contain high chloride and sulphate concentrations while white wines can be characterized by a low volatile acidity content. Nevertheless, I was not able to find any combinations that affected wine quality. I also observed a set of strange data points for the sulfur dioxide concentration in red wines. Several wines had a free sulfur dioxide ratio of exactly 50 %. To find out the reason for that, one would have to further investigate how the data set was acquired. It might be attributed to the chosen measuring technique.

In the last part I created a simple regression model to predict wine density. I found a negative relationship between density and alcohol content and a positive relationship between density and residual sugar, fixed acidity and sulphates, whereas citric acid entered the final model with a negative coefficient. The best model accounted for about 85 % of the variance in wine density. The biggest contributors to this result were alcohol, fixed acidity and sugar concentration.


Summary Plots

Plot One

The quality distributions of red and white wines look similar, although their peaks differ slightly: most red wines achieved a quality score of 5 while the white wine distribution peaks at 6.

Plot Two

Plot number two summarizes the three parameters that influence wine quality the most. Alcohol was found to be the most important one and is therefore displayed on the x-axis. Volatile acidity and chloride concentration are added to the plot through the color scale of the data points.

For both wine colors one can see a positive correlation between quality and alcohol content. This effect seems stronger for red wines as there is a large number of white wines with low alcohol content but high quality ratings. Volatile acidity shows a slight negative correlation with wine quality and so does chloride concentration. These effects are mainly visible for red wines as they tend to have higher concentrations of the two components.

Plot Three

The two visualisations above show the linear correlations that have been found for wine density. The density of a mixture is a combination of the densities of its individual components. Components with a low density (lower than that of water, which is the main component of wine) such as alcohol decrease the mixture density, while sugar and tartaric acid have the opposite effect. These plots also suggested that a linear regression model should be able to predict wine density based on component concentrations.


Reflection

Combined, the two data sets contain information on 6497 red and white wines. The columns contain information on physicochemical properties of the wines as well as a quality score based on tastings by experts. I started my exploration by looking at the variables separately in a univariate analysis section and used the insights gained to study relationships between two or more variables simultaneously. In the end, I was able to identify the parameters that influence wine quality the most and derive a simple linear regression model to predict wine density based on the concentrations of its main ingredients.

I was surprised that wine quality was not influenced more strongly by the different parameters in the data set. I found only medium to weak correlations between quality and alcohol content, density, chloride concentration and volatile acidity, with alcohol content showing the strongest one. Even a closer look at combinations of different ingredients did not reveal more relationships with quality. I also found that these correlations are more or less independent of wine color, but red and white wines contain different amounts of the same ingredients. Red wines have higher volatile acidity, chloride and sulphate concentrations, while white wines contain more total sulfur dioxide and much more residual sugar.

Based on the simple assumption that the density of a mixture can be calculated as a linear combination of the densities of its individual components, I created a simple regression model to predict wine density. Using the five components with the highest concentrations, the model was able to account for about 85 % of the variance in wine densities. It has to be mentioned here that this approach only makes sense for ideal solutions. In ideal solutions, the volume of the mixture is equal to the sum of the volumes of its individual components. But this is not the case for water and ethanol. So a linear model is only a rough approximation for wine densities.

For future investigation of the data set I would consider a few ideas that came to mind during the exploration. First of all, I saw that no single parameter influences wine quality strongly. So far I would conclude that wine quality mainly depends on the personal taste of the rating experts. One should therefore reconsider the parameter selection and record new ones, such as grape type and year of production, or even monitor drinking conditions like temperature or humidity. Since wine quality is an ordinal score with only a few levels, plain regression models are of limited use for prediction, but one could try to find a classification model that does the job based on the new parameters found to correlate with wine quality. Furthermore, one should probably take a closer look at red and white wines individually. Exploring the combined data set showed me that their compositions are very different, and it might take two different approaches to predict wine quality for the two colors.


  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.