Red Wine Quality by Alex Raya

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

This is a Udacity dataset on the chemical properties of red wine. It has 11 chemical variables, along with a quality column from 3 judges that rates each wine from 0 (very bad) to 10 (excellent). This exploratory analysis aims to better understand which chemical properties have the most impact on the quality of red wine.

Univariate Plots Section

This is an initial plot displaying the distribution of fixed acidity in the dataset.

This plot displays the distribution of volatile acidity, which looks fairly similar in shape to the distribution of fixed acidity.

This plot displays the distribution of citric.acid in the dataset, which shows many values equal to 0.

This plot for residual sugar shows a long right tail: many outlier values beyond about 4 skew the distribution.

The plot for chlorides shows a pattern similar to residual sugar, with many outlier values skewing the distribution; a limit of about 0.15 is a reasonable cutoff for excluding them.

The plots for free and total sulfur dioxide are also skewed by outliers, beyond about 40 for free and about 150 for total sulfur dioxide.

The plot for density displays an approximately normal distribution, but the spread of values is very small.

The plot for pH also displays an approximately normal distribution, with enough spread in the values to warrant analysis.

The plot for sulphates displays large outliers like the other plots, beyond a limit of about 1.

This plot for alcohol displays most values being between about 8 and 14.

The plot for quality reveals an interesting aspect of this dataset. Despite the scale of 0 to 10, quality is almost entirely concentrated at the values 5 and 6, with some wines rated 7, very few rated 8 or lower than 5, and none rated higher than 8. This skew may affect the outcome of the analysis.

Univariate Analysis

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

##  fixed.acidity   volatile.acidity citric.acid     residual.sugar 
##  Mode :logical   Mode :logical    Mode :logical   Mode :logical  
##  FALSE:1599      FALSE:1599       FALSE:1599      FALSE:1599     
##  chlorides       free.sulfur.dioxide total.sulfur.dioxide  density       
##  Mode :logical   Mode :logical       Mode :logical        Mode :logical  
##  FALSE:1599      FALSE:1599          FALSE:1599           FALSE:1599     
##      pH          sulphates        alcohol         quality       
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:1599      FALSE:1599      FALSE:1599      FALSE:1599

What is the structure of your dataset?

This dataset has 1,599 entries for red wine with no null values. Each column holds one chemical composition measurement, and some of these have outlier values that skew the data. Notably, most wines have a quality of 5 or 6, which is narrow given the 0-10 range for quality values.
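As a rough illustration of the missing-value check summarised above, the sketch below counts NA values per column. It does not load the full dataset; `wine_head` is a hypothetical name for a small data frame built from the six rows printed at the top of this report, with only a few of the columns.

```r
# Six rows copied from the head() output at the top of this report
# (only a few columns are kept; `wine_head` is an illustrative name).
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Count missing values per column; all zeros means no NAs in that column.
na_counts <- colSums(is.na(wine_head))
print(na_counts)

# Rows and columns in this small sample.
print(dim(wine_head))
```

Applied to the full data frame, the same `colSums(is.na(...))` call would give one count per column rather than the TRUE/FALSE tallies shown in the summary output.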

What is/are the main feature(s) of interest in your dataset?

The main interest is in analyzing the individual factors to see which have the most significant impact on the quality of the wine. However, the narrow range of quality values in the dataset may limit how much substantive insight the analysis can provide into how chemical composition affects wine quality overall.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Because there are numerous variables to analyze, looking first for individual effects on quality and then using multivariate analysis, box plots, or facet wrapping may provide decent insight into the main factors affecting quality.

Did you create any new variables from existing variables in the dataset?

No. There are already plenty of factors to examine, so new variables would probably only complicate this analysis.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Several factors, such as residual sugar, chlorides, sulfur dioxide and sulphates, had outliers that could skew plots of those values, so they require attention when creating plots. The skew in quality also has to be considered when analyzing relationships.
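One way to sanity-check the outlier limits read off the plots is the common 1.5 x IQR rule. The sketch below applies it to the quartiles printed in the summary output above. The rule is an assumption brought in for illustration, not the method used to pick the limits in this report.

```r
# First and third quartiles copied from the summary() output above.
quartiles <- data.frame(
  variable = c("residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "sulphates"),
  q1 = c(1.9, 0.07, 7, 22, 0.55),
  q3 = c(2.6, 0.09, 21, 62, 0.73)
)

# Upper fence under the 1.5 * IQR convention: Q3 + 1.5 * (Q3 - Q1).
quartiles$upper_fence <- quartiles$q3 + 1.5 * (quartiles$q3 - quartiles$q1)
print(quartiles)
```

The resulting fences are in the same neighbourhood as the limits chosen by eye for residual sugar, free sulfur dioxide and sulphates, and somewhat tighter for chlorides and total sulfur dioxide.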

Bivariate Plots Section

## Fixed acidity and Quality correlation: 0.124051649113224

## Volatile acidity and Quality correlation: -0.390557780264007

## Citric acid and Quality correlation: 0.226372514318041

Fixed acidity does not appear to be much of a factor in quality. Volatile acidity seems to have some correlation, confirmed by a Pearson coefficient of -0.39. Citric acid also appears to correlate with quality, but more weakly, with a Pearson coefficient of 0.23.

## Residual sugar and Quality correlation: 0.0137316373400663

Residual sugar does not appear to be much of a factor in quality; its Pearson coefficient of 0.01 indicates essentially no linear correlation.

## Chlorides and Quality correlation: -0.128906559930053

Chlorides do not appear to correlate with quality; the Pearson coefficient is weak at -0.13.

## Free sulfur dioxide and Quality correlation: -0.0506560572442764

## Total sulfur dioxide and Quality correlation: -0.185100288926538

Free and total sulfur dioxide show a peculiar spiking pattern against quality. Although total sulfur dioxide has some correlation with quality (-0.19 by the Pearson calculation), both will be set aside because of the shape of their distributions and their weak correlations.

## Density and Quality correlation: -0.174919227783349

Density appears to have some correlation with quality, but it is weak at -0.17.

## pH and Quality correlation: -0.0577313912053821

pH does not appear to correlate with quality; its Pearson coefficient of -0.06 indicates essentially no linear correlation.

## Sulphates and Quality correlation: 0.251397079069261

Sulphates appear to correlate positively with quality to some degree; the Pearson coefficient of 0.25 is weak but not negligible.

## Alcohol and Quality correlation: 0.476166324001136

Alcohol appears to have a positive correlation with quality, and at 0.48 by the Pearson calculation it is the strongest of any variable.
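The correlation lines above have a consistent format, which a small helper could produce. This is only a sketch: `report_cor` and `pearson_manual` are hypothetical names, and the data is the six-row sample from the top of the report, so the value it prints will not match the full-dataset coefficients reported here.

```r
# Six rows copied from the head() output at the top of this report.
wine_head <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  quality   = c(5, 5, 5, 6, 5, 5)
)

# Hypothetical helper: compute the Pearson coefficient of a column against
# quality and print it in the same style as the output lines in this report.
report_cor <- function(data, x, label) {
  r <- cor(data[[x]], data[["quality"]], method = "pearson")
  message(label, " and Quality correlation: ", r)
  invisible(r)
}

# The same coefficient written out from its definition, as a cross-check.
pearson_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_alcohol <- report_cor(wine_head, "alcohol", "Alcohol")
r_check   <- pearson_manual(wine_head$alcohol, wine_head$quality)
```

With only six rows and almost no variation in quality, the sample coefficient is not meaningful in itself; the point is the shape of the calculation.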

## `geom_smooth()` using method = 'gam'

## [1] 0.6717034

Citric acid and fixed acidity have a strong correlation with each other of 0.67.

## `geom_smooth()` using method = 'gam'

## Citric acid and volatile acidity correlation: -0.55249568455958

Citric acid also has a decent correlation with volatile acidity of -0.55.

## Fixed acidity and volatile acidity correlation: -0.256130894770382

Fixed acidity and volatile acidity have a Pearson correlation coefficient of only -0.26. It is interesting that citric acid correlates fairly strongly with both of them, yet fixed and volatile acidity correlate only weakly with each other.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The variables with decent correlations to quality are volatile acidity, citric acid, sulphates and alcohol. Apart from volatile acidity and alcohol, these correlations are rather weak (below +/- 0.3), so I am using a threshold of +/- 0.2 to decide which variables to carry forward, which brings in citric acid and sulphates as well.
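The selection rule described above can be written down directly. The sketch below uses the Pearson coefficients reported in the bivariate section, rounded to three decimals; `quality_cor` and `threshold` are illustrative names.

```r
# Pearson correlations with quality, copied (rounded) from the output above.
quality_cor <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476
)

# Keep variables whose absolute correlation exceeds the chosen threshold,
# ordered from strongest to weakest.
threshold <- 0.2
selected <- quality_cor[abs(quality_cor) > threshold]
selected <- selected[order(abs(selected), decreasing = TRUE)]
print(selected)
```

This keeps alcohol, volatile acidity, sulphates and citric acid, in that order of strength.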

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Citric acid correlates strongly with fixed acidity and moderately with volatile acidity, which is not really surprising since all three are measurements of acidity to some degree. What is interesting is that fixed and volatile acidity do not correlate strongly with each other.

What was the strongest relationship you found?

The strongest relationship with quality, and the most surprising one, was alcohol. Since alcohol is such a basic component of wine, I did not initially expect alcohol content by itself to be such a strong factor in how a wine is rated; I would have expected other factors to do more to differentiate wines.

## `geom_smooth()` using method = 'loess'

## Alcohol and volatile acidity correlation: -0.202288027153256

This scatter plot with a smooth trend line shows alcohol versus volatile acidity by quality level. Alcohol and volatile acidity have some correlation, and as wine quality increases, volatile acidity decreases at nearly all alcohol levels.

## `geom_smooth()` using method = 'loess'

## Alcohol and fixed acidity correlation: -0.0616682706281511

This scatter plot with a smooth trend line shows alcohol versus fixed acidity by quality level. The relationship is the opposite of the one for volatile acidity: higher quality wines had higher fixed acidity, regardless of alcohol content. Alcohol and fixed acidity themselves, however, show almost no correlation.

## `geom_smooth()` using method = 'loess'

## Alcohol and citric acid correlation: 0.109903246641567

This scatter plot with a smooth trend line shows alcohol versus citric acid by quality level. Higher quality wines had higher concentrations of citric acid, although alcohol and citric acid otherwise have little correlation with each other.

## `geom_smooth()` using method = 'loess'

## Citric acid and volatile acidity correlation: -0.55249568455958

This scatter plot with a smooth trend to look at citric acid versus volatile acidity by quality level shows a strong negative correlation between the two.

## `geom_smooth()` using method = 'loess'

## Citric acid and fixed acidity correlation: 0.671703434764106

Similar to the last plot, this scatter plot with a smooth trend line shows citric acid versus fixed acidity by quality level. The correlation here is even stronger, and positive.

## `geom_smooth()` using method = 'loess'

## Alcohol and sulphates correlation: 0.0935947504104674

This scatter plot with a smooth trend to look at alcohol versus sulphates by quality level shows that higher quality wines had higher sulphates at every alcohol level, with the wines rated 8 exhibiting odd compositional trends as the exception.

Based on these multivariate plots, we can now begin to show which chemicals factor most into the quality of the wine: volatile acidity, alcohol, sulphates and citric acid.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

In the multivariate analysis, a more succinct picture began to emerge of which specific variable relationships affect quality. Alcohol and volatile acidity have a correlation that is not strong, but higher quality wines had lower volatile acidity at all alcohol levels up to about 12% alcohol by volume. Alcohol and fixed acidity had almost no correlation, but higher quality wines had higher fixed acidity. Alcohol and citric acid had almost no correlation, but higher quality wines had higher citric acid at nearly all alcohol values. Citric acid and volatile acidity had a notable correlation of -0.55, and citric acid and fixed acidity had an even stronger correlation of 0.67; higher quality wines had lower fixed acidity for most citric acid levels. Alcohol and sulphates had almost no correlation, but higher quality wines had higher sulphates.

Were there any interesting or surprising interactions between features?

From the interactions seen in the plots, it is reasonable to assume that alcohol and volatile acidity have an outsized association with quality compared to variables with smaller correlation coefficients. Meanwhile, a factor like fixed acidity, which had little correlation with quality, correlates strongly with citric acid and perhaps other factors, which is peculiar.

library(memisc)  # provides mtable(), used below to compare the models
m1 <- lm(I(quality) ~ I(alcohol), data = subset(pf, quality > 3))
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + sulphates)
m4 <- update(m3, ~ . + citric.acid)
mtable(m1, m2, m3, m4, sdigits = 3)

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = subset(pf, quality > 
##     3))
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = subset(pf, 
##     quality > 3))
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates, 
##     data = subset(pf, quality > 3))
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     citric.acid, data = subset(pf, quality > 3))
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)           1.945***      3.052***      2.566***      2.578***  
##                        (0.169)       (0.181)       (0.191)       (0.197)    
##   I(alcohol)            0.356***      0.313***      0.308***      0.308***  
##                        (0.016)       (0.016)       (0.015)       (0.015)    
##   volatile.acidity                   -1.255***     -1.093***     -1.109***  
##                                      (0.095)       (0.096)       (0.112)    
##   sulphates                                         0.682***      0.687***  
##                                                    (0.098)       (0.100)    
##   citric.acid                                                    -0.027     
##                                                                  (0.102)    
## ----------------------------------------------------------------------------
##   R-squared             0.235         0.311         0.331         0.331     
##   adj. R-squared        0.234         0.310         0.330         0.330     
##   sigma                 0.685         0.650         0.640         0.641     
##   F                   487.453       357.847       261.806       196.257     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1651.582     -1568.495     -1544.618     -1544.583     
##   Deviance            743.785       669.931       650.097       650.069     
##   AIC                3309.165      3144.990      3099.237      3101.167     
##   BIC                3325.277      3166.473      3126.091      3133.392     
##   N                  1589          1589          1589          1589         
## ============================================================================
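To make the comparison of the four models explicit, the fit statistics from the table above can be placed in a data frame and ranked. This is a minimal sketch using the printed values only; `fit_stats` is an illustrative name, and lower AIC or BIC is taken as better by the usual convention.

```r
# Fit statistics copied from the mtable() output above.
fit_stats <- data.frame(
  model  = c("m1", "m2", "m3", "m4"),
  adj_r2 = c(0.234, 0.310, 0.330, 0.330),
  aic    = c(3309.165, 3144.990, 3099.237, 3101.167),
  bic    = c(3325.277, 3166.473, 3126.091, 3133.392)
)

# Pick the model with the lowest AIC and the lowest BIC.
best_aic <- fit_stats$model[which.min(fit_stats$aic)]
best_bic <- fit_stats$model[which.min(fit_stats$bic)]

# Change in adjusted R-squared from adding each successive term.
adj_r2_gain <- diff(fit_stats$adj_r2)

print(fit_stats)
print(c(best_aic = best_aic, best_bic = best_bic))
print(adj_r2_gain)
```

On these numbers m3 comes out ahead on both criteria, and adding citric acid in m4 yields no gain in adjusted R-squared, which is consistent with its small coefficient relative to its standard error in the table.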

pf[444,]

##     fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 444            10             0.44        0.49            2.7     0.077
##     free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 444                  11                   19  0.9963 3.23      0.63
##     alcohol quality
## 444    11.6       7

thisredwine = data.frame(alcohol = 10,
                         volatile.acidity = 0.33,
                         sulphates = 1.1,
                         citric.acid = 0.33
                         )

modelEstimate = predict(m4, newdata = thisredwine,
                        interval="prediction", level = .95)

modelEstimate

##        fit      lwr      upr
## 1 6.040043 4.779783 7.300303

message('Difference of model example result and actual: ', (1 - (6.04 / 7)))

## Difference of model example result and actual: 0.137142857142857

This linear model, fitted to the subset of the data where quality > 3, predicts a quality of about 6.04, which is about 14% below the rating of 7 used for comparison. Note that the values passed to predict() (alcohol 10, volatile acidity 0.33, sulphates 1.1, citric acid 0.33) are not those of row 444 (11.6, 0.44, 0.63, 0.49), so the comparison is illustrative rather than a check against that specific wine. The dataset limits any linear model in two main ways: quality is highly concentrated at 5 and 6, and no variable correlates strongly with quality on its own. This shows in the fit: R-squared only reaches 0.331, and adding citric acid in m4 does not improve on m3. Still, the rating of 7 falls within the 95% prediction interval (4.78 to 7.30), so the model has some validity if understood with those reservations.
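The point estimate can be reproduced by hand from the m4 coefficients in the table above, which also makes it easy to try other inputs. The sketch below uses the rounded coefficients as printed, so results agree with predict() only to about two decimals; `predict_quality` is a hypothetical helper, not part of the original analysis. It evaluates both the values in `thisredwine` and the values shown for row 444, which are different.

```r
# m4 coefficients copied (rounded) from the mtable() output above.
coefs <- c(intercept = 2.578, alcohol = 0.308, volatile.acidity = -1.109,
           sulphates = 0.687, citric.acid = -0.027)

# Hypothetical helper: linear prediction from the rounded m4 coefficients.
predict_quality <- function(alcohol, volatile.acidity, sulphates, citric.acid) {
  unname(coefs["intercept"] +
           coefs["alcohol"] * alcohol +
           coefs["volatile.acidity"] * volatile.acidity +
           coefs["sulphates"] * sulphates +
           coefs["citric.acid"] * citric.acid)
}

# Inputs used in `thisredwine` above.
pred_example <- predict_quality(10, 0.33, 1.1, 0.33)

# Inputs taken from row 444 as printed above (actual quality 7).
pred_row444 <- predict_quality(11.6, 0.44, 0.63, 0.49)

# Relative shortfall of each estimate against the rating of 7.
rel_diff_example <- 1 - pred_example / 7
rel_diff_row444  <- 1 - pred_row444 / 7

print(c(example = pred_example, row444 = pred_row444))
print(c(example = rel_diff_example, row444 = rel_diff_row444))
```

Both estimates land a little above 6, so the model underestimates a wine rated 7 in either case, as expected when most of the training data is rated 5 or 6.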

Final Plots and Summary

Plot One

Description One

This plot displays box plots of volatile acidity by quality, with alcohol shown by color. What is worth noticing is that wines have lower volatile acidity and higher alcohol content as quality increases. Volatile acidity and alcohol therefore appear to be related to wine quality.

Plot Two

Description Two

This plot displays box plots of sulphates by quality, with alcohol shown by color. Sulphates and alcohol content both increase as quality increases. Based on this trend, sulphates appear to be related to the quality of the wine.

Plot Three

Description Three

This plot displays box plots of citric acid by quality, with alcohol shown by color. Higher quality wines had higher citric acid and alcohol content. Based on the trend seen here, citric acid appears to be related to the quality of the wine.

Reflection

Even though this was not a complicated dataset, a good amount of work was necessary to understand which variables appeared to have some impact on quality. This was difficult to see initially, since no variable has an immediately visible strong correlation with quality, as seen above in a Spearman correlation coefficient matrix. Initial plots and correlation calculations produced some ideas of how to proceed with the factors that had an acceptable correlation to quality. After those initial plots were done, some insight arose: alcohol, volatile acidity, sulphates, and citric acid were seen to have some correlation with quality and therefore appeared to have an impact on the quality of the wine. When they were analyzed together in multivariate analysis, with some box plots by quality, a relationship began to emerge whereby wines of higher quality had higher alcohol content, lower volatile acidity, higher sulphates and higher citric acid. This insight took some time to arrive at, but proceeding through univariate, bivariate and then multivariate analysis made it possible to establish this assumption.

One issue with this dataset is that wines are mostly concentrated at quality 5 and 6 on a scale of 0 to 10, so any conclusion about which variables affect quality carries the reservation of being drawn from too narrow a range. An improvement that would make this analysis more insightful would be more data covering a larger variety of wine quality. Having more data on both lesser and finer wines could provide more effective insight into which relationships among chemical components have the most impact on quality overall, beyond just wines rated 5-6 out of 10.
