##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
This is a Udacity dataset on the chemical properties of red wine. It has 11 chemical variables plus a quality column, based on ratings from 3 judges, that scores each wine from 0 (very bad) to 10 (excellent). This exploratory analysis aims to identify which chemical properties have the most impact on the quality of red wine.
This is an initial plot displaying the distribution of fixed acidity in the dataset.
This plot displays the distribution of volatile acidity in the dataset, which looks fairly similar to the distribution of fixed acidity.
This plot displays the distribution of citric.acid in the dataset, which shows many values equal to 0.
This plot for residual sugar shows many outlier values, beyond a value of about 4, that skew the data.
This plot for chlorides displays a pattern similar to residual sugar, with many outlier values that skew the data; a limit of about 0.15 would be a good cutoff to eliminate outliers.
These plots for free and total sulfur dioxide also display skews due to outliers, beyond about a limit of 40 for free and 150 for total sulfur dioxide.
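The cutoffs quoted for these skewed variables can be applied before plotting. The following is a minimal sketch in base R, not the code behind the plots above. It uses synthetic right-skewed values as a stand-in for `residual.sugar` (an assumption, since the real data is not reproduced here) together with the cutoff of 4 mentioned in the text; the breaks and distribution parameters are illustrative choices.

```r
set.seed(42)

# Synthetic, right-skewed stand-in for residual.sugar (NOT the wine data)
residual.sugar <- rlnorm(1599, meanlog = log(2.2), sdlog = 0.35)

cutoff <- 4  # cutoff for residual sugar quoted in the text

# Share of observations that the cutoff would exclude
share_above <- mean(residual.sugar > cutoff)

# Keep only values at or below the cutoff and bin them for a histogram
trimmed <- residual.sugar[residual.sugar <= cutoff]
h <- hist(trimmed, breaks = seq(0, cutoff, by = 0.25), plot = FALSE)

share_above
h$counts
```

Reporting `share_above` alongside the trimmed histogram makes it clear how much of the data the cutoff hides.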
The plot for density displays an approximately normal distribution, but the spread of most values is very small.
This plot for pH also displays an approximately normal distribution, with enough variation in the values to warrant analysis.
The plot for sulphates displays large outliers like the other plots, beyond a limit of about 1.
This plot for alcohol displays most values being between about 8 and 14.
The plot for quality shows an interesting aspect of this dataset: despite the scale of 0 to 10, quality is almost entirely concentrated at 5 and 6, with some wines rated 7, very few rated 8 or below 5, and none above 8. This skew may affect the outcome of the analysis.
##  fixed.acidity   volatile.acidity   citric.acid     residual.sugar
##  Min.   : 4.60   Min.   :0.1200     Min.   :0.000   Min.   : 0.900
##  1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090   1st Qu.: 1.900
##  Median : 7.90   Median :0.5200     Median :0.260   Median : 2.200
##  Mean   : 8.32   Mean   :0.5278     Mean   :0.271   Mean   : 2.539
##  3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420   3rd Qu.: 2.600
##  Max.   :15.90   Max.   :1.5800     Max.   :1.000   Max.   :15.500
##  chlorides         free.sulfur.dioxide   total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00         Min.   :  6.00
##  1st Qu.:0.07000   1st Qu.: 7.00         1st Qu.: 22.00
##  Median :0.07900   Median :14.00         Median : 38.00
##  Mean   :0.08747   Mean   :15.87         Mean   : 46.47
##  3rd Qu.:0.09000   3rd Qu.:21.00         3rd Qu.: 62.00
##  Max.   :0.61100   Max.   :72.00         Max.   :289.00
##  density          pH              sulphates        alcohol
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90
##  quality
##  Min.   :3.000
##  1st Qu.:5.000
##  Median :6.000
##  Mean   :5.636
##  3rd Qu.:6.000
##  Max.   :8.000
##  fixed.acidity   volatile.acidity   citric.acid     residual.sugar
##  Mode :logical   Mode :logical      Mode :logical   Mode :logical
##  FALSE:1599      FALSE:1599         FALSE:1599      FALSE:1599
##  chlorides       free.sulfur.dioxide   total.sulfur.dioxide   density
##  Mode :logical   Mode :logical         Mode :logical          Mode :logical
##  FALSE:1599      FALSE:1599            FALSE:1599             FALSE:1599
##  pH              sulphates       alcohol         quality
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:1599      FALSE:1599      FALSE:1599      FALSE:1599
This dataset has 1,599 entries for red wine with no null values and one column for each chemical property. Some of these columns have outliers that skew the data. Notably, most wines have a quality of 5 or 6, which is narrow given the 0-10 range of the quality scale.
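A per-column count of missing values is a compact alternative to reading the logical summary. This is a minimal sketch, assuming a data frame shaped like the one in this analysis; `pf_head` is a hypothetical name and holds only the six rows printed at the top of this report, not the full 1,599 rows.

```r
# The six rows shown by head() at the top of the report (not the full dataset)
pf_head <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity     = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density              = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality              = c(5, 5, 5, 6, 5, 5)
)

# Number of NA values in each column; all zeros means nothing is missing
na_per_column <- colSums(is.na(pf_head))

# Single TRUE/FALSE answer for the whole data frame
any_missing <- anyNA(pf_head)

na_per_column
any_missing
```

On the full data frame the same two calls would confirm the "no null values" statement in one line each.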
Analyzing the individual factors to see which most affect wine quality will be of interest. However, the narrow range of quality values in the dataset may limit how much insight the analysis can provide into how chemical composition affects wine quality overall.
Because there are numerous variables to analyze, multivariable analysis and box plots or facet wrapping may provide decent insight into the main factors affecting quality.
There are plenty of factors to examine, so creating new variables would probably only complicate this analysis.
Several factors, such as residual sugar, chlorides, sulfur dioxide and sulphates, had outliers that could skew plots of those values, so they require attention when creating plots. The skew in quality must also be considered when analyzing relationships.
## Fixed acidity and Quality correlation: 0.124051649113224
## Volatile acidity and Quality correlation: -0.390557780264007
## Citric acid and Quality correlation: 0.226372514318041
Fixed acidity does not appear to be much of a factor in quality (0.12). Volatile acidity shows some negative correlation, confirmed by a Pearson coefficient of -0.39. Citric acid also correlates with quality, but more weakly, with a Pearson coefficient of 0.23.
## Residual sugar and Quality correlation: 0.0137316373400663
Residual sugar does not appear to correlate with quality; its Pearson coefficient of 0.01 is effectively zero.
## Chlorides and Quality correlation: -0.128906559930053
Chlorides do not appear to correlate with quality; the Pearson coefficient is weak at -0.13.
## Free sulfur dioxide and Quality correlation: -0.0506560572442764
## Total sulfur dioxide and Quality correlation: -0.185100288926538
Free and total sulfur dioxide show a peculiar spiking pattern versus quality. Although total sulfur dioxide has some correlation with quality according to the Pearson coefficient (-0.19), both will be ignored because of the shape of their distributions and their weak correlations.
## Density and Quality correlation: -0.174919227783349
Density appears to have some correlation with quality, but it is weak (-0.17).
## pH and Quality correlation: -0.0577313912053821
pH does not appear to correlate with quality; its Pearson coefficient of -0.06 is effectively zero.
## Sulphates and Quality correlation: 0.251397079069261
Sulphates appear to correlate positively with quality to some degree; the Pearson coefficient of 0.25 is modest but not negligible.
## Alcohol and Quality correlation: 0.476166324001136
Alcohol appears to have a positive correlation with quality, and its Pearson coefficient of 0.48 is the strongest of any variable.
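The correlation lines above all follow one pattern: a Pearson coefficient between one variable and quality, printed with a label. A minimal sketch of that pattern follows. The helper name `report_cor` is hypothetical, and `pf_head` contains only the six rows printed at the top of this report, so the values it produces will not match the full-data coefficients.

```r
# Three columns from the six rows shown by head() (not the full dataset)
pf_head <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  quality          = c(5, 5, 5, 6, 5, 5)
)

# Hypothetical helper: Pearson correlation of one column with quality,
# printed in the same style as the output lines in this report
report_cor <- function(data, column, label) {
  r <- cor(data[[column]], data$quality, method = "pearson")
  message(label, " and Quality correlation: ", r)
  invisible(r)
}

r_alcohol  <- report_cor(pf_head, "alcohol", "Alcohol")
r_volatile <- report_cor(pf_head, "volatile.acidity", "Volatile acidity")
```

When a p-value or confidence interval is also wanted, `cor.test()` takes the same two vectors.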
## `geom_smooth()` using method = 'gam'
## [1] 0.6717034
Citric acid and fixed acidity have a strong correlation with each other of 0.67.
## `geom_smooth()` using method = 'gam'
## Citric acid and volatile acidity correlation: -0.55249568455958
Citric acid also has a decent correlation with volatile acidity of -0.55.
## Fixed acidity and volatile acidity correlation: -0.256130894770382
Fixed acidity and volatile acidity have a Pearson correlation coefficient of only -0.26. It is interesting that citric acid correlates fairly strongly with both of them, yet fixed and volatile acidity correlate only weakly with each other.
Variables with decent correlations to quality are volatile acidity, citric acid, sulphates and alcohol. With the exceptions of volatile acidity and alcohol, these correlations are rather weak (below 0.3 in absolute value), so I am proceeding with variables whose correlations exceed 0.2 in absolute value, which brings in citric acid and sulphates as well.
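The 0.2 cutoff can be applied mechanically to the coefficients reported above. In this sketch the values are copied from the correlation output in this report and rounded to three decimals; the vector and variable names are illustrative.

```r
# Pearson correlations with quality, copied from the output above (rounded)
cor_with_quality <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476
)

threshold <- 0.2  # cutoff on the absolute correlation, as chosen in the text

# Keep variables whose absolute correlation exceeds the threshold,
# ordered from strongest to weakest
selected <- cor_with_quality[abs(cor_with_quality) > threshold]
selected <- selected[order(abs(selected), decreasing = TRUE)]

names(selected)
# alcohol, volatile.acidity, sulphates, citric.acid
```

Total sulfur dioxide (-0.185) and density (-0.175) fall just short of the cutoff, which shows how sensitive the selection is to the chosen threshold.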
Citric acid correlates strongly with fixed acidity and moderately with volatile acidity, which is not really surprising since all three are measures of acidity to some degree. What is interesting is that fixed and volatile acidity do not correlate strongly with each other.
The strongest and most surprising relationship was alcohol to quality. Since alcohol is such an elementary component of wine, I did not initially expect alcohol content per se to be such a strong factor in how a wine is rated; I would have thought other factors would do more to differentiate wines.
## `geom_smooth()` using method = 'loess'
## Alcohol and volatile acidity correlation: -0.202288027153256
This scatter plot with a smoothed trend shows alcohol versus volatile acidity by quality level. Alcohol and volatile acidity have a weak correlation (-0.20), and as wine quality increases, volatile acidity decreases at nearly all alcohol levels.
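A plot of this kind can be built with ggplot2 by mapping quality to color and adding a loess smoother per group. This is a sketch under stated assumptions, not the code behind the plot described here: the data is synthetic, the relationship built into it is arbitrary, and the aesthetic choices (`alpha`, `se = FALSE`) are illustrative. It assumes ggplot2 is installed.

```r
library(ggplot2)

set.seed(1)

# Synthetic stand-in data (NOT the wine dataset); column names mirror the report
n <- 300
demo <- data.frame(
  alcohol = runif(n, 8.5, 14),
  quality = sample(5:7, n, replace = TRUE)
)
# Arbitrary relationship, chosen only so the plot has visible structure
demo$volatile.acidity <- 0.9 - 0.02 * demo$alcohol -
  0.05 * demo$quality + rnorm(n, sd = 0.1)

# Scatter plot with one loess trend line per quality level
p <- ggplot(demo, aes(x = alcohol, y = volatile.acidity,
                      colour = factor(quality))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(colour = "quality")

# print(p) draws the plot in an interactive session
```

Converting quality with `factor()` gives each rating its own discrete color and its own trend line instead of a continuous color scale.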
## `geom_smooth()` using method = 'loess'
## Alcohol and fixed acidity correlation: -0.0616682706281511
This scatter plot with a smoothed trend of alcohol versus fixed acidity by quality level shows the opposite relationship to volatile acidity: higher quality wines had higher fixed acidity regardless of alcohol content. Alcohol and fixed acidity themselves show no trend, however (-0.06).
## `geom_smooth()` using method = 'loess'
## Alcohol and citric acid correlation: 0.109903246641567
This scatter plot with a smoothed trend of alcohol versus citric acid by quality level shows that higher quality wines had higher concentrations of citric acid. Alcohol and citric acid otherwise have little correlation (0.11).
## `geom_smooth()` using method = 'loess'
## Citric acid and volatile acidity correlation: -0.55249568455958
This scatter plot with a smooth trend to look at citric acid versus volatile acidity by quality level shows a strong negative correlation between the two.
## `geom_smooth()` using method = 'loess'
## Citric acid and fixed acidity correlation: 0.671703434764106
Similar to the last plot, this scatter plot with a smooth trend to look at citric acid versus fixed acidity by quality level shows a very strong correlation, this one positive.
## `geom_smooth()` using method = 'loess'
## Alcohol and sulphates correlation: 0.0935947504104674
This scatter plot with a smooth trend to look at alcohol versus sulphates by quality level shows that higher quality wines had higher sulphates at every alcohol level, with the wines rated 8 exhibiting odd compositional trends as the exception.
Based on these multivariate plots, we can now begin to show which chemicals factor most into the quality of the wine: volatile acidity, alcohol, sulphates and citric acid.
In the multivariate analysis, a more succinct insight began to emerge as to which variable relationships affect quality. Alcohol and volatile acidity are not strongly correlated, but higher quality wines had lower volatile acidity at all alcohol levels up to about 12% alcohol by volume. Alcohol and fixed acidity had almost no correlation, but higher quality wines had higher fixed acidity. Alcohol and citric acid had almost no correlation, but higher quality wines had higher citric acid at nearly all alcohol levels. Citric acid and volatile acidity had a notable correlation of -0.55, and citric acid and fixed acidity had an even stronger correlation of 0.67; higher quality wines had lower fixed acidity for most citric acid levels. Alcohol and sulphates had almost no correlation, but higher quality wines had higher sulphates.
From the interactions seen in the plots, it appears that alcohol and volatile acidity are disproportionately related to quality compared with variables that have smaller correlation coefficients. Meanwhile fixed acidity, which had little correlation with quality, correlates strongly with citric acid and perhaps other factors, which is peculiar.
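library(memisc)  # provides mtable()

# Nested linear models on wines rated above 3, adding one predictor at a time
m1 <- lm(I(quality) ~ I(alcohol), data = subset(pf, quality > 3))
m2 <- update(m1, ~ . + volatile.acidity)
m3 <- update(m2, ~ . + sulphates)
m4 <- update(m3, ~ . + citric.acid)
mtable(m1, m2, m3, m4, sdigits = 3)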
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = subset(pf, quality >
## 3))
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = subset(pf,
## quality > 3))
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates,
## data = subset(pf, quality > 3))
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates +
## citric.acid, data = subset(pf, quality > 3))
##
## ============================================================================
##                          m1            m2            m3            m4
## ----------------------------------------------------------------------------
##   (Intercept)            1.945***      3.052***      2.566***      2.578***
##                         (0.169)       (0.181)       (0.191)       (0.197)
##   I(alcohol)             0.356***      0.313***      0.308***      0.308***
##                         (0.016)       (0.016)       (0.015)       (0.015)
##   volatile.acidity                    -1.255***     -1.093***     -1.109***
##                                       (0.095)       (0.096)       (0.112)
##   sulphates                                          0.682***      0.687***
##                                                     (0.098)       (0.100)
##   citric.acid                                                     -0.027
##                                                                   (0.102)
## ----------------------------------------------------------------------------
##   R-squared              0.235         0.311         0.331         0.331
##   adj. R-squared         0.234         0.310         0.330         0.330
##   sigma                  0.685         0.650         0.640         0.641
##   F                    487.453       357.847       261.806       196.257
##   p                      0.000         0.000         0.000         0.000
##   Log-likelihood     -1651.582     -1568.495     -1544.618     -1544.583
##   Deviance             743.785       669.931       650.097       650.069
##   AIC                 3309.165      3144.990      3099.237      3101.167
##   BIC                 3325.277      3166.473      3126.091      3133.392
##   N                   1589          1589          1589          1589
## ============================================================================
pf[444,]
##     fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 444            10             0.44        0.49            2.7     0.077
##     free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 444                  11                   19  0.9963 3.23      0.63
##     alcohol quality
## 444    11.6       7
thisredwine = data.frame(alcohol = 10,
                         volatile.acidity = 0.33,
                         sulphates = 1.1,
                         citric.acid = 0.33)
modelEstimate = predict(m4, newdata = thisredwine,
                        interval = "prediction", level = .95)
modelEstimate
## fit lwr upr
## 1 6.040043 4.779783 7.300303
message('Difference of model example result and actual: ', (1 - (6.04 / 7)))
## Difference of model example result and actual: 0.137142857142857
This linear model, fitted on the subset of the data where quality > 3, produces a fitted quality of 6.04, about 14% below the actual rating of 7. The dataset is affected by two primary issues: quality is highly concentrated at 5 and 6, and no variable correlates strongly with quality on its own. Any linear model will be affected by this skew and by the lack of strongly correlated predictors (m4 has an R-squared of only 0.331). Still, the actual value of 7 falls within the model's 95% prediction interval (4.78 to 7.30), so the model has some validity if understood with that reservation.
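The fitted value can be checked by hand from the m4 coefficients in the regression table. The sketch below uses those coefficients as printed (rounded to three decimals), so it only approximates what `predict()` returns; `predict_by_hand` is a hypothetical helper. It evaluates both the inputs in `thisredwine` and the measurements printed for row 444, since the two sets of values differ.

```r
# m4 coefficients copied from the regression table (rounded to 3 decimals)
m4_coef <- c(intercept = 2.578, alcohol = 0.308, volatile.acidity = -1.109,
             sulphates = 0.687, citric.acid = -0.027)

# Hypothetical helper: linear predictor of m4 for a single wine
predict_by_hand <- function(alcohol, volatile.acidity, sulphates, citric.acid) {
  unname(
    m4_coef["intercept"] +
      m4_coef["alcohol"] * alcohol +
      m4_coef["volatile.acidity"] * volatile.acidity +
      m4_coef["sulphates"] * sulphates +
      m4_coef["citric.acid"] * citric.acid
  )
}

# Inputs used in thisredwine
example_fit <- predict_by_hand(10, 0.33, 1.1, 0.33)

# Measurements printed for pf[444, ]
row444_fit <- predict_by_hand(11.6, 0.44, 0.63, 0.49)

round(c(example = example_fit, row444 = row444_fit), 2)
# about 6.04 and 6.08 with these rounded coefficients
```

Passing `pf[444, ]` directly as `newdata` to `predict()` would evaluate the model on that wine's own measurements without retyping them.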
This plot displays box plots of volatile acidity by quality, with alcohol shown by color. As quality increases, wines have lower volatile acidity and higher alcohol content, which suggests that both are associated with wine quality.
This plot displays box plots of sulphates by quality, with alcohol shown by color. Sulphates and alcohol content both increase as quality increases, which suggests that sulphates are associated with wine quality.
This plot displays box plots of citric acid by quality, with alcohol shown by color. Higher quality wines had higher citric acid and alcohol content, which suggests that citric acid is associated with wine quality.
Even though this was not a complicated dataset, a good amount of work was needed to understand which variables appeared to have some impact on quality. This was difficult to see initially, since no variable has an immediately visible strong correlation with quality, as seen above in a Spearman correlation coefficient matrix. Initial plots and correlation calculations suggested how to proceed with the factors that had acceptable correlation to quality. After those initial plots, some insight arose: alcohol, volatile acidity, sulphates and citric acid had some correlation with quality and therefore appeared to affect it. When they were analyzed together in multivariable analysis with box plots by quality, a relationship emerged whereby wines of higher quality had higher alcohol content, lower volatile acidity, higher sulphates and higher citric acid. This insight took some time to arrive at, but proceeding through single-variable, bivariable and then multivariable analysis made it possible to establish.
One issue with this dataset, and an improvement that would make the analysis more insightful, is the need for more data across a wider range of wine quality. On a quality scale of 0 to 10, most wines in this dataset are rated 5 or 6, so any conclusion about which variables affect quality carries the reservation of being too narrow because of that skew. More data on lesser and finer wines could give better insight into which chemical components most affect quality overall, beyond wines rated 5-6 out of 10.