If you would like to follow along, you can find the Python notebook on my GitHub here.
About the Data Source
The dataset explored here is from whiskyanalysis.com. It covers the Cost, Country, Type, and Class of an assortment of whiskies, as well as the Cluster and Super Cluster of their flavor profiles. It also includes the normalized Meta Critic score (as defined on the site's methodology page), the Standard Deviation of that normalized score, and the number of reviews that contributed to it.
As stated on the webpage for the data itself: “The ultimate goal of this site is really one of arbitrage – taking advantage of the price differential between two or more whiskies of varying quality. You do this by comparing whiskies prices to whisky quality, among whiskies of similar flavour. The goal of this site is to objectively define those two key characteristics for you.” The site handles this by normalizing the Meta Critic score for each whisky (as defined above) and by providing key qualities for each one.
Class, Country, and Type give an idea of overall profile qualities, while Cluster and Super Cluster describe specific flavor profiles.
Cost values represent a range of prices in CAD. Actual prices vary by vendor, but a whisky can generally be found within the listed band. The bands are <$30, $30-$50, $50-$70, $70-$125, $125-$300, and >$300. Once cleaned, these are represented with a letter grade from A to F, where A is the most expensive and F the least.
For more information on variables and their meanings, see the how to read the database page on the source site.
Variables Kept and Variables Discarded
While there is a lot of good data in this dataset, I decided to focus on only a few of the variables presented. I wanted to make most of my comparisons against the Meta Critic ratings of the whiskies, but I was less interested in their Name, Number of Reviews, or Standard Deviation. I determined that Cost and Country are very valuable for such a comparison, as are Class and Type; these four variables I kept. Cluster was also usable, but Super Cluster seemed superfluous for my purposes, since it is just a combination of flavor-adjacent Clusters.
Variables Kept: Meta Critic, Cost, Class, Cluster, Country, Type
Variables Discarded: Whisky, STDEV, #, Super Cluster
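To act on this split, the discarded columns can be dropped up front. A minimal sketch (the column names are assumed to match the CSV headers described above; the one-row DataFrame only stands in for the real file):

```python
import pandas as pd

# Columns kept for the analysis, per the lists above
cols_kept = ["Meta Critic", "Cost", "Class", "Cluster", "Country", "Type"]

# Stand-in row with the full set of columns described in the text
whisky_df = pd.DataFrame({
    "Whisky": ["Example Dram"], "Meta Critic": [8.9], "STDEV": [0.3], "#": [20],
    "Cost": ["$$$"], "Class": ["SingleMalt-like"], "Super Cluster": ["ABC"],
    "Cluster": ["A"], "Country": ["Scotland"], "Type": ["Malt"],
})

# Restrict the frame to the kept columns; Whisky, STDEV, #, and
# Super Cluster fall away
whisky_df = whisky_df[cols_kept]
```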
# import libraries and read in dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
whisky_df = pd.read_csv("Whisky_Analysis.csv")
In [ ]:
# Check for missing values
whisky_df.isnull().sum()
Out[ ]:
Whisky 0
Meta Critic 0
STDEV 0
# 0
Cost 2
Class 0
Super Cluster 570
Cluster 302
Country 0
Type 0
dtype: int64
In [ ]:
whisky_df['Cost'].isnull().sum()
Out[ ]:
2
Null Values
There are a significant number of missing values for the Cluster variable. While I could have dropped those rows, I instead assigned them a new value of “U” for “Unknown.” This lets me still use that data for other variable considerations, such as comparing Meta Critic to Cost.
We also have two missing Cost values; these could simply be dropped, but I opted to keep them in case they became useful.
In [ ]:
# Replace all NaN values for Cluster column with "U" for unknown
# (assignment avoids the deprecated inplace=True on a single column)
whisky_df["Cluster"] = whisky_df["Cluster"].fillna("U")
In [ ]:
# Check that the number of "U" values is equal to the previous
# .isnull().sum() result for Cluster
whisky_df['Cluster'].value_counts().sort_index()
Out[ ]:
A 99
B 50
C 218
E 224
F 42
G 142
H 72
I 177
J 122
R0 28
R1 67
R2 67
R3 46
R4 60
U 302
Name: Cluster, dtype: int64
In [ ]:
# Double checking is never bad. Check new isnull().sum()
whisky_df['Cluster'].isnull().sum()
Out[ ]:
0
Restructuring the Data
Some of the data needed to be reshaped into an easier form to work with. Cost was an ideal candidate due to the unwieldiness of the “$” values it used. I opted to use a letter “grade” instead, where “A” represents the most expensive band (>$300 CAD) and “F” the least expensive (<$30 CAD).
Country was another that could use some restructuring. There are significantly more Scotland entries than even the runner-up, USA. In fact, anything below Ireland was almost negligible in comparison. Because of this, and considering domain knowledge of generally expected whisky-producing countries, I combined everything from Sweden to France under the new entry “Other”.
In [ ]:
# Checking the value counts for the prime restructuring candidate, Cost
whisky_df['Cost'].value_counts().sort_index()
Out[ ]:
$ 87
$$ 201
$$$ 324
$$$$ 580
$$$$$ 343
$$$$$+ 179
Name: Cost, dtype: int64
In [ ]:
# Checking the number of entries per country.
whisky_df['Country'].value_counts()
Out[ ]:
Scotland 928
USA 292
Canada 189
Ireland 81
Japan 75
Sweden 59
India 38
Taiwan 16
Wales 8
Switzerland 7
Finland 7
Tasmania 5
Netherlands 4
South Africa 3
England 2
Belgium 1
France 1
Name: Country, dtype: int64
In [ ]:
# Remap "Cost" to a more usable form
whisky_df["Cost Rank"] = whisky_df['Cost'].map({"$":"F", "$$":"E", "$$$":"D",
"$$$$":"C", "$$$$$":"B",
"$$$$$+":"A"})
# Remap all other countries into one
whisky_df['Country Condensed'] = whisky_df['Country'].map({"Scotland":"Scotland",
"USA":"USA", "Canada":"Canada",
"Ireland":"Ireland", "Japan":"Japan",
"Sweden":"Other", "India":"Other",
"Taiwan":"Other", "Wales": "Other",
"Switzerland":"Other",
"Finland":"Other", "Tasmania":"Other",
"Netherlands": "Other",
"South Africa":"Other",
"England":"Other", "Belgium":"Other",
"France":"Other"})
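As a design note, the long `.map` dictionary could be shortened with `Series.isin` plus `Series.where`; a sketch on a stand-in Series, assuming the same five countries are kept:

```python
import pandas as pd

# Stand-in for whisky_df['Country']
countries = pd.Series(["Scotland", "USA", "Sweden", "France", "Japan"])

# Keep the five largest producers; lump the rest into "Other".
# where() keeps values where the condition is True and substitutes
# the fallback everywhere else.
keep = ["Scotland", "USA", "Canada", "Ireland", "Japan"]
condensed = countries.where(countries.isin(keep), "Other")
```

This also means any country added to the dataset later falls into "Other" automatically instead of mapping to NaN.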
In [ ]:
whisky_df['Country Condensed'].value_counts()
Out[ ]:
Scotland 928
USA 292
Canada 189
Other 151
Ireland 81
Japan 75
Name: Country Condensed, dtype: int64
In [ ]:
whisky_df['Cost Rank'].value_counts().sort_index()
Out[ ]:
A 179
B 343
C 580
D 324
E 201
F 87
Name: Cost Rank, dtype: int64
Determining Useful Data Comparisons
One of the key skills of a Data Scientist is the ability to determine which data comparisons are useful. And central to this is understanding what question or problem those comparisons are answering.
In our case, we are looking at the commonalities between the highest-ranking whiskies. ‘Meta Critic’, the normalized score for each whisky, becomes the variable that we will compare everything else to.
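The basic pattern for those comparisons is a groupby on one categorical variable with the mean Meta Critic per group; a minimal sketch on synthetic stand-in values:

```python
import pandas as pd

# Synthetic stand-in data: score by cost grade
df = pd.DataFrame({
    "Meta Critic": [8.9, 8.5, 8.7, 8.1],
    "Cost Rank": ["A", "C", "A", "F"],
})

# Mean score per group, best-scoring groups first
means = df.groupby("Cost Rank")["Meta Critic"].mean().sort_values(ascending=False)
```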
In [ ]:
pd.crosstab(whisky_df['Meta Critic'], whisky_df['Country Condensed'])
Out[ ]:
Country Condensed Canada Ireland Japan Other Scotland USA
Meta Critic
6.47 0 0 0 1 0 0
6.60 0 0 1 0 0 0
6.80 1 0 0 0 0 0
6.88 0 0 0 0 1 0
6.91 0 0 0 0 1 0
... ... ... ... ... ... ...
9.44 1 0 0 0 1 0
9.48 0 0 0 0 2 0
9.50 0 0 0 0 1 0
9.51 0 0 0 0 1 0
9.54 1 0 0 0 0 0
208 rows × 6 columns
In [ ]:
whisky_df.groupby(["Cost Rank"])["Meta Critic"].mean()
Out[ ]:
Cost Rank
A 8.961229
B 8.744927
C 8.618138
D 8.386883
E 8.224677
F 7.881609
Name: Meta Critic, dtype: float64
In [ ]:
# Equivalent alternative syntax for the same grouped mean
whisky_df["Meta Critic"].groupby(whisky_df["Cost Rank"]).mean()
Out[ ]:
Cost Rank
A 8.961229
B 8.744927
C 8.618138
D 8.386883
E 8.224677
F 7.881609
Name: Meta Critic, dtype: float64
In [ ]:
whisky_df.groupby(["Cluster"])["Meta Critic"].mean()
Out[ ]:
Cluster
A 8.798788
B 8.588400
C 8.646697
E 8.528304
F 8.456190
G 8.436197
H 8.320833
I 8.713672
J 8.782213
R0 8.565357
R1 8.607761
R2 8.579851
R3 8.591522
R4 8.612833
U 8.313411
Name: Meta Critic, dtype: float64
In [ ]:
whisky_df.groupby(["Country"])["Meta Critic"].mean()
Out[ ]:
Country
Belgium 8.310000
Canada 8.411005
England 8.210000
Finland 8.488571
France 7.290000
India 8.790526
Ireland 8.415556
Japan 8.528533
Netherlands 8.347500
Scotland 8.581940
South Africa 8.186667
Sweden 8.560847
Switzerland 7.748571
Taiwan 8.785625
Tasmania 8.534000
USA 8.595788
Wales 8.101250
Name: Meta Critic, dtype: float64
In [ ]:
whisky_df.groupby(['Country Condensed'])["Meta Critic"].mean()
Out[ ]:
Country Condensed
Canada 8.411005
Ireland 8.415556
Japan 8.528533
Other 8.548411
Scotland 8.581940
USA 8.595788
Name: Meta Critic, dtype: float64
In [ ]:
whisky_df.groupby(['Type'])["Meta Critic"].mean()
Out[ ]:
Type
Barley 8.550000
Blend 8.312712
Bourbon 8.598039
Flavoured 7.360000
Grain 8.287143
Malt 8.603295
Rye 8.645696
Wheat 8.500000
Whiskey 8.310000
Name: Meta Critic, dtype: float64
In [ ]:
whisky_df['Meta Critic'].describe()
Out[ ]:
count 1716.000000
mean 8.552331
std 0.396664
min 6.470000
25% 8.330000
50% 8.600000
75% 8.830000
max 9.540000
Name: Meta Critic, dtype: float64
In [ ]:
# Let's see what a scatterplot of the two looks like
plt.scatter(whisky_df["Country Condensed"], whisky_df['Meta Critic']);
In [ ]:
# The scatter is too cluttered to show anything specific. A boxplot might be better
whisky_df.boxplot(column='Meta Critic', by='Country Condensed');
In [ ]:
# Again, with different variables
whisky_df.boxplot(column="Meta Critic", by='Cost Rank');
In [ ]:
whisky_df.boxplot(column='Meta Critic', by='Cluster');
In [ ]:
whisky_df.boxplot(column='Meta Critic', by='Class');
In [ ]:
whisky_df.boxplot(column='Meta Critic', by='Type');
In [ ]:
# Are there any useful comparisons not utilizing 'Meta Critic'?
pd.crosstab(whisky_df['Cluster'], whisky_df['Type'])
Out[ ]:
Type Barley Blend Bourbon Flavoured Grain Malt Rye Wheat Whiskey
Cluster
A 0 0 0 0 0 99 0 0 0
B 0 0 0 0 0 50 0 0 0
C 0 3 0 0 0 215 0 0 0
E 0 10 0 0 0 214 0 0 0
F 0 1 0 0 0 41 0 0 0
G 0 7 0 0 0 135 0 0 0
H 0 0 0 0 0 72 0 0 0
I 0 0 0 0 0 177 0 0 0
J 0 0 0 0 0 122 0 0 0
R0 0 1 23 0 2 0 0 1 1
R1 0 0 67 0 0 0 0 0 0
R2 0 0 67 0 0 0 0 0 0
R3 0 0 46 0 0 0 0 0 0
R4 0 2 0 0 0 0 58 0 0
U 1 271 1 1 5 1 21 1 0
In [ ]:
# What does this look like as a stacked bar chart?
flavor_type = pd.crosstab(whisky_df['Cluster'], whisky_df['Type'])
flavor_type.plot(kind='barh', stacked=True);
In [ ]:
# This next one is not really useful information because it is literally
# defined by the Cluster code definitions already
pd.crosstab(whisky_df['Cluster'], whisky_df['Class'])
Out[ ]:
Class Bourbon-like Rye-like Scotch-like SingleMalt-like
Cluster
A 0 0 0 99
B 0 0 0 50
C 0 0 0 218
E 0 0 0 224
F 0 0 0 42
G 0 0 0 142
H 0 0 0 72
I 0 0 0 177
J 0 0 0 122
R0 28 0 0 0
R1 67 0 0 0
R2 67 0 0 0
R3 46 0 0 0
R4 0 60 0 0
U 12 155 135 0
In [ ]:
# Similar to previous, but more useful
pd.crosstab(whisky_df['Type'], whisky_df['Class'])
Out[ ]:
Class Bourbon-like Rye-like Scotch-like SingleMalt-like
Type
Barley 0 1 0 0
Blend 11 134 129 21
Bourbon 204 0 0 0
Flavoured 1 0 0 0
Grain 2 0 5 0
Malt 0 0 1 1125
Rye 0 79 0 0
Wheat 1 1 0 0
Whiskey 1 0 0 0
In [ ]:
# Let's see this as a stacked bar too
# The Malt value skews the graph and makes it hard to read
type_class = pd.crosstab(whisky_df['Type'], whisky_df['Class'])
type_class.plot(kind='barh', stacked=True);
In [ ]:
pd.crosstab(whisky_df['Cluster'], whisky_df['Country Condensed'])
Out[ ]:
Country Condensed Canada Ireland Japan Other Scotland USA
Cluster
A 0 1 2 7 89 0
B 3 3 1 8 34 1
C 1 12 5 29 170 1
E 9 16 12 39 144 4
F 1 1 5 4 31 0
G 15 6 18 13 88 2
H 1 4 6 11 50 0
I 5 1 11 16 142 2
J 0 1 1 17 103 0
R0 1 0 0 0 0 27
R1 0 0 0 0 0 67
R2 0 0 0 0 0 67
R3 0 0 0 0 0 46
R4 0 0 0 0 0 60
U 153 36 14 7 77 15
In [ ]:
country_cluster = pd.crosstab(whisky_df['Country Condensed'], whisky_df['Cluster'])
country_cluster.plot(kind='barh', stacked=True);
Final Notes
What I want to do better
I wanted to try looking at correlation heat maps using dython, but I ran out of time.
I wasn’t able to clean up the visualizations as much as I would like.
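As a stand-in for a dython-style association heat map, pairwise Cramér's V between the categorical columns could be computed directly; a minimal numpy sketch on synthetic data (no bias correction applied):

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 to 1)."""
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    # Expected counts under independence, then the chi-square statistic
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

# Synthetic example: Class and Type perfectly associated
df = pd.DataFrame({
    "Class": ["Bourbon-like", "Bourbon-like", "SingleMalt-like", "SingleMalt-like"],
    "Type": ["Bourbon", "Bourbon", "Malt", "Malt"],
})
v = cramers_v(df["Class"], df["Type"])
```

Computing this for every pair of categorical columns and feeding the matrix to `seaborn.heatmap` would give the heat map described above.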
Takeaways
Scotch Single Malts seem to be the preferred type of whisky for the data gatherer. This means the results are weighted toward these categories.
Country of Origin has little effect on the overall score of a whisky.
The Cost of a whisky appears proportionally correlated with its score. It could be that more expensive whiskies are objectively better, but it could also partly be consumers rationalizing the higher expenditure as “better.” It would be interesting to repeat this with a blind group who did not know the cost of the whiskies they were trying.
Scotland has a less homogeneous flavor profile than I initially would have thought. In retrospect, this makes sense considering the breadth of its sub-regions.
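The cost takeaway above can be sanity-checked with a rank correlation, treating Cost Rank as an ordinal variable; a sketch using the per-grade mean scores computed earlier in the notebook:

```python
import pandas as pd

# Map the letter grades to an ordinal scale (F cheapest ... A priciest)
rank_to_num = {"F": 1, "E": 2, "D": 3, "C": 4, "B": 5, "A": 6}

# Mean Meta Critic per Cost Rank, taken from the groupby output above
df = pd.DataFrame({
    "Cost Rank": ["A", "B", "C", "D", "E", "F"],
    "Meta Critic": [8.961229, 8.744927, 8.618138, 8.386883, 8.224677, 7.881609],
})

# Spearman rank correlation between cost grade and mean score
rho = df["Cost Rank"].map(rank_to_num).corr(df["Meta Critic"], method="spearman")
```

On these group means the relationship is perfectly monotonic, so the rank correlation is 1.0; running it on the full per-whisky data would give a more honest (and weaker) figure.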