Commit b057c24

Peter McKeever authored and committed
added blog post
1 parent a7811df commit b057c24

9 files changed: +283 -0 lines changed

@@ -0,0 +1,283 @@
---
title: "Create Ridgeplots in Matplotlib"
date: 2020-02-15T09:50:16+01:00
draft: false
description: "This post details how to leverage gridspec to create ridgeplots in matplotlib"
categories: ["tutorials"]
displayInList: true
author: Peter McKeever
resources:
- name: featuredImage
  src: "sample_output.png"
  params:
    description: "A sample ridge plot used as a feature image for this post"
    showOnTop: true
---

# Introduction

This post outlines how we can leverage [gridspec](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.gridspec.GridSpec.html) to create ridgeplots in Matplotlib. While this is a relatively straightforward tutorial, some experience working with sklearn would be beneficial. Covering sklearn properly would be a _vast_ undertaking, so this will not be an sklearn tutorial; those interested can read through the docs [here](https://scikit-learn.org/stable/user_guide.html). We will, however, be making use of its `KernelDensity` class from `sklearn.neighbors`.

<!--
# Contents
- [Packages](#packages)
- [Data](#data)
- [GridSpec](#gs1)
- [Kernel Density Estimation](#kde)
- [Overlapping Axes Objects](#gs2)
- [Complete Snippet](#snippet)
-->

### Packages <a id="packages"></a>

````
import pandas as pd
import numpy as np
from sklearn.neighbors import KernelDensity

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as grid_spec
````

### Data <a id="data"></a>

I'll be using some mock data I created. You can grab the dataset from GitHub [here](https://github.com/petermckeever/mock-data/blob/master/datasets/mock-european-test-results.csv) if you want to play along. The data looks at aptitude test scores broken down by country, age, and sex.

````
data = pd.read_csv("mock-european-test-results.csv")
````

| country | age | sex | score |
| ---- |----|----| ---- |
| Italy | 21 | female | 0.77 |
| Spain | 20 | female | 0.87 |
| Italy | 24 | female | 0.39 |
| United Kingdom | 20 | female | 0.70 |
| Germany | 20 | male | 0.25 |
| ... | | | |

### GridSpec <a id="gs1"></a>

GridSpec is a Matplotlib class (from the `matplotlib.gridspec` module) that lets us easily create subplots. We can control the number of subplots, their positions, their height and width, and the spacing between them. As a basic example, let's create a quick template. The key parameters we'll be focusing on are `nrows`, `ncols`, and `width_ratios`.

`nrows` and `ncols` divide our figure into areas we can add axes to. `width_ratios` controls the relative width of each of our columns. If we create something like `GridSpec(2,2,width_ratios=[2,1])`, we are dividing our figure into 2 rows and 2 columns and setting the width ratio to 2:1, i.e. the first column will be twice as wide as the second.

What's great about GridSpec is that once we have created those subsets, we are not _bound_ to them, as we will see below.

**Note**: I am using my own theme, so plots will look different. Creating custom themes is outside the scope of this tutorial (but I may write one).

````
gs = (grid_spec
      .GridSpec(2,2,width_ratios=[2,1]))

fig = plt.figure(figsize=(8,6))

# two stacked axes in the wide left column, one tall axes in the right column
ax = fig.add_subplot(gs[0:1,0])
ax1 = fig.add_subplot(gs[1:,0])
ax2 = fig.add_subplot(gs[0:,1:])

ax_objs = [ax,ax1,ax2]
n = ["",1,2]

# label each axes object with the name of its variable
for ax_obj, suffix in zip(ax_objs, n):
    ax_obj.text(0.5,0.5,"ax{}".format(suffix),
                ha="center",color="red",
                fontweight="bold",size=20)

plt.show()
````
![](basic_template.png)

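If you want to check the 2:1 split numerically rather than by eye, here is a minimal sketch that measures the cells of the same `GridSpec(2,2,width_ratios=[2,1])` layout. It assumes a non-interactive `Agg` backend so it runs without a display, and the variable names are mine, chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import matplotlib.gridspec as grid_spec

fig = plt.figure(figsize=(8, 6))
gs = grid_spec.GridSpec(2, 2, width_ratios=[2, 1])

# SubplotSpec.get_position gives the cell's bounding box in figure coordinates
left_width = gs[0, 0].get_position(fig).width
right_width = gs[0, 1].get_position(fig).width
width_ratio = left_width / right_width

# a slice spanning both rows covers both cells plus the gap between them
tall_height = gs[0:, 1].get_position(fig).height
single_height = gs[0, 1].get_position(fig).height

print("left / right width:", round(width_ratio, 3))
print("spanning cell is taller:", tall_height > single_height)
plt.close(fig)
```

The ratio applies to the columns relative to each other, not to the figure as a whole, which is why the margins and the gap between columns do not affect it.
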
I won't get into more detail about what everything does here. If you are interested in learning more about figures, axes, and gridspec, Akash Palrecha has [written a very nice guide here](https://matplotlib.org/matplotblog/posts/an-inquiry-into-matplotlib-figures/).

### Kernel Density Estimation <a id="kde"></a>

We have a couple of options here. By far the easiest is to stick with the plotting methods built into pandas. All that's needed is to select the column and add `plot.kde`. This defaults to Scott's rule for the bandwidth, but you can choose Silverman's rule or supply your own value. Let's use GridSpec again to plot the distribution for each country. First we'll grab the unique country names and create a list of colors.

````
countries = [x for x in np.unique(data.country)]
colors = ['#0000ff', '#3300cc', '#660099', '#990066', '#cc0033', '#ff0000']
````

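As an aside on the bandwidth options mentioned above: as far as I'm aware, pandas hands its `bw_method` argument to SciPy's `gaussian_kde`, so we can look at the three choices directly. The sketch below uses synthetic scores (an assumption, standing in for one country's `score` column) and compares the resulting bandwidth factors with the closed forms for one-dimensional data.

```python
import numpy as np
from scipy.stats import gaussian_kde

# assumption: synthetic scores in [0, 1], not the real dataset
rng = np.random.default_rng(0)
scores = rng.beta(5, 3, size=200)
n = scores.size

scott = gaussian_kde(scores, bw_method="scott")
silverman = gaussian_kde(scores, bw_method="silverman")
custom = gaussian_kde(scores, bw_method=0.1)  # a scalar is used as the factor itself

# closed forms of the two rules for one-dimensional data
scott_expected = n ** (-1 / 5)
silverman_expected = (n * 3 / 4) ** (-1 / 5)

print("scott factor:", scott.factor)
print("silverman factor:", silverman.factor)
print("custom factor:", custom.factor)
```

In the plotting code this would correspond to something like `.score.plot.kde(bw_method="silverman")`, or a number in place of the string. Smaller factors give a spikier curve, larger ones a smoother curve.
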
Next we'll loop through each country and color to plot our data. Unlike the above, we will not explicitly declare how many rows we want to plot, which keeps our code dynamic. If we hard-code a specific number of rows and a specific number of axes objects, the code only works for that one dataset. This is a bit of an aside, but when creating visualisations you should always aim to reduce and reuse. By reduce, we specifically mean declaring fewer variables and cutting the unnecessary code that comes with them. We are plotting data for six countries; what happens if we get data for 20 countries? That's a lot of additional code. Relatedly, by not explicitly declaring those variables we make our code adaptable and ready to be scripted to automatically create new plots when new data of the same kind becomes available.

````
gs = (grid_spec
      .GridSpec(len(countries),1)
      )

fig = plt.figure(figsize=(8,6))

i = 0

# creating empty list
ax_objs = []

for country in countries:
    # creating new axes object and appending to ax_objs
    ax_objs.append(fig.add_subplot(gs[i:i+1, 0:]))

    # plotting the distribution
    plot = (data[data.country == country]
            .score.plot.kde(ax=ax_objs[-1],color="#f0f0f0", lw=0.5)
            )

    # grabbing x and y data from the kde line (public Line2D API
    # rather than the private _x and _y attributes)
    x, y = plot.get_lines()[0].get_data()

    # filling the space beneath the distribution
    ax_objs[-1].fill_between(x,y,color=colors[i])

    # setting uniform x and y lims
    ax_objs[-1].set_xlim(0, 1)
    ax_objs[-1].set_ylim(0,2.2)

    i += 1

plt.tight_layout()
plt.show()
````

![](grid_spec_distro.png)

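One thing that still ties the loop to six countries is the hand-written `colors` list. In the spirit of reduce and reuse, here is a minimal sketch of generating the colors instead. `blue_to_red` is a hypothetical helper of my own, not something from Matplotlib; it simply interpolates linearly from pure blue to pure red, and for six steps it should give the same values as the list above.

```python
import numpy as np
from matplotlib.colors import to_hex

def blue_to_red(n):
    """Return n hex colors evenly spaced from pure blue to pure red."""
    # hypothetical helper for illustration, not part of any library
    return [to_hex((float(t), 0.0, 1.0 - float(t))) for t in np.linspace(0, 1, n)]

six = blue_to_red(6)
twenty = blue_to_red(20)

print(six)
print(len(twenty))
```

With this in place, `colors = blue_to_red(len(countries))` would scale to 20 countries without any further edits. A named Matplotlib colormap would work just as well; the linear blend is only chosen to match the palette used in this post.
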
We're not quite at ridge plots yet, but let's look at what's going on here. You'll notice that instead of setting an explicit number of rows, we've set it to the length of our countries list: `gs = (grid_spec.GridSpec(len(countries),1))`. This gives us the flexibility to plot more or fewer countries in future without needing to adjust the code.

At the top of the for loop we create each axes object: `ax_objs.append(fig.add_subplot(gs[i:i+1, 0:]))`. Before the loop we declared `i = 0`, so on the first pass we are saying: create an axes object from row 0 to 1. The next time the loop runs it creates an axes object from row 1 to 2, then 2 to 3, 3 to 4, and so on.

Following this we can use `ax_objs[-1]` to access the last created axes object to use as our plotting area.

Next, we create the kde plot. We declare this as a variable so we can retrieve the x and y values to use in the `fill_between` that follows.

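If the slice notation feels opaque, the following sketch may help. It checks that, for a single-column grid, `gs[i:i+1, 0:]` selects exactly the same cell as the plain index `gs[i, 0]`, and that rows are laid out from the top of the figure downwards. The row count of six is an assumption matching the six colors used in this post.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import matplotlib.gridspec as grid_spec
import numpy as np

n_rows = 6  # assumption: one row per country
fig = plt.figure(figsize=(8, 6))
gs = grid_spec.GridSpec(n_rows, 1)

same_cell = []
for i in range(n_rows):
    # bounds are (x0, y0, width, height) in figure coordinates
    sliced = gs[i:i+1, 0:].get_position(fig).bounds
    indexed = gs[i, 0].get_position(fig).bounds
    same_cell.append(np.allclose(sliced, indexed))

# row 0 is the top row, so its bottom edge sits above the top edge of row 1
row0_bottom = gs[0, 0].get_position(fig).y0
row1_top = gs[1, 0].get_position(fig).y1

print(all(same_cell), row0_bottom > row1_top)
plt.close(fig)
```

The slice form earns its keep when an axes object should span several rows or columns, as `ax2` did in the first template.
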
### Overlapping Axes Objects <a id="gs2"></a>

Once again using GridSpec, we can adjust the spacing between each of the subplots. We can do this by adding one line outside of the loop, before `plt.tight_layout()`. The exact value will depend on your distribution, so feel free to play around with it:

````
gs.update(hspace= -0.5)
````
![](grid_spec_distro_overlap_1.png)

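To see why a negative `hspace` produces overlap, here is a small sketch that measures the vertical gap between the first two rows before and after the update. `vertical_gap` is a hypothetical helper of my own, and the three-row grid is just an assumption to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import matplotlib.gridspec as grid_spec

fig = plt.figure(figsize=(8, 6))
gs = grid_spec.GridSpec(3, 1)

def vertical_gap(gs, fig):
    """Bottom edge of row 0 minus top edge of row 1, in figure coordinates."""
    upper = gs[0, 0].get_position(fig)
    lower = gs[1, 0].get_position(fig)
    return upper.y0 - lower.y1

gap_default = vertical_gap(gs, fig)   # positive: the rows are separated
gs.update(hspace=-0.5)
gap_overlap = vertical_gap(gs, fig)   # negative: row 1 now reaches up into row 0

print("default gap:", round(gap_default, 3))
print("gap with hspace=-0.5:", round(gap_overlap, 3))
plt.close(fig)
```

`hspace` is expressed as a fraction of the average axes height, so `-0.5` means each axes object overlaps its neighbour by roughly half its own height.
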
Now our axes objects are overlapping! Great-ish. Each axes object is hiding the one layered below it. We _could_ just add `ax_objs[-1].axis("off")` to our for loop, but if we do that we will lose our xticklabels. Instead we will create a variable to access the background of each axes object and make it transparent, and we will loop through each line of the border (spine) to turn them off. As we _only_ need the xticklabels for the final plot, we will add an if statement to handle that. We will also add in our country labels here. In our for loop we add:

````
    # make background transparent
    rect = ax_objs[-1].patch
    rect.set_alpha(0)

    # remove borders, axis ticks, and labels
    ax_objs[-1].set_yticklabels([])
    ax_objs[-1].set_ylabel('')

    # keep the xticklabels on the final (bottom) plot only
    if i != len(countries)-1:
        ax_objs[-1].set_xticklabels([])

    spines = ["top","right","left","bottom"]
    for s in spines:
        ax_objs[-1].spines[s].set_visible(False)

    # put multi-word country names on separate lines
    country = country.replace(" ","\n")
    ax_objs[-1].text(-0.02,0,country,fontweight="bold",fontsize=14,ha="center")
````
![](grid_spec_distro_overlap_2.png)

As an alternative to the above, we can use the `KernelDensity` class from `sklearn.neighbors` to create our distribution. This gives us a bit more control over our bandwidth. The method here is taken from Jake VanderPlas's fantastic _Python Data Science Handbook_; you can read the full excerpt [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html). We can reuse most of the above code, but need to make a couple of changes. Rather than repeat myself, I'll add the full snippet here and you can see the changes and minor additions (a title and an x-axis label).

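Before the full snippet, one detail of `KernelDensity` is worth isolating: `score_samples` returns the _log_ of the density, which is why the plotting code wraps it in `np.exp`. The sketch below uses synthetic scores (an assumption, not the real dataset) and checks that the exponentiated curve behaves like a density. The wider bandwidth of 0.1 is only there for comparison.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# assumption: synthetic scores in [0, 1] standing in for one country
rng = np.random.default_rng(1)
scores = rng.beta(5, 3, size=300)

# evaluate on a grid wider than the data so the tails are included
grid = np.linspace(-0.5, 1.5, 2001)
step = grid[1] - grid[0]

def density_on_grid(bandwidth):
    """Fit a Gaussian KDE and return the density evaluated on the grid."""
    kde = KernelDensity(bandwidth=bandwidth, kernel="gaussian")
    kde.fit(scores[:, None])                    # sklearn expects a 2D array
    log_density = kde.score_samples(grid[:, None])
    return np.exp(log_density)                  # back from log space

narrow = density_on_grid(0.03)
wide = density_on_grid(0.1)

# a simple Riemann sum approximates the area under each curve
area_narrow = narrow.sum() * step
area_wide = wide.sum() * step

print("area (bandwidth 0.03):", round(area_narrow, 3))
print("area (bandwidth 0.10):", round(area_wide, 3))
print("narrow peak is at least as high:", narrow.max() >= wide.max())
```

Note that the full snippet evaluates the density only on `np.linspace(0,1, 1000)`, so any probability mass the kernel places outside the 0 to 1 range is simply not drawn.
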
### Complete Plot Snippet <a id="snippet"></a>

````
countries = [x for x in np.unique(data.country)]
colors = ['#0000ff', '#3300cc', '#660099', '#990066', '#cc0033', '#ff0000']

gs = grid_spec.GridSpec(len(countries),1)
fig = plt.figure(figsize=(16,9))

i = 0

ax_objs = []
for country in countries:
    x = np.array(data[data.country == country].score)
    x_d = np.linspace(0,1, 1000)

    kde = KernelDensity(bandwidth=0.03, kernel='gaussian')
    kde.fit(x[:, None])

    # score_samples returns the log of the density
    logprob = kde.score_samples(x_d[:, None])

    # creating new axes object
    ax_objs.append(fig.add_subplot(gs[i:i+1, 0:]))

    # plotting the distribution
    ax_objs[-1].plot(x_d, np.exp(logprob),color="#f0f0f0",lw=1)
    ax_objs[-1].fill_between(x_d, np.exp(logprob), alpha=1,color=colors[i])

    # setting uniform x and y lims
    ax_objs[-1].set_xlim(0,1)
    ax_objs[-1].set_ylim(0,2.5)

    # make background transparent
    rect = ax_objs[-1].patch
    rect.set_alpha(0)

    # remove borders, axis ticks, and labels
    ax_objs[-1].set_yticklabels([])

    if i == len(countries)-1:
        ax_objs[-1].set_xlabel("Test Score", fontsize=16,fontweight="bold")
    else:
        ax_objs[-1].set_xticklabels([])

    spines = ["top","right","left","bottom"]
    for s in spines:
        ax_objs[-1].spines[s].set_visible(False)

    adj_country = country.replace(" ","\n")
    ax_objs[-1].text(-0.02,0,adj_country,fontweight="bold",fontsize=14,ha="right")

    i += 1

gs.update(hspace=-0.7)

fig.text(0.07,0.85,"Distribution of Aptitude Test Results from 18 – 24 year-olds",fontsize=20)

plt.tight_layout()
plt.show()
````
![](grid_spec_distro_overlap_3.png)

I'll finish this off with a little project to put the above code into practice. The data provided also contains information on whether the test taker was male or female. Using the above code as a template, see how you get on creating something like this:

![](split_ridges.png)

For those more ambitious, this could be turned into a split violin plot with males on one side and females on the other. Is there a way to combine the ridge and violin plot?

I'd love to see what people come back with, so if you do create something, send it to me on Twitter [here](http://twitter.com/petermckeever)!