| 1 | +--- |
| 2 | +title: "Create Ridgeplots in Matplotlib" |
| 3 | +date: 2020-02-15T09:50:16+01:00 |
| 4 | +draft: false |
| 5 | +description: "This post details how to leverage gridspec to create ridgeplots in matplotlib" |
| 6 | +categories: ["tutorials"] |
| 7 | +displayInList: true |
| 8 | +author: Peter McKeever |
| 9 | +resources: |
| 10 | +- name: featuredImage |
| 11 | +  src: "sample_output.png" |
| 12 | +  params: |
| 13 | +    description: "A sample ridge plot used as a feature image for this post" |
| 14 | +    showOnTop: true |
| 15 | + |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +# Introduction |
| 20 | + |
| 21 | +This post will outline how we can leverage [gridspec](https://matplotlib.org/3.1.3/api/_as_gen/matplotlib.gridspec.GridSpec.html) to create ridgeplots in Matplotlib. While this is a relatively straightforward tutorial, some experience working with sklearn would be beneficial. Covering sklearn properly would be a _vast_ undertaking, so this will not be an sklearn tutorial; those interested can read through the docs [here](https://scikit-learn.org/stable/user_guide.html). However, we will be making use of its `KernelDensity` class from `sklearn.neighbors`. |
| 22 | + |
| 23 | +<!-- |
| 24 | +# Contents |
| 25 | + - [Packages](#packages) |
| 26 | + - [Data](#data) |
| 27 | + - [GridSpec](#gs1) |
| 28 | + - [Kernel Density Estimation](#kde) |
| 29 | + - [Overlapping Axes Objects](#gs2) |
| 30 | + - [Complete Snippet](#snippet) |
| 31 | + --> |
| 32 | + |
| 33 | + |
| 34 | +### Packages <a id="packages"></a> |
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +```` |
| 39 | +import pandas as pd |
| 40 | +import numpy as np |
| 41 | +from sklearn.neighbors import KernelDensity |
| 42 | +
|
| 43 | +import matplotlib as mpl |
| 44 | +import matplotlib.pyplot as plt |
| 45 | +import matplotlib.gridspec as grid_spec |
| 46 | +```` |
| 47 | + |
| 48 | +### Data <a id="data"></a> |
| 49 | + |
| 50 | +I'll be using some mock data I created. You can grab the dataset from GitHub [here](https://github.com/petermckeever/mock-data/blob/master/datasets/mock-european-test-results.csv) if you want to play along. The data looks at aptitude test scores broken down by country, age, and sex. |
| 51 | + |
| 52 | +```` |
| 53 | +data = pd.read_csv("mock-european-test-results.csv") |
| 54 | +```` |
| 55 | + |
| 56 | + |
| 57 | + |
| 58 | +| country | age | sex | score | |
| 59 | +| ---- |----|----| ---- | |
| 60 | +| Italy | 21 | female | 0.77 | |
| 61 | +| Spain | 20 | female | 0.87 | |
| 62 | +| Italy | 24 | female | 0.39 | |
| 63 | +| United Kingdom | 20 | female | 0.70 | |
| 64 | +| Germany | 20 | male | 0.25 | |
| 65 | +| ... | | | | |
| 66 | + |
| 67 | + |
| 68 | + |
| 69 | +### GridSpec <a id="gs1"></a> |
| 70 | +GridSpec is a Matplotlib module that makes it easy to create subplots. We can control the number of subplots, their positions, their height and width, and the spacing between them. As a basic example, let's create a quick template. The key parameters we'll be focusing on are `nrows`, `ncols`, and `width_ratios`. |
| 71 | + |
| 72 | +`nrows` and `ncols` divide our figure into areas we can add axes to. `width_ratios` controls the relative width of each of our columns. If we create something like `GridSpec(2,2,width_ratios=[2,1])`, we are dividing our figure into 2 rows and 2 columns and setting the width ratio to 2:1, i.e., the first column will be twice as wide as the second. |
| 73 | + |
| 74 | +What's great about GridSpec is that once we have created those subsets, we are not _bound_ to them, as we will see below. |
| 75 | + |
| 76 | +**Note**: I am using my own theme, so plots will look different. Creating custom themes is outside the scope of this tutorial (but I may write one). |
| 77 | + |
| 78 | + |
| 79 | +```` |
| 80 | +gs = (grid_spec |
| 81 | + .GridSpec(2,2,width_ratios=[2,1])) |
| 82 | +
|
| 83 | +fig = plt.figure(figsize=(8,6)) |
| 84 | +
|
| 85 | +ax = fig.add_subplot(gs[0:1,0]) |
| 86 | +ax1 = fig.add_subplot(gs[1:,0]) |
| 87 | +ax2 = fig.add_subplot(gs[0:,1:]) |
| 88 | +
|
| 89 | +ax_objs = [ax,ax1,ax2] |
| 90 | +
|
| 91 | +n = ["",1,2] |
| 92 | +
|
| 96 | +i = 0 |
| 97 | +for ax_obj in ax_objs: |
| 98 | +    ax_obj.text(0.5,0.5,"ax{}".format(n[i]), |
| 99 | +                ha="center",color="red", |
| 100 | +                fontweight="bold",size=20) |
| 101 | +    i += 1 |
| 102 | +
|
| 103 | +plt.show() |
| 104 | +```` |
| 105 | + |
| 106 | + |
| 107 | +I won't get into more detail about what everything does here. If you are interested in learning more about figures, axes, and gridspec, Akash Palrecha has [written a very nice guide here](https://matplotlib.org/matplotblog/posts/an-inquiry-into-matplotlib-figures/). |
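
For anyone who wants to confirm the 2:1 split numerically, here is a minimal sketch. It assumes a headless `Agg` backend (chosen only so it runs without a display), and the names `ax_left`, `ax_right`, and `width_ratio` are illustrative rather than part of the template above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import matplotlib.gridspec as grid_spec

# same layout as the template: 2 rows, 2 columns, columns in a 2:1 ratio
gs = grid_spec.GridSpec(2, 2, width_ratios=[2, 1])
fig = plt.figure(figsize=(8, 6))

ax_left = fig.add_subplot(gs[0, 0])   # top-left cell (wide column)
ax_right = fig.add_subplot(gs[0, 1])  # top-right cell (narrow column)

# get_position() returns each axes' bounding box in figure coordinates
left_width = ax_left.get_position().width
right_width = ax_right.get_position().width
width_ratio = left_width / right_width

print(round(width_ratio, 2))
```

The printed ratio should come out very close to 2, which is the first column's width relative to the second, not relative to the whole figure.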
| 108 | + |
| 109 | +### Kernel Density Estimation <a id="kde"></a> |
| 110 | + |
| 111 | +We have a couple of options here. By far the easiest is to stick with the plotting methods built into pandas. All that's needed is to select the column and call `plot.kde`. This defaults to Scott's rule for the bandwidth, but you can choose Silverman's rule or supply your own value. Let's use GridSpec again to plot the distribution for each country. First we'll grab the unique country names and create a list of colors. |
| 112 | + |
| 113 | +```` |
| 114 | +countries = [x for x in np.unique(data.country)] |
| 115 | +colors = ['#0000ff', '#3300cc', '#660099', '#990066', '#cc0033', '#ff0000'] |
| 116 | +```` |
| 117 | +Next we'll loop through each country and color to plot our data. Unlike the above, we will not hard-code how many rows we want to plot, which makes our code more dynamic. If we set a specific number of rows and a specific number of axes objects, the code has to be edited every time the data changes. This is a bit of an aside, but when creating visualisations, you should always aim to reduce and reuse. By reduce, we specifically mean lessening the number of variables we declare and the unnecessary code associated with them. We are plotting data for six countries; what happens if we get data for 20 countries? That's a lot of additional code. Relatedly, by not explicitly declaring those variables, we make our code adaptable and ready to be scripted to automatically create new plots when new data of the same kind becomes available. |
| 118 | + |
| 119 | + |
| 120 | + |
| 121 | + |
| 122 | +```` |
| 123 | +
|
| 124 | +gs = (grid_spec |
| 125 | + .GridSpec(len(countries),1) |
| 126 | + ) |
| 127 | +
|
| 128 | +fig = plt.figure(figsize=(8,6)) |
| 129 | +
|
| 130 | +i = 0 |
| 131 | +
|
| 132 | +#creating empty list |
| 133 | +ax_objs = [] |
| 134 | +
|
| 135 | +for country in countries: |
| 136 | +    # creating new axes object and appending to ax_objs |
| 137 | +    ax_objs.append(fig.add_subplot(gs[i:i+1, 0:])) |
| 138 | + |
| 139 | +    # plotting the distribution |
| 140 | +    plot = (data[data.country == country] |
| 141 | +            .score.plot.kde(ax=ax_objs[-1],color="#f0f0f0", lw=0.5) |
| 142 | +           ) |
| 143 | + |
| 144 | +    # grabbing x and y data from the kde line via its public accessors |
| 145 | +    x = plot.get_lines()[0].get_xdata() |
| 146 | +    y = plot.get_lines()[0].get_ydata() |
| 147 | + |
| 148 | +    # filling the space beneath the distribution |
| 149 | +    ax_objs[-1].fill_between(x,y,color=colors[i]) |
| 150 | + |
| 151 | +    # setting uniform x and y lims |
| 152 | +    ax_objs[-1].set_xlim(0, 1) |
| 153 | +    ax_objs[-1].set_ylim(0,2.2) |
| 154 | + |
| 155 | +    i += 1 |
| 156 | + |
| 157 | +plt.tight_layout() |
| 158 | +plt.show() |
| 159 | +```` |
| 160 | + |
| 161 | + |
| 162 | + |
| 163 | +We're not quite at ridge plots yet, but let's look at what's going on here. You'll notice that instead of setting an explicit number of rows, we've set it to the length of our countries list: `gs = (grid_spec.GridSpec(len(countries),1))`. This gives us flexibility for future plotting, with the ability to plot more or fewer countries without needing to adjust the code. |
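
In the same spirit, the hard-coded list of six colors could also be generated from the number of countries. The sketch below is one possible approach, not part of the original code: `make_colors` is a hypothetical helper, and the blue-to-red endpoints are simply the first and last entries of the `colors` list above.

```python
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, to_hex

def make_colors(n, start="#0000ff", end="#ff0000"):
    """Return n hex colors evenly spaced between start and end (hypothetical helper)."""
    # build a two-color linear colormap and sample it at n evenly spaced points
    cmap = LinearSegmentedColormap.from_list("blue_to_red", [start, end])
    return [to_hex(cmap(v)) for v in np.linspace(0, 1, n)]

# six colors for the six countries in the mock data; len(countries) would work just as well
colors = make_colors(6)
print(colors)
```

Calling `make_colors(len(countries))` would keep the color list in step with the data, so 20 countries would get 20 colors without further changes.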
| 164 | + |
| 165 | +At the top of the for loop we create each axes object: `ax_objs.append(fig.add_subplot(gs[i:i+1, 0:]))`. Before the loop we declared `i = 0`, so on the first pass we create an axes object spanning row 0 to 1. The next time the loop runs it creates an axes object from row 1 to 2, then 2 to 3, 3 to 4, and so on. |
| 166 | + |
| 167 | +Following this we can use `ax_objs[-1]` to access the last created axes object to use as our plotting area. |
| 168 | + |
| 169 | +Next, we create the kde plot. We declare this as a variable so we can retrieve the x and y values to use in the `fill_between` that follows. |
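
The bandwidth options mentioned at the start of this section can be compared with a short sketch. It uses synthetic scores rather than the mock dataset, assumes SciPy is installed (pandas relies on it for `plot.kde`), and selects the `Agg` backend only so it runs without a display. The `curves` dictionary is an illustrative name.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# synthetic scores between 0 and 1, standing in for one country's data
rng = np.random.default_rng(0)
scores = pd.Series(rng.beta(2, 3, size=200), name="score")

fig, ax = plt.subplots()
curves = {}
for bw in ["scott", "silverman", 0.1]:
    # bw_method accepts "scott", "silverman", or a scalar factor
    scores.plot.kde(ax=ax, bw_method=bw, label=str(bw))
    # the line just drawn is the last one on the axes
    x, y = ax.get_lines()[-1].get_data()
    curves[str(bw)] = (x, y)

ax.legend()
```

Each entry in `curves` holds the x and y arrays of one density line, which is the same data the loop above passes to `fill_between`.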
| 170 | + |
| 171 | + |
| 172 | +### Overlapping Axes Objects <a id="gs2"></a> |
| 173 | + |
| 174 | +Once again using GridSpec, we can adjust the spacing between each of the subplots. We can do this by adding one line outside of the loop, before `plt.tight_layout()`. The exact value will depend on your distribution, so feel free to experiment: |
| 175 | + |
| 176 | +```` |
| 177 | +gs.update(hspace= -0.5) |
| 178 | +```` |
| 179 | + |
| 180 | + |
| 181 | +Now our axes objects are overlapping! Great-ish. Each axes object is hiding the one layered below it. We _could_ just add `ax_objs[-1].axis("off")` to our for loop, but if we do that we will lose our xticklabels. Instead we will create a variable to access the background of each axes object, and we will loop through each line of the border (spine) to turn them off. As we _only_ need the xticklabels for the final plot, we will add an if statement to handle that. We will also add in our country labels here. In our for loop we add: |
| 182 | + |
| 183 | +```` |
| 184 | +
|
| 185 | +    # make background transparent |
| 186 | +    rect = ax_objs[-1].patch |
| 187 | +    rect.set_alpha(0) |
| 188 | + |
| 189 | +    # remove borders, axis ticks, and labels |
| 190 | +    ax_objs[-1].set_yticklabels([]) |
| 191 | +    ax_objs[-1].set_ylabel('') |
| 192 | + |
| 193 | +    if i == len(countries)-1: |
| 194 | +        pass |
| 195 | +    else: |
| 196 | +        ax_objs[-1].set_xticklabels([]) |
| 197 | + |
| 198 | +    spines = ["top","right","left","bottom"] |
| 199 | +    for s in spines: |
| 200 | +        ax_objs[-1].spines[s].set_visible(False) |
| 201 | + |
| 202 | +    country = country.replace(" ","\n") |
| 203 | +    ax_objs[-1].text(-0.02,0,country,fontweight="bold",fontsize=14,ha="center") |
| 204 | +
|
| 205 | +```` |
| 206 | + |
| 207 | + |
| 208 | +As an alternative to the above, we can use the `KernelDensity` class from `sklearn.neighbors` to create our distribution. This gives us a bit more control over our bandwidth. The method here is taken from Jake VanderPlas's fantastic _Python Data Science Handbook_; you can read his full excerpt [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html). We can reuse most of the above code, but need to make a couple of changes. Rather than repeat myself, I'll add the full snippet here and you can see the changes and minor additions (added title, label to xaxis). |
| 209 | + |
| 210 | +### Complete Plot Snippet <a id="snippet"></a> |
| 211 | + |
| 212 | +```` |
| 213 | +countries = [x for x in np.unique(data.country)] |
| 214 | +colors = ['#0000ff', '#3300cc', '#660099', '#990066', '#cc0033', '#ff0000'] |
| 215 | +
|
| 216 | +gs = grid_spec.GridSpec(len(countries),1) |
| 217 | +fig = plt.figure(figsize=(16,9)) |
| 218 | +
|
| 219 | +i = 0 |
| 220 | +
|
| 221 | +ax_objs = [] |
| 222 | +for country in countries: |
| 224 | +    x = np.array(data[data.country == country].score) |
| 225 | +    x_d = np.linspace(0,1, 1000) |
| 226 | + |
| 227 | +    kde = KernelDensity(bandwidth=0.03, kernel='gaussian') |
| 228 | +    kde.fit(x[:, None]) |
| 229 | + |
| 230 | +    logprob = kde.score_samples(x_d[:, None]) |
| 231 | + |
| 232 | +    # creating new axes object |
| 233 | +    ax_objs.append(fig.add_subplot(gs[i:i+1, 0:])) |
| 234 | + |
| 235 | +    # plotting the distribution |
| 236 | +    ax_objs[-1].plot(x_d, np.exp(logprob),color="#f0f0f0",lw=1) |
| 237 | +    ax_objs[-1].fill_between(x_d, np.exp(logprob), alpha=1,color=colors[i]) |
| 238 | + |
| 239 | + |
| 240 | +    # setting uniform x and y lims |
| 241 | +    ax_objs[-1].set_xlim(0,1) |
| 242 | +    ax_objs[-1].set_ylim(0,2.5) |
| 243 | + |
| 244 | +    # make background transparent |
| 245 | +    rect = ax_objs[-1].patch |
| 246 | +    rect.set_alpha(0) |
| 247 | + |
| 248 | +    # remove borders, axis ticks, and labels |
| 249 | +    ax_objs[-1].set_yticklabels([]) |
| 250 | + |
| 251 | +    if i == len(countries)-1: |
| 252 | +        ax_objs[-1].set_xlabel("Test Score", fontsize=16,fontweight="bold") |
| 253 | +    else: |
| 254 | +        ax_objs[-1].set_xticklabels([]) |
| 255 | + |
| 256 | +    spines = ["top","right","left","bottom"] |
| 257 | +    for s in spines: |
| 258 | +        ax_objs[-1].spines[s].set_visible(False) |
| 259 | + |
| 260 | +    adj_country = country.replace(" ","\n") |
| 261 | +    ax_objs[-1].text(-0.02,0,adj_country,fontweight="bold",fontsize=14,ha="right") |
| 262 | + |
| 263 | + |
| 264 | +    i += 1 |
| 265 | + |
| 266 | +gs.update(hspace=-0.7) |
| 267 | +
|
| 268 | +fig.text(0.07,0.85,"Distribution of Aptitude Test Results from 18 – 24 year-olds",fontsize=20) |
| 269 | +
|
| 270 | +plt.tight_layout() |
| 271 | +plt.show() |
| 272 | +```` |
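
To get a feel for the `bandwidth` argument used in the snippet above, the `KernelDensity` step can be isolated and run on its own. This is a rough sketch on synthetic data (not the mock dataset); the three bandwidth values and the `densities` and `areas` names are illustrative choices.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# synthetic scores between 0.2 and 0.8, standing in for one country's data
rng = np.random.default_rng(42)
sample = rng.uniform(0.2, 0.8, size=500)
x_d = np.linspace(0, 1, 1000)  # evaluation grid, as in the full snippet

densities = {}
for bandwidth in (0.01, 0.03, 0.1):
    kde = KernelDensity(bandwidth=bandwidth, kernel="gaussian")
    kde.fit(sample[:, None])  # sklearn expects shape (n_samples, n_features)
    # score_samples returns the log of the density, so exponentiate it
    densities[bandwidth] = np.exp(kde.score_samples(x_d[:, None]))

# trapezoidal area under each curve over [0, 1]; a density should integrate to roughly 1
areas = {
    bw: float(np.sum((d[1:] + d[:-1]) / 2 * np.diff(x_d)))
    for bw, d in densities.items()
}
print(areas)
```

Smaller bandwidths give spikier curves and larger ones give smoother curves, which is worth keeping in mind when choosing the y limits and the `hspace` overlap.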
| 273 | + |
| 274 | + |
| 275 | + |
| 276 | +I'll finish this off with a little project to put the above code into practice. The data provided also contains information on whether the test taker was male or female. Using the above code as a template, see how you get on creating something like this: |
| 277 | + |
| 278 | + |
| 279 | + |
| 280 | +For those more ambitious, this could be turned into a split violin plot with males on one side and females on the other. Is there a way to combine the ridge and violin plot? |
| 281 | + |
| 282 | +I'd love to see what people come back with, so if you do create something, send it to me on Twitter [here](http://twitter.com/petermckeever)! |
| 283 | + |