Suppose I have a 2-D NumPy array that represents the learned weights of a PyTorch linear layer. Below, I create an example array filled with Gaussian random numbers.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = np.random.normal(size=(4, 768))
print(data.shape) # (4, 768)

I then try to use the Matplotlib function hist() to create a histogram of the values. I'm using Jupyter Notebook (Google Colab). When I call the function like below (by passing in the original 2-D array), it takes a long time to complete, and the visual output is quite bizarre.

%%time
_ = plt.hist(data, bins=100)

# Result:
# CPU times: user 48.5 s, sys: 737 ms, total: 49.2 s
# Wall time: 49.2 s

[Image: histogram produced from the 2-D array]

On the other hand, when I reshape the 2-D array into a 1-D array with reshape(), the hist() function completes almost immediately, and the visualization has the shape of what I would expect, namely a Gaussian curve.

data = data.reshape(-1)
print(data.shape) # (3072,)

%%time
_ = plt.hist(data, bins=100)

# Result:
# CPU times: user 70.7 ms, sys: 2.01 ms, total: 72.7 ms
# Wall time: 70.9 ms

[Image: histogram produced from the flattened 1-D array]

So what exactly is going on with my first attempt where I pass in a 2-D array? Why does it take so long? What does the visualized graph represent?

Thanks for any help.

  • You get 768 histograms, each with your 4 values distributed across 100 bins. Commented Dec 29, 2020 at 3:22

1 Answer

I was rather surprised that Matplotlib, unlike NumPy's np.histogram(), does not flatten the input array first. However, the Matplotlib documentation states that the input x can be an (n,) array or a sequence of (n,) arrays. For a 2-D array, Matplotlib treats each column as a separate dataset, so your (4, 768) input is interpreted as 768 arrays of shape (4,), which are drawn as 768 histograms in one graph.

You don't see much because the bars are very thin: with 768 datasets and 100 bins, 76,800 bars have to be drawn, which is most likely also why the call takes so long. Increasing the figure size and resolution will probably improve the picture. The opposite case, data = np.random.normal(size=(768, 4)), makes this visible because now only 400 bars have to be displayed:

[Image: histogram for data of shape (768, 4)]

But we can also have a look at what matplotlib returns:

hist_count, hist_bins, hist_bars = plt.hist(data, bins=100)
print(hist_count.shape) #(768, 100)
print(hist_bars) #<a list of 768 BarContainer objects>
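
To make the column-wise interpretation concrete, here is a minimal sketch that compares each row of the counts returned by plt.hist() with np.histogram() applied to the corresponding column. It assumes a smaller, hypothetical (4, 20) array and the non-interactive Agg backend so that it runs quickly outside a notebook; the names small and per_column are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
small = rng.normal(size=(4, 20))  # hypothetical smaller stand-in for the (4, 768) array

# A 2-D input is split into one dataset per column
counts, edges, bars = plt.hist(small, bins=10)

# Histogram each column separately, reusing the shared bin edges from plt.hist()
per_column = np.array([
    np.histogram(small[:, j], bins=edges)[0]
    for j in range(small.shape[1])
])

print(counts.shape)                        # one row of counts per column
print(np.array_equal(counts, per_column))  # do the per-column counts agree?
print(counts.sum(axis=1))                  # number of values in each dataset
plt.close("all")
```

If the comparison holds, every row of counts sums to 4, which matches the idea that each "histogram" only ever sees the 4 values of one column.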

Or for an even simpler version:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(123)
data = np.random.normal(size=(4, 5))
print(data.shape) #(4, 5)

hist_count, hist_bins, hist_bars = plt.hist(data, bins=6)

print(hist_count.shape) #(5, 6)
print(hist_count)
#[[0. 1. 2. 0. 0. 1.]
# [1. 0. 0. 1. 1. 1.]
# [0. 0. 1. 1. 0. 2.]
# [0. 1. 1. 0. 2. 0.]
# [0. 0. 3. 1. 0. 0.]]
print(hist_bins) #[-2.42667924 -1.65457769 -0.88247613 -0.11037458  0.66172697  1.43382853  2.20593008]
print(hist_bars) #<a list of 5 BarContainer objects>
plt.show()

[Image: histogram for the (4, 5) example]
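
If the goal is a single histogram over all weights, flattening the array before the call avoids the per-column behaviour. The sketch below assumes the same (4, 768) shape as in the question and checks the result against np.histogram(), which flattens its input on its own; the names weights, flat_counts and np_counts are illustrative, and the Agg backend is only chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 768))  # stand-in for the layer weights in the question

# ravel() yields a 1-D array, so plt.hist() sees a single dataset of 3072 values
flat_counts, flat_edges, _ = plt.hist(weights.ravel(), bins=100)

# np.histogram() flattens a 2-D input itself
np_counts, np_edges = np.histogram(weights, bins=100)

print(flat_counts.shape)                       # one count per bin
print(np.array_equal(flat_counts, np_counts))  # do both approaches agree?
plt.close("all")
```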


2 Comments

Thank you for the explanation. The documentation doesn't give much information on how plt.hist() treats a 2-D array. It's not intuitive to me that it would produce 768 arrays of shape (4,) that are displayed as in your output as 768 histograms in one graph.
I don't know the inner workings of Matplotlib, but maybe this is a side effect (if it is not intended) of hist() and hist2d() using similar plotting routines? That's just a guess. If somebody with more insight posts an answer, do not hesitate to accept the better one.
