I have a very large, sparse dataset of spam Twitter accounts, and I need to scale the x axis in order to visualize the distribution (histogram, KDE, etc.) and the CDF of the various variables (tweets_count, number of followers/following, etc.).

    > describe(spammers_class1$tweets_count)
      var       n   mean      sd median trimmed mad min    max  range  skew kurtosis   se
    1   1 1076817 443.47 3729.05     35   57.29  43   0 669873 669873 53.23  5974.73 3.59

In this dataset, the value 0 is very important (in fact, 0 should have the highest density). However, on a logarithmic scale these values are dropped. I thought of changing them to 0.1, for example, but it makes no sense to say that there are spam accounts with 10^-1 followers.

So, what would be a workaround in Python and matplotlib?

2 Answers

Add 1 to each x value, then take the log:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

fig, ax = plt.subplots()
x = [0, 10, 100, 1000]
y = [100, 20, 10, 50]
# Shift x by 1 so that x=0 maps to log(1)=0 instead of log(0)=-inf.
x = np.asarray(x) + 1
y = np.asarray(y)
ax.plot(x, y)
ax.set_xscale('log')
ax.set_xlim(x.min(), x.max())
# Label each tick with the original (unshifted) value.
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x-1)))
# Place major ticks exactly at the shifted data values.
ax.xaxis.set_major_locator(ticker.FixedLocator(x))
plt.show()


Use

ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{0:g}'.format(x-1)))
ax.xaxis.set_major_locator(ticker.FixedLocator(x))

to relabel the tick marks according to the non-log values of x.

(My original suggestion was to use plt.xticks(x, x-1), but this would affect all axes. To isolate the changes to one particular axes, I changed all the calls to go through ax rather than plt.)
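Since the question is about histograms, the same add-one shift can be applied before binning. The following is only a minimal sketch, assuming NumPy's np.histogram with log-spaced bin edges; the counts array is made-up illustration data, not the asker's dataset, and the choice of four bins is arbitrary.

```python
import numpy as np

# Hypothetical counts (e.g. followers per account), including zeros.
counts = np.array([0, 0, 0, 1, 2, 9, 10, 99, 1000])

# Shift by 1 so that zero counts land at 1, i.e. at log10(1) = 0.
shifted = counts + 1

# Log-spaced bin edges from 1 up to just past the largest shifted value.
edges = np.logspace(0, np.log10(shifted.max() + 1), num=5)

# Bin the shifted data; every value, zeros included, falls into some bin.
hist, _ = np.histogram(shifted, bins=edges)

# Report each bin in terms of the original (unshifted) values.
for lo, hi, n in zip(edges[:-1] - 1, edges[1:] - 1, hist):
    print('[{0:g}, {1:g}): {2}'.format(lo, hi, n))
```

With bins this coarse the zeros share the first bin with other small values; finer bins near the left edge would give them a bin of their own.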


matplotlib removes points which contain a NaN, inf or -inf value. Since log(0) is -inf, the point corresponding to x=0 would be removed from a log plot.

If you increase all the x-values by 1, then since log(1) = 0, the point corresponding to x=0 will now be plotted at log(1)=0 on the log plot.

The remaining x-values will also be shifted by one, but it will not matter to the eye since log(x+1) is very close to log(x) for large values of x.
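How small the shift becomes can be checked numerically. This is a rough sketch using only Python's math module; log_shift is a hypothetical helper name introduced here for illustration.

```python
import math

def log_shift(x):
    """Return log10(x + 1) - log10(x), the displacement on a log axis (x > 0)."""
    return math.log10(x + 1) - math.log10(x)

# The displacement shrinks quickly as x grows.
for x in (1, 10, 100, 1000):
    print(x, round(log_shift(x), 4))
```

The displacement is noticeable only for the smallest values: at x=1 the point moves by log10(2), about 0.3 of a decade, while at x=1000 it moves by less than a thousandth of a decade.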

10 Comments

Yes, but then I will not be able to say in my paper that 50% of spammers have 0 followers, because they will be shown at 10^0, which would mean that they have one follower (which is different).
You could relabel the tick marks with plt.xticks. I've edited the post to show how.
To avoid shifting all of the data: how can I efficiently change only the 0 values to 0.1, so that they show up at 10^-1, and then relabel the ticks? I know this is another question, but it might be a better approach, since it shifts only the 0 values instead of contaminating all of the data (and looping over large numpy arrays is very slow).
If you have an array with many 0 values, you can change them to 0.1 with x[x<=0] = 0.1. Note that if the array is of dtype int, then you must first convert the array to dtype float: x = x.astype('float').
I protest in the strongest terms to modifying data before plotting it.
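The zero-replacement suggested in the comments above could look roughly like this. It is a sketch on made-up data (the counts array is hypothetical); because astype returns a new array, the original integer data is left untouched and only the plotting copy is changed.

```python
import numpy as np

# Hypothetical integer counts containing zeros.
counts = np.array([0, 3, 0, 12, 250])

# Convert to float first, as noted above; astype returns a copy.
x = counts.astype('float')

# Vectorized replacement, no Python loop: zeros move to 0.1 (10^-1).
x[x <= 0] = 0.1

print(x)       # plotting copy with zeros replaced
print(counts)  # original data is unchanged
```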
ax1.set_xlim(0, 1e3)

Here is the example from matplotlib documentation.

And there it sets the limit values of the axes this way:

ax1.set_xlim(1e1, 1e3)
ax1.set_ylim(1e2, 1e3)

2 Comments

This doesn't show how to deal with zero values on a logarithmic scale. log(0) is undefined, so matplotlib will ignore these values. Setting the xlim to 1e1 will make the x axis start from 10 and would still ignore 0 (I believe). I'll try it out anyway.
At least as of July 2015, matplotlib is not ignoring zeros: it draws a straight line on the log plot all the way to the edge of the plot, which looks terrible and doesn't match MATLAB. hayer's comment doesn't seem true to me.
