I'm running into an issue with the pearsonr function from SciPy. I tried to keep the code as simple as possible (note the gorgeous N^2 loop), but I'm still running up against this problem, and I don't entirely understand where I'm going wrong. My arrays are getting selected correctly and have the same dimensionality.

The code I run is:

import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelBinarizer, Binarizer
from sklearn.feature_extraction.text import CountVectorizer

ny_cluster = LabelBinarizer().fit_transform(ny_raw.clusterid.values)
ny_vocab = Binarizer().fit_transform(CountVectorizer().fit_transform(ny_raw.text.values))

ny_vc_phi = np.zeros((ny_vocab.shape[1], ny_cluster.shape[1]))
for i in xrange(ny_vc_phi.shape[0]):
    for j in xrange(ny_vc_phi.shape[1]):
        ny_vc_phi[i,j] = stats.pearsonr(ny_vocab[:,i].todense(), ny_cluster[:,j])[0]

Which produces the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/data/TweetClusters/TweetsLocationBayesClf/<ipython-input-29-ff1c3ac4156d> in <module>()
      3 for i in xrange(ny_vc_phi.shape[0]):
      4     for j in xrange(ny_vc_phi.shape[1]):
----> 5         ny_vc_phi[i,j] = stats.pearsonr(ny_vocab[:,i].todense(), ny_cluster[:,j])[0]
      6 

/usr/lib/python2.7/dist-packages/scipy/stats/stats.pyc in pearsonr(x, y)
   2201     # Presumably, if abs(r) > 1, then it is only some small artifact of floating
   2202     # point arithmetic.
-> 2203     r = max(min(r, 1.0), -1.0)
   2204     df = n-2
   2205     if abs(r) == 1.0:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I really don't understand where this is going wrong. It doesn't help that I don't know how the r variable is calculated. Could I somehow be messing up my inputs?

1 Answer

Check that the arguments to pearsonr are one-dimensional arrays. That is, both ny_vocab[:,i].todense() and ny_cluster[:,j] should be 1-d. Try:

    ny_vc_phi[i,j] = stats.pearsonr(ny_vocab[:,i].todense().ravel(), ny_cluster[:,j].ravel())[0]

(I added a call to ravel() to each of the arguments of pearsonr.)
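
One caveat worth checking: todense() on a SciPy sparse matrix typically returns a numpy.matrix, and ravel() on a numpy.matrix still gives a 2-D result of shape (1, N). A minimal sketch of the difference, using a hand-built np.matrix as an assumed stand-in for ny_vocab[:,i].todense() (the values are made up for illustration):

```python
import numpy as np

# Assumed stand-in for ny_vocab[:, i].todense(): an (n, 1) np.matrix column.
col_matrix = np.matrix([[1], [0], [1], [1]])

# ravel() on an np.matrix keeps it 2-D: the result has shape (1, 4).
raveled = col_matrix.ravel()

# Casting to a plain ndarray first lets ravel() or squeeze() drop the extra axis.
flat = np.asarray(col_matrix).ravel()
squeezed = np.squeeze(np.asarray(col_matrix))

print(raveled.shape, flat.shape, squeezed.shape)
```

If the first shape printed is still two-dimensional, casting with np.asarray before flattening is probably what is needed.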

2 Comments

This was really close to my solution, so I'm marking it as answered. I was using a sparse matrix, so an array cast needed to be included. For anyone wondering, the line verbatim was ny_vc_phi[i,j] = stats.pearsonr(np.squeeze(np.asarray(ny_vocab[:,i].todense())), ny_cluster[:,j])[0]
I was about to edit the question and suggest squeeze as an alternative, if the problem was simply a "trivial" dimension, but you beat me to it. :-)
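
For reference, here is a small end-to-end sketch of the loop with the cast from the comment above applied. The toy matrices vocab and cluster are assumptions standing in for ny_vocab and ny_cluster, and the code is written for Python 3, so range replaces xrange:

```python
import numpy as np
from scipy import sparse, stats

# Toy stand-ins (assumptions): 5 documents, 2 vocabulary terms, 2 clusters.
vocab = sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]))
cluster = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0]])

phi = np.zeros((vocab.shape[1], cluster.shape[1]))
for i in range(phi.shape[0]):
    # Cast the sparse column to a flat 1-D ndarray once per term.
    x = np.squeeze(np.asarray(vocab[:, i].todense()))
    for j in range(phi.shape[1]):
        # Both arguments are now 1-D, so pearsonr yields a scalar coefficient.
        phi[i, j] = stats.pearsonr(x, cluster[:, j])[0]

print(phi)
```

Hoisting the cast out of the inner loop avoids densifying the same column once per cluster; the correlations themselves are unchanged.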
