
I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.

Unfortunately, I am getting the error in the title when I try and run hstack to add this single column to my other numpy array.

Here is my code :

  #reading in test/train data for TF-IDF
  traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
  testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

  #reading in labels for training
  y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

  #reading in single integer column to join
  AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
  AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
  AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #Classifier
  X_all = traindata + testdata #adding test and train data to put into tf-idf
  lentrain = len(traindata) #find length of train data
  tfv.fit(X_all) #fit tf-idf on all our text
  X_all = tfv.transform(X_all) #transform it
  X = X_all[:lentrain] #reduce to size of training set
  AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
  X_test = X_all[lentrain:] #the remaining rows are the test set

  #printing debug info, output below : 
  print "X.shape => " + str(X.shape)
  print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
  print "X_all.shape => " + str(X_all.shape)

  #line we get error on
  X = np.hstack((X, AllAlexaAndGoogleInfo))

Below is the output and error message :

X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
     31 print "X_all.shape => " + str(X_all.shape)
     32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
     34 sc = preprocessing.StandardScaler().fit(X)
     35 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
    271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    272     if arrs[0].ndim == 1:
--> 273         return _nx.concatenate(arrs, 0)
    274     else:
    275         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?

Thank you.

Edit :

Using the method in the answer below gets the following error :

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
     37 sc = preprocessing.StandardScaler().fit(X)
     38 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
    294             arr = array(arr,copy=False,subok=True,ndmin=2).T
    295         arrays.append(arr)
--> 296     return _nx.concatenate(arrays,1)
    297 
    298 def dstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, I tried to print the dtype of X and this worked fine :

X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo like so :

print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype) 

produces :

'DataFrame' object has no attribute 'dtype'
  • Does AllAlexaAndGoogleInfo.append(X) work? My guess is that if you want to combine a DataFrame object with a numpy.ndarray, you need to use the methods provided by Pandas, or convert the DataFrame to a plain numpy array. Commented Mar 7, 2014 at 22:04

3 Answers

23

Because X is a sparse matrix, use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion, the error message is rather misleading here.

This minimal example illustrates the situation:

import numpy as np
from scipy import sparse

X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works

Based on the following output

X shape: (10, 10000)
xt shape: (10, 1)

one might expect the np.hstack call on the last line to work, but it actually raises this error:

ValueError: all the input arrays must have same number of dimensions

So, use scipy.sparse.hstack when you have a sparse array to stack.
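As a rough sketch of the working variant, the snippet below stacks a small sparse matrix with one extra column. The names and sizes are made up for illustration, and the dense column is wrapped in a sparse matrix first only to keep the types explicit. It is written for Python 3, unlike the Python 2 snippets above.

```python
import numpy as np
from scipy import sparse

# Stand-in for the TF-IDF matrix: a small sparse matrix with 10 rows.
X = sparse.csr_matrix(np.eye(10, 50))
# Stand-in for the extra feature: one dense value per row, shaped as a column.
xt = np.arange(10, dtype=np.float64).reshape(-1, 1)

# scipy.sparse.hstack joins the blocks column-wise and keeps the result sparse.
# format="csr" asks for a result that still supports row slicing.
stacked = sparse.hstack((X, sparse.csr_matrix(xt)), format="csr")

print(stacked.shape)
```

The result stays sparse, which matters for a matrix as wide as the one in the question; converting it to a dense array just to use np.hstack would need far more memory.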


In fact, I answered this in a comment on another of your questions, where you mentioned that a different error message then pops up:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

First of all, AllAlexaAndGoogleInfo does not have a dtype because it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values, then check that array's dtype. Based on the error message, it has a dtype of object, which means that it might contain non-numerical elements such as strings.

This is a minimal example that reproduces this situation:

X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape

The error message:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So, check whether there are any non-numerical values in AllAlexaAndGoogleInfo and repair them before doing the stacking.
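One possible way to do that repair is sketched below, assuming the bad entries are strings that cannot be parsed as numbers. The sample values and the choice of 0.0 as a placeholder are hypothetical; the right replacement depends on your data.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the alexarank column, with one unparseable entry.
ranks = pd.DataFrame({"alexarank": ["12", "345", "unknown", "7"]})
print(ranks["alexarank"].dtype)  # not a numeric dtype

# Coerce to numbers; entries that cannot be parsed become NaN,
# which are then replaced by a placeholder value (an assumption here).
numeric = pd.to_numeric(ranks["alexarank"], errors="coerce").fillna(0.0)
column = numeric.to_numpy(dtype=np.float64).reshape(-1, 1)

# Stand-in for the TF-IDF matrix, with the same number of rows.
X = sparse.csr_matrix(np.eye(4, 6))
stacked = sparse.hstack((X, sparse.csr_matrix(column)), format="csr")
print(stacked.shape, stacked.dtype)
```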


2 Comments

The np.hstack is able to coerce types, e.g. np.hstack((X.A,xt)) works, producing an array with dtype object. sparse.hstack also works if you explicitly cast the arrays, e.g. sparse.hstack((X.astype(object), xt))
OMFG I was pulling my hair and banging my head before I saw this. Thank you!
16

Use np.column_stack, like so:

X = np.column_stack((X, AllAlexaAndGoogleInfo))

From the docs:

Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
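A small sketch of what that description means for dense arrays, using made-up data. The last two lines hint at why the same call does not help with a SciPy sparse matrix: as far as I can tell, NumPy does not read it as a 2-D numeric array but wraps it as a single opaque object.

```python
import numpy as np
from scipy import sparse

dense = np.ones((3, 4))
extra = np.array([10.0, 20.0, 30.0])  # 1-D, one value per row

# With dense inputs, the 1-D array becomes a new last column.
combined = np.column_stack((dense, extra))
print(combined.shape)  # (3, 5)

# A SciPy sparse matrix is opaque to NumPy: converting it yields an
# object array instead of a 3x4 array of floats.
wrapped = np.asarray(sparse.csr_matrix(dense))
print(wrapped.dtype, wrapped.shape)
```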

5 Comments

Thanks for the response. I have updated my question above to show the error message this produces.
What is the output of the 3 str(*.shape) lines?
It's in the question I posted above if the markup on this gets annoying, but the output is : X.shape => (7395, 238377), AllAlexaAndGoogleInfo.shape => (7395, 1) and X_all.shape => (10566, 238377). Thanks :)
Try X.resize(AllAlexaAndGoogleInfo.shape) then X = np.hstack((X, AllAlexaAndGoogleInfo)).
This line then throws the error AttributeError: resize not found. Thank you for the idea though! :)
1

Try:

X = np.hstack((X, AllAlexaAndGoogleInfo.values))

I don't have a working Pandas installation, so I can't test it. But the DataFrame documentation describes values as the "Numpy representation of NDFrame". np.hstack is a numpy function, and as such knows nothing about the internal structure of the DataFrame.
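For what it's worth, here is a minimal sketch of what .values returns, using a hypothetical three-row frame and a dense stand-in for the feature matrix. It only covers the dense case; a sparse TF-IDF matrix would still need scipy.sparse.hstack.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the single-column alexarank data.
frame = pd.DataFrame({"alexarank": [5, 42, 1000]})
# Dense stand-in for the other features (not the sparse TF-IDF output).
dense_features = np.zeros((3, 2))

# Selecting with a list of columns keeps the result 2-D: shape (3, 1).
column = frame[["alexarank"]].values
print(type(column).__name__, column.shape, column.dtype)

# Both inputs are now 2-D ndarrays with matching row counts.
combined = np.hstack((dense_features, column))
print(combined.shape)  # (3, 3)
```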

3 Comments

Thanks for the response, unfortunately this also falls victim to : ValueError: all the input arrays must have same number of dimensions :(
I tried to print this, but I believe pandas reads it in as a DataFrame object, so it throws the error 'DataFrame' object has no attribute 'dtype'. I am not sure how to get around this issue. Thanks a lot for your help :)
But isn't there a way of getting an ndarray representation of that DataFrame? From the documentation it looked like values would do that. There's also an 'as_matrix' method. What about ftypes? I also see dtypes in the documentation. A DataFrame may contain an ndarray, but it is not itself an ndarray.
