Numpy hstack - "ValueError: all the input arrays must have same number of dimensions" - but they do

Question

I am trying to join two numpy arrays. In one I have a set of columns/features after running TF-IDF on a single column of text. In the other I have one column/feature which is an integer. So I read in a column of train and test data, run TF-IDF on this, and then I want to add another integer column because I think this will help my classifier learn more accurately how it should behave.

Unfortunately, I get the error in the title when I try to run hstack to add this single column to my other numpy array.

Here is my code :

  #reading in test/train data for TF-IDF
  traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
  testdata = list(np.array(p.read_csv('FinalTestCSVFin.csv', delimiter=";"))[:,2])

  #reading in labels for training
  y = np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,-2]

  #reading in single integer column to join
  AlexaTrainData = p.read_csv('FinalCSVFin.csv', delimiter=";")[["alexarank"]]
  AlexaTestData = p.read_csv('FinalTestCSVFin.csv', delimiter=";")[["alexarank"]]
  AllAlexaAndGoogleInfo = AlexaTestData.append(AlexaTrainData)

  tfv = TfidfVectorizer(min_df=3,  max_features=None, strip_accents='unicode',  
        analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1) #tf-idf object
  rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001, 
                             C=1, fit_intercept=True, intercept_scaling=1.0, 
                             class_weight=None, random_state=None) #Classifier
  X_all = traindata + testdata #adding test and train data to put into tf-idf
  lentrain = len(traindata) #find length of train data
  tfv.fit(X_all) #fit tf-idf on all our text
  X_all = tfv.transform(X_all) #transform it
  X = X_all[:lentrain] #reduce to size of training set
  AllAlexaAndGoogleInfo = AllAlexaAndGoogleInfo[:lentrain] #reduce to size of training set
  X_test = X_all[lentrain:] #keep the remaining rows as the test set

  #printing debug info, output below : 
  print "X.shape => " + str(X.shape)
  print "AllAlexaAndGoogleInfo.shape => " + str(AllAlexaAndGoogleInfo.shape)
  print "X_all.shape => " + str(X_all.shape)

  #line we get error on
  X = np.hstack((X, AllAlexaAndGoogleInfo))

Below is the output and error message :

X.shape => (7395, 238377)
AllAlexaAndGoogleInfo.shape => (7395, 1)
X_all.shape => (10566, 238377)



---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-2b310887b5e4> in <module>()
     31 print "X_all.shape => " + str(X_all.shape)
     32 #X = np.column_stack((X, AllAlexaAndGoogleInfo))
---> 33 X = np.hstack((X, AllAlexaAndGoogleInfo))
     34 sc = preprocessing.StandardScaler().fit(X)
     35 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\core\shape_base.pyc in hstack(tup)
    271     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
    272     if arrs[0].ndim == 1:
--> 273         return _nx.concatenate(arrs, 0)
    274     else:
    275         return _nx.concatenate(arrs, 1)

ValueError: all the input arrays must have same number of dimensions

What is causing my problem here? How can I fix this? As far as I can see I should be able to join these columns? What have I misunderstood?

Thank you.

Edit :

Using the method in the answer below gets the following error :

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-16-640ef6dd335d> in <module>()
---> 36 X = np.column_stack((X, AllAlexaAndGoogleInfo))
     37 sc = preprocessing.StandardScaler().fit(X)
     38 X = sc.transform(X)

C:\Users\Simon\Anaconda\lib\site-packages\numpy\lib\shape_base.pyc in column_stack(tup)
    294             arr = array(arr,copy=False,subok=True,ndmin=2).T
    295         arrays.append(arr)
--> 296     return _nx.concatenate(arrays,1)
    297 
    298 def dstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Interestingly, I tried to print the dtype of X and this worked fine :

X.dtype => float64

However, trying to print the dtype of AllAlexaAndGoogleInfo like so :

print "AllAlexaAndGoogleInfo.dtype => " + str(AllAlexaAndGoogleInfo.dtype)

produces :

'DataFrame' object has no attribute 'dtype'

Does AllAlexaAndGoogleInfo.append(X) work? My guess is that if you want to combine a DataFrame object with a numpy.ndarray, you need to use methods provided by Pandas, or convert the DataFrame to a plain numpy array.
– hpaulj, Commented Mar 7, 2014 at 22:04

YS-L · Accepted Answer · 2014-03-08 02:46:12Z

23

X is a sparse matrix (the output of TfidfVectorizer.transform), so use scipy.sparse.hstack instead of numpy.hstack to join the arrays. In my opinion the error message is rather misleading here.

This minimal example illustrates the situation:

import numpy as np
from scipy import sparse

X = sparse.rand(10, 10000)
xt = np.random.random((10, 1))
print 'X shape:', X.shape
print 'xt shape:', xt.shape
print 'Stacked shape:', np.hstack((X,xt)).shape
#print 'Stacked shape:', sparse.hstack((X,xt)).shape #This works

Based on the following output

X shape: (10, 10000)
xt shape: (10, 1)

one might expect the np.hstack call on the last line of the snippet to work, but it throws this error:

ValueError: all the input arrays must have same number of dimensions

So, use scipy.sparse.hstack when you have a sparse array to stack.
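As a minimal sketch of the working variant (Python 3 syntax, with random stand-in data rather than the question's real TF-IDF matrix), the same stacking done with scipy.sparse.hstack might look like this:

```python
import numpy as np
from scipy import sparse

# Stand-ins for the question's data: a sparse TF-IDF-like matrix and one dense column.
X = sparse.rand(10, 10000, format="csr")
xt = np.random.random((10, 1))

# scipy.sparse.hstack accepts a mix of sparse matrices and dense 2-D arrays.
stacked = sparse.hstack((X, xt)).tocsr()

print("Stacked shape:", stacked.shape)
print("Still sparse:", sparse.issparse(stacked))
```

The result stays sparse, which matters for a matrix with hundreds of thousands of columns; the .tocsr() call is just one reasonable choice of format for later row slicing.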

In fact, I already answered this in a comment on another of your questions, where you mentioned that a different error message then pops up:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

First of all, AllAlexaAndGoogleInfo does not have a dtype because it is a DataFrame. To get its underlying numpy array, simply use AllAlexaAndGoogleInfo.values, then check that array's dtype. Based on the error message, it has a dtype of object, which means that it might contain non-numerical elements such as strings.
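To make the dtype check concrete, here is a small sketch in Python 3. The two DataFrames and the "N/A" entry are made up for illustration; only the column name alexarank comes from the question.

```python
import pandas as pd

# Hypothetical data: a single bad entry turns the whole column into dtype object.
clean = pd.DataFrame({"alexarank": [10, 250, 3]})
dirty = pd.DataFrame({"alexarank": [10, "N/A", 3]})

print(clean.dtypes)          # per-column dtypes of the DataFrame
print(clean.values.dtype)    # dtype of the underlying numpy array
print(dirty.values.dtype)    # dtype of the array holding the mixed column
```

A DataFrame exposes dtypes (plural, one per column), while the array returned by .values has a single dtype, which is the one the stacking functions see.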

This is a minimal example that reproduces this situation:

X = sparse.rand(100, 10000)
xt = np.random.random((100, 1))
xt = xt.astype('object') # Comment this to fix the error
print 'X:', X.shape, X.dtype
print 'xt:', xt.shape, xt.dtype
print 'Stacked shape:', sparse.hstack((X,xt)).shape

The error message:

TypeError: no supported conversion for types: (dtype('float64'), dtype('O'))

So, check whether there are any non-numerical values in AllAlexaAndGoogleInfo and fix them before doing the stacking.
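One possible way to do that repair, sketched in Python 3 under the assumption that unparseable entries can simply be replaced by a placeholder. The sample data and the choice of 0 as the fill value are illustrative, not taken from the question:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical column with one non-numeric entry.
info = pd.DataFrame({"alexarank": [10, "N/A", 3]})

# Coerce unparseable values to NaN, then fill them with a placeholder (0 here).
ranks = pd.to_numeric(info["alexarank"], errors="coerce").fillna(0)
column = ranks.values.astype(np.float64).reshape(-1, 1)

# Stand-in for the sparse TF-IDF matrix; stacking now succeeds.
X = sparse.rand(3, 50, format="csr")
stacked = sparse.hstack((X, column)).tocsr()
print(stacked.shape, stacked.dtype)
```

Whether a placeholder, a dropped row, or some other imputation is appropriate depends on what the missing ranks mean for the classifier.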

answered Mar 8, 2014 at 2:46

YS-L


2 Comments

hpaulj Over a year ago

The np.hstack is able to coerce types, e.g. np.hstack((X.A,xt)) works, producing an array with dtype object. sparse.hstack also works if you explicitly cast the arrays, e.g. sparse.hstack((X.astype(object), xt))

Wei Over a year ago

OMFG I was pulling my hair and banging my head before I saw this. Thank you!

Drewness · Answer · 2014-03-07 18:37:35Z

16

Use np.column_stack, like so:

X = np.column_stack((X, AllAlexaAndGoogleInfo))

From the docs:

Take a sequence of 1-D arrays and stack them as columns to make a single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
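For reference, a small Python 3 sketch of the documented behaviour on ordinary dense arrays (the arrays are made up for illustration). As the edit to the question shows, it does not help when X is a sparse matrix.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
m = np.arange(6).reshape(3, 2)

cols = np.column_stack((a, b))   # two 1-D arrays become a (3, 2) array
wide = np.column_stack((m, a))   # a 2-D array plus a 1-D array gives (3, 3)
print(cols.shape, wide.shape)
```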

answered Mar 7, 2014 at 18:37

Drewness


5 Comments

Simon Kiely Over a year ago

Thanks for the response. I have updated my question above to show the error message this produces.

Drewness Over a year ago

What is the output of the 3 str(*.shape) lines?

Simon Kiely Over a year ago

It's in the question I posted above if the markup on this gets annoying, but the output is : X.shape => (7395, 238377), AllAlexaAndGoogleInfo.shape => (7395, 1) and X_all.shape => (10566, 238377). Thanks :)

Drewness Over a year ago

Try X.resize(AllAlexaAndGoogleInfo.shape) then X = np.hstack((X, AllAlexaAndGoogleInfo)).

Simon Kiely Over a year ago

This line then throws the error AttributeError: resize not found. Thank you for the idea though! :)

hpaulj · Answer · 2014-03-07 22:12:06Z

1

Try:

X = np.hstack((X, AllAlexaAndGoogleInfo.values))

I don't have a running Pandas module, so I can't test it. But the DataFrame documentation describes values as the "Numpy representation of NDFrame". np.hstack is a numpy function, and as such knows nothing about the internal structure of the DataFrame.
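A short Python 3 sketch of the idea for the case where the other operand is an ordinary dense ndarray (both inputs here are made-up stand-ins). In the question X is sparse, so this alone is not enough there:

```python
import numpy as np
import pandas as pd

dense = np.random.random((4, 3))                       # an ordinary 2-D ndarray
info = pd.DataFrame({"alexarank": [10, 250, 3, 77]})   # hypothetical single column

joined = np.hstack((dense, info.values))               # both inputs are 2-D ndarrays
print(type(info.values), joined.shape)
```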

answered Mar 7, 2014 at 22:12

hpaulj


3 Comments

Simon Kiely Over a year ago

Thanks for the response, unfortunately this also falls victim to : ValueError: all the input arrays must have same number of dimensions :(

Simon Kiely Over a year ago

I tried to print this; but I believe pandas reads it in as a DataFrame object so it throws the error ` 'DataFrame' object has no attribute 'dtype'`. I am not sure how to get around this issue. Thanks a lot for your help :)

hpaulj Over a year ago

But isn't there a way of getting an ndarray representation of that DataFrame? From the documentation it looked like values would do that. There's also an 'as_matrix' method. What about ftypes? I also see dtypes in the documentation. A DataFrame may contain an ndarray, but it is not itself an ndarray.
