For a machine learning task I am looking for a way to merge two feature matrices, with different dimensions, so that I can feed them both to an estimator. I cannot use the scipy merging methods since these require compatible shapes. I can use the numpy merging methods, but that goes wrong when I actually try to split the array for cross validation. The error looks like this:

Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_state,test_state)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 90, in ridge
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(train, labels, test_size = 0.2, random_state = 42)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1394, in train_test_split
    arrays = check_arrays(*arrays, **options)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 211, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 77946. Expected 2

I found the reason for this error in another Stack Overflow thread, "Concatenate sparse matrices in Python using SciPy/Numpy". Apparently np.vstack/np.hstack do not merge sparse matrices; they return an array holding the two matrix objects, which caused my error.

The shapes I am dealing with:

(77946, 63677)

(77946, 55)

Basically, I am looking for a way to append those 55 extra features per sample from the second matrix to the features in the first matrix.

I also tried to create a numpy array with the appropriate dimensions and simply fill it with the feature matrices, but even creating that matrix gave me a memory error. I tried to convert it to a sparse matrix, but that didn't work either. Perhaps I am doing something wrong there?

new_matrix = sparse.csr_matrix(np.zeros((77946,63727)))
new_matrix[:,0:63676] = big_feature_matrix
new_matrix[:,63677:63727] = small_feature_matrix

Update: I tried out Jaime's solution, but it gave me an error:

Code involved

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def feature_extraction(train, test, train_small, test_small):
    # TF-IDF over word unigrams and bigrams for the main text features
    vectorizer = TfidfVectorizer(min_df=3, strip_accents="unicode", ngram_range=(1, 2))
    # Plain token counts for the small feature set
    cv = CountVectorizer(strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}')

    print("fitting Vectorizer")
    vectorizer.fit(train)
    # Vectorize the small feature sets passed in as arguments
    train_small = cv.fit_transform(train_small)
    test_small = cv.transform(test_small)
    print("transforming text")
    train = vectorizer.transform(train)
    test = vectorizer.transform(test)

    new_train = sparse.hstack((train, train_small),
                                 format='csr')
    new_test = sparse.hstack((test, test_small),
                                 format='csr')

    return new_train, new_test

Full traceback

Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 89, in ridge
    train,test = feature_extraction(train,test,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Preprocessing.py", line 109, in feature_extraction
    format='csr')
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 423, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 523, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions

The train sets have the same dimensions as before. The test sets have fewer samples (42157).

Update

Jaime's solution did actually work; I just messed up when I loaded the files. Thank you for all your help!

  • The answer that @Jaime provided should work. Can you reproduce the error with a small example and show it here? Commented Dec 1, 2013 at 18:37
  • I will, give me a minute. Commented Dec 1, 2013 at 18:38

1 Answer

You can use scipy.sparse.hstack:

new_matrix = scipy.sparse.hstack((big_feature_matrix, small_feature_matrix),
                                 format='csr')
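
As a minimal sketch of the call above, the following uses tiny dense-built matrices as stand-ins for the real feature matrices. The toy shapes (4 rows, 6 and 2 columns) and their contents are assumptions for illustration only. It shows that the result keeps the row count and that the column counts add up, without ever allocating a dense array of the combined size:

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the (77946, 63677) and (77946, 55) matrices in the question
big_feature_matrix = sparse.csr_matrix(np.eye(4, 6))
small_feature_matrix = sparse.csr_matrix(np.ones((4, 2)))

# Column-wise concatenation; both inputs must have the same number of rows
new_matrix = sparse.hstack((big_feature_matrix, small_feature_matrix),
                           format='csr')

print(new_matrix.shape)
```

With the shapes from the question, the same call should give a matrix with 77946 rows and 63677 + 55 = 63732 columns.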

2 Comments

Tried it out but got an error: ValueError: blocks[0,:] has incompatible row dimensions
I have done something stupid... I accidentally loaded the training samples into the test set, which caused the incompatible dimensions. Jaime, your solution works fine now thank you!!
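
The mix-up described in the comments (matrices with different numbers of samples ending up in the same hstack call) can be caught early by comparing row counts before stacking. A minimal sketch, assuming a hypothetical helper `hstack_checked` (not part of SciPy) and toy matrices invented for illustration:

```python
import numpy as np
from scipy import sparse

def hstack_checked(left, right):
    """Hypothetical helper: stack column-wise after checking the row counts."""
    if left.shape[0] != right.shape[0]:
        raise ValueError(
            "row counts differ: %d vs %d" % (left.shape[0], right.shape[0]))
    return sparse.hstack((left, right), format='csr')

# Toy data: 5 samples in the main set, but one small set has only 3 samples
train = sparse.csr_matrix(np.ones((5, 4)))
train_small = sparse.csr_matrix(np.ones((5, 2)))
wrong_small = sparse.csr_matrix(np.ones((3, 2)))

stacked = hstack_checked(train, train_small)  # row counts match

try:
    hstack_checked(train, wrong_small)  # row counts differ
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)

print(stacked.shape)
print(mismatch_message)
```

The explicit check only makes the message name both row counts; scipy.sparse.hstack raises a ValueError of its own for mismatched rows, as the traceback in the question shows.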
