My question is based on this question.

I have Twitter data from which I extracted unigram features and counts of orthographic features: exclamation marks, question marks, uppercase letters, and lowercase letters. I want to stack the orthographic features onto the transformed unigram features. Here is my code:

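import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    tweet_df[['tweets', 'exclamation', 'question', 'uppercase', 'lowercase']],
    tweet_df['class'], stratify=tweet_df['class'], test_size=0.2, random_state=0)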

count_vect = CountVectorizer(ngram_range=(1,1))
X_train_gram = count_vect.fit_transform(X_train['tweets'])

tfidf = TfidfTransformer()
X_train_gram = tfidf.fit_transform(X_train_gram)

X_train_gram = hstack((X_train_gram,np.array(X_train['exclamation'])[:,None]))

This works, but I can't find a way to add the remaining columns (question, uppercase, lowercase) to the stack in one line of code. Here are my failed attempts, each with the error it raised:

X_train_gram = hstack((X_train_gram,np.array(list(X_train['exclamation'], X_train['question'], X_train['uppercase'], X_train['lowercase']))[:,None])) #list expected at most 1 arguments, got 4

X_train_gram = hstack((X_train_gram,np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']])[:,None])) #expected dimension <= 2 array or matrix

X_train_gram = hstack((X_train_gram,np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)[:,None])) #expected dimension <= 2 array or matrix

Any help appreciated.

1 Answer

You have problems with list syntax and sparse.coo_matrix creation.

np.array(X_train['exclamation'])[:,None]

Converting a Series to an array gives a 1d array; indexing with None adds a trailing axis, so the shape becomes (n,1).
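A quick way to check those shapes, as a minimal sketch. The three-element Series here is a made-up stand-in for X_train['exclamation'], not data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train['exclamation'] (one count per tweet)
s = pd.Series([0, 2, 1])

a = np.array(s)      # 1d array, shape (n,)
col = a[:, None]     # None adds a new trailing axis, shape (n, 1)

print(a.shape, col.shape)
```

A single (n,1) column is 2d, which is why the one-column version stacks without complaint.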

np.array(list(X_train['exclamation'], X_train['question'], X_train['uppercase'], X_train['lowercase']))[:,None]

That's not a valid call to list(), which takes at most one (iterable) argument:

In [327]: list(1,2,3,4)                                                         
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-327-e06d60ac583e> in <module>
----> 1 list(1,2,3,4)

TypeError: list() takes at most 1 argument (4 given)
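If you do want to build the array from individual columns, a list literal (square brackets) is probably what that list(...) call was meant to be. A minimal sketch, assuming a small made-up X_train; note that the result has one row per column, so it needs a transpose before stacking:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the orthographic columns of X_train
X_train = pd.DataFrame({
    'exclamation': [0, 2, 1],
    'question':    [1, 0, 0],
    'uppercase':   [3, 5, 0],
    'lowercase':   [20, 14, 31],
})

# A list literal of four Series gives a (4, n) array: one ROW per column
stacked = np.array([X_train['exclamation'], X_train['question'],
                    X_train['uppercase'], X_train['lowercase']])

# Transpose to get one row per tweet, shape (n, 4)
features = stacked.T

print(stacked.shape, features.shape)
```

Selecting the columns from the DataFrame directly gives the same (n,4) array without the transpose.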

Next:

np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']])[:,None]

Selecting multiple columns gives a DataFrame, which converts to a 2d array; adding the None index then produces a 3d array:

In [328]: np.ones((2,3))[:,None].shape                                          
Out[328]: (2, 1, 3)

You can't make a coo matrix from a 3d array. Adding .values doesn't change anything, because np.array(dataframe) is the same as dataframe.values:

np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)[:,None]
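Both points can be checked on a toy frame. This is only a sketch; the two-column DataFrame is invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical two-column stand-in for the orthographic features
df = pd.DataFrame({'exclamation': [0, 2, 1], 'question': [1, 0, 0]})

arr = np.array(df)               # already 2d: (n_rows, n_columns)
arr3d = arr[:, None]             # the extra None makes it 3d

same = np.array_equal(arr, df.values)   # np.array(df) matches df.values

print(arr.shape, arr3d.shape, same)
```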

This has a chance of working:

hstack((X_train_gram, np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)))

though I'd suggest writing

from scipy import sparse

arr = np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)  # 2d, shape (n, 4)
M = sparse.coo_matrix(arr)
X_train_gram = sparse.hstack((X_train_gram, M))

It's more readable, and should be easier to debug if there are problems.
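Here is the same idea as a small self-contained sketch. The 3x4 matrix stands in for the real tf-idf output and the DataFrame values are made up, so treat the data (and the final conversion to CSR) as assumptions rather than part of the original code:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the tf-idf matrix X_train_gram (3 tweets, 4 terms)
X_train_gram = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.5],
                                           [1.0, 0.0, 0.0, 0.0],
                                           [0.0, 0.0, 0.7, 0.3]]))

# Hypothetical orthographic counts for the same 3 tweets
X_train = pd.DataFrame({
    'exclamation': [0, 2, 1],
    'question':    [1, 0, 0],
    'uppercase':   [3, 5, 0],
    'lowercase':   [20, 14, 31],
})

cols = ['exclamation', 'question', 'uppercase', 'lowercase']
arr = X_train[cols].values            # 2d array, shape (3, 4); no [:, None]
M = sparse.coo_matrix(arr)

# Row counts must match; the result has 4 + 4 = 8 columns
combined = sparse.hstack((X_train_gram, M)).tocsr()

print(combined.shape)
```

Keeping arr and M as separate names makes it easy to print their shapes when the stack fails.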
