I am working on a project that has user reviews on products. I am using TfidfVectorizer to extract features from my dataset, in addition to some other features that I have extracted manually.

df = pd.read_csv('reviews.csv', header=0)

FEATURES = ['feature1', 'feature2']
reviews = df['review']
reviews = reviews.values.flatten()

vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)

X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
inverse =  vectorizer.inverse_transform(X)
  
for i, row in df.iterrows():
    for f in features:
        df.set_value(i, f, False)
        for inv in inverse[i]:
            df.set_value(i, inv, True)

train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)

The above code works fine. But when I change max_features from 45 to anything higher, I get an error on the train_test_split line.

Traceback as follows:

Traceback (most recent call last):
  File "analysis.py", line 120, in <module>
    train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
    arrays = indexable(*arrays)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
    'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator

I am not sure what exactly changes when I increase max_features.

Let me know if you need more data or if I have missed something.

  • What is df.info() like? Commented Sep 28, 2016 at 12:30
  • You should pass X and y to train_test_split(), not pandas Frame object. Commented Sep 28, 2016 at 12:56
  • @sergzach I used this answer stackoverflow.com/a/24151789/3735157 Commented Sep 28, 2016 at 13:05
  • @sergzach The code works fine when I use 45 features from the Tfidf. But when I increase the above features more than this it gives me the error. I followed another technique to split my features set into train and test and that worked for more than 45 features. So my question is what changed when I increased from 45. Commented Sep 28, 2016 at 13:17
  • @NickilMaveli df.info <class 'pandas.core.frame.DataFrame'> RangeIndex: 49998 entries, 0 to 49997 Columns: 934 entries, Unnamed: 0 to yes dtypes: bool(914), float64(3), int64(16), object(1) memory usage: 51.2+ MB None. Please note that 934 is because currently I am adding 900 features through tfidf. Commented Sep 28, 2016 at 13:18

3 Answers

8

I know this is old, but I had the same issue and while the answer from @shahins works, I wanted something that would keep the dataframe object so I can have my indexing in the train/test splits.

Solution:

Rename the dataframe column fit as something (anything) else:

df = df.rename(columns = {'fit': 'fit_feature'})

Why it works:

The number of features isn't actually the issue; one feature in particular is causing the problem. I'm guessing you are getting the word "fit" as one of your text features (and it didn't show up with the lower max_features threshold).

The sklearn source code makes sure you are not passing an sklearn estimator by testing whether any of your objects has a "fit" attribute. The check is meant to catch the fit method of an sklearn estimator, but it also raises the exception when your dataframe has a fit column (remember that df.fit and df['fit'] both select the "fit" column).
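To see the collision in isolation, here is a minimal sketch. The helper looks_like_estimator is hypothetical and only mirrors the attribute test described above (it is not sklearn's actual function), and the two-column dataframe is made-up data:

```python
import pandas as pd


def looks_like_estimator(obj):
    # Hypothetical stand-in for the attribute test described above:
    # anything exposing a "fit" attribute is treated as an estimator.
    return hasattr(obj, "fit")


# Made-up data: boolean word features, one of which is the word "fit".
df = pd.DataFrame({"fit": [True, False, True], "great": [False, True, True]})

# pandas exposes columns as attributes, so df.fit resolves to the column.
flagged_before = looks_like_estimator(df)

# Renaming the column removes the attribute and the false positive.
renamed = df.rename(columns={"fit": "fit_feature"})
flagged_after = looks_like_estimator(renamed)

print(flagged_before, flagged_after)
```

The same test can be run on the real dataframe right before the split, for example with hasattr(df, 'fit'), to confirm that this column is the culprit.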

3

I had this issue and I tried something like this and it worked for me:

train_test_split(df.as_matrix(), test_size = 0.2, random_state=700)

1 Comment

Actually this works for the same reason as @elphz describes. Having a column called 'fit' causes the issue. You can either rename it or convert it into a matrix. If you want to keep the original column names (e.g. the features are words) then this is the better way.
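A rough sketch of why the conversion helps, plus one way to keep the column names and index without renaming anything. Assumptions: to_numpy() is the method in newer pandas versions (older versions used as_matrix() or .values), the data is made up, and the manual permutation split below is a substitute for train_test_split, not what it does internally:

```python
import numpy as np
import pandas as pd

# Made-up data with a column literally named "fit".
df = pd.DataFrame({
    "fit": [True, False, True, False, True],
    "great": [False, True, True, False, False],
})

# A plain NumPy array has no "fit" attribute, so an attribute-based
# estimator check no longer fires. The column names are lost, though.
matrix = df.to_numpy()
array_has_fit = hasattr(matrix, "fit")

# Alternative that keeps the dataframe: shuffle row positions and select
# them with iloc. This is a manual split, not sklearn's train_test_split.
rng = np.random.RandomState(700)
positions = rng.permutation(len(df))
n_test = int(np.ceil(0.2 * len(df)))
test_df = df.iloc[positions[:n_test]]
train_df = df.iloc[positions[n_test:]]

print(array_has_fit, train_df.shape, test_df.shape)
```

Both resulting frames keep the original column names (including "fit") and the original row index, which is useful when the features are words.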
0
train_test_split(x.as_matrix(), y.as_matrix(), test_size=0.2, random_state=0)

This worked for me.
