TypeError: Expected sequence or array-like, got estimator

Question

I am working on a project that involves user reviews of products. I am using TfidfVectorizer to extract features from my dataset, in addition to some other features that I extracted manually.

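import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# The traceback below shows train_test_split coming from sklearn.cross_validation
from sklearn.cross_validation import train_test_split

df = pd.read_csv('reviews.csv', header=0)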

FEATURES = ['feature1', 'feature2']
reviews = df['review']
reviews = reviews.values.flatten()

vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)

X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
inverse = vectorizer.inverse_transform(X)

# Add one boolean column per tf-idf term: True if the term occurs in that review
for i, row in df.iterrows():
    for f in features:
        df.set_value(i, f, False)
        for inv in inverse[i]:
            df.set_value(i, inv, True)

train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)

The above code works fine. But when I change max_features from 45 to anything higher, I get an error on the train_test_split line.

Traceback as follows:

Traceback (most recent call last):
  File "analysis.py", line 120, in <module>
    train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
    arrays = indexable(*arrays)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
    check_consistent_length(*result)
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
    uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
  File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
    'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator

I am not sure what exactly changes when I increase max_features.

Let me know if you need more data or if I have missed something.

You should pass X and y to train_test_split(), not a pandas DataFrame object.
– sergzach, Commented Sep 28, 2016 at 12:56
@sergzach I used this answer: stackoverflow.com/a/24151789/3735157
– Deepak Puthraya, Commented Sep 28, 2016 at 13:05
@sergzach The code works fine when I use 45 features from the Tfidf. But when I increase the number of features beyond that, it gives me the error. I followed another technique to split my feature set into train and test, and that worked for more than 45 features. So my question is: what changed when I increased it from 45?
– Deepak Puthraya, Commented Sep 28, 2016 at 13:17
@NickilMaveli df.info gives:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49998 entries, 0 to 49997
Columns: 934 entries, Unnamed: 0 to yes
dtypes: bool(914), float64(3), int64(16), object(1)
memory usage: 51.2+ MB
None
Please note that there are 934 columns because I am currently adding 900 features through tfidf.
– Deepak Puthraya, Commented Sep 28, 2016 at 13:18

elz · Accepted Answer · 2017-05-12 21:08:00Z

I know this is old, but I had the same issue and while the answer from @shahins works, I wanted something that would keep the dataframe object so I can have my indexing in the train/test splits.

Solution:

Rename the dataframe column "fit" to something (anything) else:

df = df.rename(columns = {'fit': 'fit_feature'})

Why it works:

It isn't actually the number of features that is the issue; one feature in particular is causing the problem. I'm guessing you are getting the word "fit" as one of your text features (and it didn't show up with the lower max_features threshold).

Looking at the sklearn source code, it checks to make sure you are not passing an sklearn estimator by testing whether any of your objects has a "fit" attribute. The code is checking for the fit method of an sklearn estimator, but it will also raise the exception when the dataframe has a "fit" column (remember that df.fit and df['fit'] both select the "fit" column).
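A minimal sketch of the mechanism described above. The helper looks_like_estimator is a hypothetical stand-in for the kind of check the answer describes, not sklearn's actual code, and the column names are made up for illustration:

```python
import pandas as pd


def looks_like_estimator(obj):
    # Hypothetical, simplified stand-in for the check described in the answer:
    # anything that exposes a "fit" attribute is treated as an estimator.
    return hasattr(obj, "fit")


# Toy frame with one column per text feature; one of the words is "fit".
df = pd.DataFrame({"great": [True, False], "fit": [False, True]})

# pandas exposes columns as attributes, so df.fit resolves to the "fit" column.
print(looks_like_estimator(df))  # True

# After renaming the column, the attribute lookup fails and the check passes.
renamed = df.rename(columns={"fit": "fit_feature"})
print(looks_like_estimator(renamed))  # False
```

This also suggests why the error only appeared above a certain max_features value: the frame changes behavior only once a term named "fit" is among the generated columns.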

happyhuman · Accepted Answer · 2017-01-23 21:47:44Z

3

I had this issue and I tried something like this and it worked for me:

train_test_split(df.as_matrix(), test_size = 0.2, random_state=700)

1 Comment

billmanH Over a year ago

Actually this works for the same reason as @elphz describes. Having a column called 'fit' causes the issue. You can either rename it or convert it into a matrix. If you want to keep the original column names (e.g. the features are words) then this is the better way.
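A sketch of the array-based workaround on more recent library versions, under a few assumptions: as_matrix() was removed in pandas 1.0, so to_numpy() is used instead, and train_test_split is imported from sklearn.model_selection, where it lives in current sklearn releases. The toy data and column names are invented for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame that includes a column named "fit".
df = pd.DataFrame({
    "feature1": [0.1, 0.4, 0.3, 0.9, 0.5],
    "fit": [True, False, True, False, True],
})

# Remember the column names, then hand sklearn a plain NumPy array.
columns = list(df.columns)
data = df.to_numpy()  # successor of as_matrix()

train_arr, test_arr = train_test_split(data, test_size=0.2, random_state=700)

# Rebuild frames afterwards so the original column names (the words) survive.
train_df = pd.DataFrame(train_arr, columns=columns)
test_df = pd.DataFrame(test_arr, columns=columns)
print(train_df.shape, test_df.shape)
```

Note that the rebuilt frames get a fresh 0..n-1 index, so the original row labels are lost. If those matter, renaming the column as in the other answer keeps them.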

Pratibha · Accepted Answer · 2018-04-26 04:47:11Z

0

train_test_split(x.as_matrix(), y.as_matrix(), test_size=0.2, random_state=0)

This worked for me.
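For reference, a sketch of the same call with features and target passed separately, again assuming a pandas version where to_numpy() replaces as_matrix(). The frame, the "label" column, and the variable names are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two word features (one named "fit") and a target column.
df = pd.DataFrame({
    "fit": [True, False, True, False, True, False, True, False, True, False],
    "great": [False, False, True, True, False, True, False, True, True, False],
    "label": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

x = df[["fit", "great"]]
y = df["label"]

# Passing arrays for both X and y returns four pieces: train/test for each.
X_train, X_test, y_train, y_test = train_test_split(
    x.to_numpy(), y.to_numpy(), test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```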
