I am working on a project that has user reviews on products. I am using TfidfVectorizer to extract features from my dataset apart from some other features that I have extracted manually.
df = pd.read_csv('reviews.csv', header=0)
FEATURES = ['feature1', 'feature2']
reviews = df['review']
reviews = reviews.values.flatten()
vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore', ngram_range=(1, 3), stop_words='english', max_features=45)
X = vectorizer.fit_transform(reviews)
idf = vectorizer.idf_
features = vectorizer.get_feature_names()
FEATURES += features
inverse = vectorizer.inverse_transform(X)
for i, row in df.iterrows():
for f in features:
df.set_value(i, f, False)
for inv in inverse[i]:
df.set_value(i, inv, True)
train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
The above code works fine. But when I change the max_features from 45 to anything higher I get an error on tran_test_split line.
Traceback as follows:
Traceback (most recent call last):
File "analysis.py", line 120, in <module>
train_df, test_df = train_test_split(df, test_size = 0.2, random_state=700)
File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1906, in train_test_split
arrays = indexable(*arrays)
File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 201, in indexable
check_consistent_length(*result)
File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 173, in check_consistent_length
uniques = np.unique([_num_samples(X) for X in arrays if X is not None])
File "/Users/user/Tools/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 112, in _num_samples
'estimator %s' % x)
TypeError: Expected sequence or array-like, got estimator
I am not sure what exactly is changing when I change increase the max_features size.
Let me know if you need more data or if I have missed something
df.info()like?train_test_split(), not pandas Frame object.<class 'pandas.core.frame.DataFrame'> RangeIndex: 49998 entries, 0 to 49997 Columns: 934 entries, Unnamed: 0 to yes dtypes: bool(914), float64(3), int64(16), object(1) memory usage: 51.2+ MB None. Please note that 934 is because currently I am adding 900 features through tfidf.