
I have read many different blogs on this topic, but haven't been able to find a clear solution. I have the following scenario:

  1. I have a list of pairs of texts, where each pair is labeled 1 or -1.
  2. For each text pair (t1, t2), I want the features to be a concatenation of the two tf-idf vectors: f(t1, t2) = tfidf(t1) "concat" tfidf(t2)

Any suggestions on how to do this? I have the following code, but it gives an error:

    count_vect = TfidfVectorizer(analyzer=u'char', ngram_range=ngram_range)
    X0_train_counts = count_vect.fit_transform([x[0] for x in training_documents])
    X1_train_counts = count_vect.fit_transform([x[1] for x in training_documents])
    combined_features = FeatureUnion([("x0", X0_train_counts), ("x1", X1_train_counts)])
    clf = LinearSVC().fit(combined_features, training_target)
    average_training_accuracy += clf.score(combined_features, training_target)

Here's the error I get:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    scoreEdgesUsingClassifier(None, pos, neg, 1, ngram_range=(2,5), max_size=1000000, test_size=100000)

    scoreEdgesUsingClassifier(unc, pos, neg, number_of_iterations, ngram_range, max_size, test_size)
        X0_train_counts = count_vect.fit_transform([x[0] for x in training_documents])
        X1_train_counts = count_vect.fit_transform([x[1] for x in training_documents])
        combined_features = FeatureUnion([("x0", X0_train_counts), ("x1", X1_train_counts)])
        print "Done transforming, now training classifier"

    lib/python2.7/site-packages/sklearn/pipeline.pyc in __init__(self, transformer_list, n_jobs, transformer_weights)
        616         self.n_jobs = n_jobs
        617         self.transformer_weights = transformer_weights
    --> 618         self._validate_transformers()
        619
        620     def get_params(self, deep=True):

    lib/python2.7/site-packages/sklearn/pipeline.pyc in _validate_transformers(self)
        660                 raise TypeError("All estimators should implement fit and "
        661                                 "transform. '%s' (type %s) doesn't" %
    --> 662                                 (t, type(t)))
        663
        664     def _iter(self):

    TypeError: All estimators should implement fit and transform. '  (0, 49025) 0.0575144797079

      (254741, 38401)    0.184394443164
      (254741, 201747)   0.186080393768
      (254741, 179231)   0.195062580945
      (254741, 156925)   0.211367771299
      (254741, 90026)    0.202458920022' (type <class 'scipy.sparse.csr.csr_matrix'>) doesn't

Update

Here's the solution:

    from scipy.sparse import hstack

    count_vect = TfidfVectorizer(analyzer=u'char', ngram_range=ngram_range)
    # Fit one vectorizer on all texts: first elements of the pairs, then second elements
    training_docs_combined = [x[0] for x in training_documents] + [x[1] for x in training_documents]
    X_train_counts = count_vect.fit_transform(training_docs_combined)
    # The first half of the rows are tfidf(t1), the second half tfidf(t2);
    # hstack concatenates the two halves column-wise into one feature matrix per pair
    concat_features = hstack((X_train_counts[0:len(training_docs_combined) / 2],
                              X_train_counts[len(training_docs_combined) / 2:]))

    clf = LinearSVC().fit(concat_features, training_target)
    average_training_accuracy += clf.score(concat_features, training_target)
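
For scoring held-out pairs, the same fitted vectorizer would be reused with transform (not fit_transform). A minimal sketch, assuming test_documents and test_target are placeholder names with the same structure as the training data:

    # Reuse the fitted vectorizer; only transform the test texts
    test_docs_combined = [x[0] for x in test_documents] + [x[1] for x in test_documents]
    X_test_counts = count_vect.transform(test_docs_combined)
    # Same split-and-concatenate trick as above
    test_features = hstack((X_test_counts[0:len(test_docs_combined) / 2],
                            X_test_counts[len(test_docs_combined) / 2:]))
    test_accuracy = clf.score(test_features, test_target)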
  • The labels are for a pair of texts, not a single text? What error are you getting? Commented Mar 16, 2017 at 20:51
  • I put in the error; yes, the labels are for a pair. Commented Mar 16, 2017 at 20:55

1 Answer


FeatureUnion from scikit-learn takes estimators (objects implementing fit and transform) as input, not already-transformed data matrices.

You can either concatenate the resulting X0_train_counts and X1_train_counts matrices directly with scipy.sparse.hstack, or create two independent TfidfVectorizer instances, combine them with FeatureUnion, and then call its fit_transform method on the raw pairs.
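
For the FeatureUnion route, each vectorizer has to receive only its own element of the pair, which is usually done by wrapping it in a small selector transformer. A minimal sketch, using training_documents and training_target from the question; PairElementSelector is a hypothetical helper, not part of scikit-learn:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import LinearSVC

    class PairElementSelector(BaseEstimator, TransformerMixin):
        """Pick element `index` (0 or 1) out of each (t1, t2) pair."""
        def __init__(self, index):
            self.index = index
        def fit(self, X, y=None):
            return self
        def transform(self, X):
            return [pair[self.index] for pair in X]

    combined = FeatureUnion([
        ("x0", Pipeline([("sel", PairElementSelector(0)),
                         ("tfidf", TfidfVectorizer(analyzer='char', ngram_range=(2, 5)))])),
        ("x1", Pipeline([("sel", PairElementSelector(1)),
                         ("tfidf", TfidfVectorizer(analyzer='char', ngram_range=(2, 5)))])),
    ])

    # Pairs go in, column-wise concatenated tf-idf features come out
    X_train = combined.fit_transform(training_documents)
    clf = LinearSVC().fit(X_train, training_target)

Note that in this variant each TfidfVectorizer learns its own vocabulary for its side of the pair, whereas the hstack solution in the question fits a single vectorizer on both sides.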

1 Comment

Thanks! hstack did the trick. I have updated the question with the solution.
