
I want to take a list of dictionaries (records) in which some columns have a list of values as the cell value. Here is an example:

[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]

How can I take this input and perform feature hashing on it (my data set has thousands of columns)? Currently I am using one-hot encoding, but it consumes a lot of RAM (more than my system has).

I tried taking my dataset as above and got an error:

x__ = h.transform(data)

Traceback (most recent call last):

  File "<ipython-input-14-db4adc5ec623>", line 1, in <module>
    x__ = h.transform(data)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)

  File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103)

TypeError: a float is required

I also tried to turn it into a dataframe and pass it to the hasher:

x__ = h.transform(x_y_dataframe)

Traceback (most recent call last):

  File "<ipython-input-15-109e7f8018f3>", line 1, in <module>
    x__ = h.transform(x_y_dataframe)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)

  File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()

AttributeError: 'unicode' object has no attribute 'items'

Any idea how I can implement this with either pandas or sklearn? Or could I build my dummy variables a few thousand rows at a time?

Here is how I am getting my dummy variables using pandas:

def one_hot_encode(categorical_labels):
    # x is the DataFrame holding the categorical columns
    res = []
    for col in categorical_labels:
        # strip the list brackets and build dummy columns (can't set a prefix here)
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')
        res.append(v)
        # collapse the accumulated frames periodically to keep the list short
        if len(res) == 2:
            res = [pandas.concat(res, axis=1)]
    result = pandas.concat(res, axis=1)
    return result
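
(For context on the TypeError above: FeatureHasher with input_type='dict' expects each dict value to be a string or a number, so a list value such as ['apple', 'banana'] cannot be hashed directly. Below is a minimal preprocessing sketch, not the accepted solution, that expands each list element into its own 'column=value' feature before hashing; the flatten_record helper is illustrative.)

from sklearn.feature_extraction import FeatureHasher

def flatten_record(record):
    """Expand list values into separate 'column=value' features."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for item in value:
                flat['%s=%s' % (key, item)] = 1
        elif isinstance(value, (int, float)):
            flat[key] = value                 # numeric values pass through
        else:
            flat['%s=%s' % (key, value)] = 1  # plain string categories
    return flat

data = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
h = FeatureHasher(n_features=2 ** 20, input_type='dict')
X = h.transform(flatten_record(d) for d in data)  # sparse result, low memory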
  • You can transform the lists into tuples, which are hashable. Commented May 9, 2017 at 13:08

1 Answer


Consider the following approach:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]

df = pd.DataFrame(lst)

vect = CountVectorizer()

X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))

r = pd.DataFrame(X.A, columns=vect.get_feature_names(), index=df.index)

df.join(r)

Result:

In [66]: r
Out[66]:
   apple  banana
0      1       0
1      1       1

In [67]: df.join(r)
Out[67]:
   age            fruit  apple  banana
0   27            apple      1       0
1   32  [apple, banana]      1       1

UPDATE: starting from Pandas 0.20.1 we can create a SparseDataFrame directly from a sparse matrix:

In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0)

In [14]: r
Out[14]:
   apple  banana
0      1       0
1      1       1

In [15]: r.memory_usage()
Out[15]:
Index     80   
apple     16   # 2 * 8 bytes (np.int64)
banana     8   # 1 * 8 bytes (as there is only one `1` value)
dtype: int64

In [16]: r.dtypes
Out[16]:
apple     int64
banana    int64
dtype: object
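
If the same idea has to cover many list-valued columns (as in the question), the per-column sparse matrices can be stacked instead of building dense dummies. Here is a minimal sketch, assuming df holds the categorical columns; the encode_columns helper and the column list are illustrative:

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

def encode_columns(df, columns):
    blocks = []
    for col in columns:
        vect = CountVectorizer()
        # join list cells into a space-separated string, leave scalars as text
        docs = df[col].map(lambda v: ' '.join(v) if isinstance(v, list) else str(v))
        blocks.append(vect.fit_transform(docs))
    return sparse.hstack(blocks).tocsr()  # one sparse matrix, modest memory

X = encode_columns(df, ['fruit'])  # extend the list to all categorical columns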

Comments

This does work, although I seem to run out of memory (32 GB); I guess there are a lot of columns. I also noticed that when I split the df apart so that I can do it in sets, it gives me a lot of NaNs (even though I drop all the NaNs from my dataframe ahead of time).
I realized the reason I am getting NaNs is that I am concatenating without setting axis=1.
@Kevin, in Pandas 0.20.1 you can create a SparseDataFrame directly from a sparse matrix (the result of CountVectorizer). Please check my updated answer.
It actually works fine for numbers in a field; just do astype(str). Thanks!
@Kevin, glad it helps :)