
I want to take a list of dictionaries (records) in which some columns have a list of values as the cell value. Here is an example:

[{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]

How can I take this input and perform feature hashing on it (my data set has thousands of columns)? Currently I am using one-hot encoding, but it consumes a lot of RAM (more than my system has).

I tried taking my dataset as above and got an error:

x__ = h.transform(data)

Traceback (most recent call last):

  File "<ipython-input-14-db4adc5ec623>", line 1, in <module>
    x__ = h.transform(data)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)

  File "sklearn/feature_extraction/_hashing.pyx", line 52, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:2103)

TypeError: a float is required

I also tried to turn it into a dataframe and pass it to the hasher:

x__ = h.transform(x_y_dataframe)

Traceback (most recent call last):

  File "<ipython-input-15-109e7f8018f3>", line 1, in <module>
    x__ = h.transform(x_y_dataframe)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 142, in transform
    _hashing.transform(raw_X, self.n_features, self.dtype)

  File "sklearn/feature_extraction/_hashing.pyx", line 46, in sklearn.feature_extraction._hashing.transform (sklearn/feature_extraction/_hashing.c:1928)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 138, in <genexpr>
    raw_X = (_iteritems(d) for d in raw_X)

  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/hashing.py", line 15, in _iteritems
    return d.iteritems() if hasattr(d, "iteritems") else d.items()

AttributeError: 'unicode' object has no attribute 'items'

Any idea how I can implement this with either pandas or sklearn? Or could I build my dummy variables a few thousand rows at a time?

Here is how I am getting my dummy variables using pandas:

def one_hot_encode(categorical_labels):
    # x is the DataFrame holding the categorical columns
    res = []
    for col in categorical_labels:
        # strip the list brackets and build dummy columns (can't set a prefix here)
        v = x[col].astype(str).str.strip('[]').str.get_dummies(', ')
        res.append(v)
        # collapse the accumulated frames periodically to keep the list short
        if len(res) == 2:
            res = [pandas.concat(res, axis=1)]
    result = pandas.concat(res, axis=1)
    return result
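
(For context on the TypeError above: FeatureHasher with input_type='dict' expects each dict value to be a string or a number, so a list value such as ['apple', 'banana'] cannot be hashed directly. Below is a minimal preprocessing sketch, not the accepted solution, that expands each list element into its own 'column=value' feature before hashing; the flatten_record helper is illustrative.)

from sklearn.feature_extraction import FeatureHasher

def flatten_record(record):
    """Expand list values into separate 'column=value' features."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, list):
            for item in value:
                flat['%s=%s' % (key, item)] = 1
        elif isinstance(value, (int, float)):
            flat[key] = value                 # numeric values pass through
        else:
            flat['%s=%s' % (key, value)] = 1  # plain string categories
    return flat

data = [{'fruit': 'apple', 'age': 27}, {'fruit': ['apple', 'banana'], 'age': 32}]
h = FeatureHasher(n_features=2 ** 20, input_type='dict')
X = h.transform(flatten_record(d) for d in data)  # sparse result, low memory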
  • You can transform the lists into tuples, which are hashable. Commented May 9, 2017 at 13:08

1 Answer


Consider the following approach:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

lst = [{'fruit': 'apple', 'age': 27}, {'fruit':['apple', 'banana'], 'age': 32}]

df = pd.DataFrame(lst)

vect = CountVectorizer()

X = vect.fit_transform(df.fruit.map(lambda x: ' '.join(x) if isinstance(x, list) else x))

r = pd.DataFrame(X.A, columns=vect.get_feature_names(), index=df.index)

df.join(r)

Result:

In [66]: r
Out[66]:
   apple  banana
0      1       0
1      1       1

In [67]: df.join(r)
Out[67]:
   age            fruit  apple  banana
0   27            apple      1       0
1   32  [apple, banana]      1       1

UPDATE: starting from Pandas 0.20.1 we can create a SparseDataFrame directly from a sparse matrix:

In [13]: r = pd.SparseDataFrame(X, columns=vect.get_feature_names(), index=df.index, default_fill_value=0)

In [14]: r
Out[14]:
   apple  banana
0      1       0
1      1       1

In [15]: r.memory_usage()
Out[15]:
Index     80   
apple     16   # 2 * 8 bytes (np.int64)
banana     8   # 1 * 8 bytes (as there is only one `1` value)
dtype: int64

In [16]: r.dtypes
Out[16]:
apple     int64
banana    int64
dtype: object
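
If the same idea has to cover many list-valued columns (as in the question), the per-column sparse matrices can be stacked instead of building dense dummies. Here is a minimal sketch, assuming df holds the categorical columns; the encode_columns helper and the column list are illustrative:

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

def encode_columns(df, columns):
    blocks = []
    for col in columns:
        vect = CountVectorizer()
        # join list cells into a space-separated string, leave scalars as text
        docs = df[col].map(lambda v: ' '.join(v) if isinstance(v, list) else str(v))
        blocks.append(vect.fit_transform(docs))
    return sparse.hstack(blocks).tocsr()  # one sparse matrix, modest memory

X = encode_columns(df, ['fruit'])  # extend the list to all categorical columns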

Comments

This does work, although I seem to run out of memory (32 GB); I guess there are a lot of columns. I also noticed that when I split the df apart so that I can do it in sets, it gives me a lot of NaNs (even though I drop all the NaNs from my dataframe ahead of time).
I realized the reason I am getting NaNs is that I am concatenating without setting axis=1.
@Kevin, in Pandas 0.20.1 you can create a SparseDataFrame directly from a sparse matrix (the result of CountVectorizer). Please check my updated answer.
It actually works fine for numbers in a field; just do astype(str). Thanks!
@Kevin, glad it helps :)