
When I use OneHotEncoder in Spark, I get the result shown in the fourth column, which is a sparse vector.

// +---+--------+-------------+-------------+
// | id|category|categoryIndex|  categoryVec|
// +---+--------+-------------+-------------+
// |  0|       a|          0.0|(3,[0],[1.0])|
// |  1|       b|          2.0|(3,[2],[1.0])|
// |  2|       c|          1.0|(3,[1],[1.0])|
// |  3|      NA|          3.0|    (3,[],[])|
// |  4|       a|          0.0|(3,[0],[1.0])|
// |  5|       c|          1.0|(3,[1],[1.0])|
// +---+--------+-------------+-------------+
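For context, the `(size, [indices], [values])` notation above is Spark's sparse-vector encoding: a vector of the given size where only the listed indices hold the listed values and everything else is zero. A minimal plain-Python sketch of the decoding (the helper name `sparse_to_dense` is mine, not a Spark API):

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's (size, [indices], [values]) sparse encoding to a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

sparse_to_dense(3, [0], [1.0])  # row 0, category 'a' -> [1.0, 0.0, 0.0]
sparse_to_dense(3, [], [])      # row 3, 'NA' -> the all-zero vector [0.0, 0.0, 0.0]
```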

However, what I want is to produce one column per category, just like pandas `get_dummies` does:

>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
  • Why would you want to do this? This will make your data very big and memory-inefficient. Commented Mar 19, 2017 at 9:07
  • It will not make the data that big, because I don't have many distinct values in my dataset. The resulting features will be 122 (122 columns). I want to do this so it is easier to process the data with TensorFlow; I want to feed it as input to a neural network. Commented Mar 20, 2017 at 19:43

2 Answers


Spark's OneHotEncoder creates a sparse vector column. To create one output column per category, similar to pandas `get_dummies`, we can use the pyspark DataFrame's `withColumn` method, passing a udf as a parameter. For example:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame(
    [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')], ('col1', 'col2'))

# Collect the distinct categories, sorted for a stable column order
categories = df.select('col2').distinct().rdd.flatMap(lambda x: x).collect()
categories.sort()

# Add one 0/1 indicator column per category; binding the loop variable as a
# default argument guards against Python's late-binding closures
for category in categories:
    function = udf(lambda item, cat=category: 1 if item == cat else 0,
                   IntegerType())
    new_column_name = 'col2' + '_' + category
    df = df.withColumn(new_column_name, function(col('col2')))

df.show()

Output-

+----+----+------+------+------+------+                                         
|col1|col2|col2_a|col2_b|col2_c|col2_d|
+----+----+------+------+------+------+
|   0|   a|     1|     0|     0|     0|
|   1|   b|     0|     1|     0|     0|
|   2|   c|     0|     0|     1|     0|
|   3|   d|     0|     0|     0|     1|
+----+----+------+------+------+------+

I hope this helps.


Can't comment because I don't have the reputation points, so answering the question instead.

This is actually one of the best things about Spark pipelines and transformers! I do not understand why you would need the data in this format. Can you elaborate?

1 Comment

Thanks for the reply. Repeating my comment above: It will not make the data that big, because I don't have many distinct values in my dataset. The resulting features will be 122 (122 columns). I want to do this so it is easier to process the data with TensorFlow; I want to feed it as input to a neural network.
