I am trying to convert a specific column into one-hot-encoded columns (one column per distinct value). For example, given:

-------------
Content|type|
-------------
alpha  | A  |
beta   | B  |
gamma  | C  |
theta  | A  |
zeta   | C  |
neta   | B  |
-------------

What I am trying to produce is the following:

----------------------------
Content|type_A|type_B|type_C|
----------------------------
alpha  |  1   |  0   |  0   |
beta   |  0   |  1   |  0   |
gamma  |  0   |  0   |  1   |
theta  |  1   |  0   |  0   |
zeta   |  0   |  0   |  1   |
neta   |  0   |  1   |  0   |
-----------------------------
  • Have you looked into the Spark ML OneHotEncoder, or the pivot function? Commented Jul 24, 2019 at 10:40
  • As far as I know, the one-hot encoder returns a single vector column sized by the number of unique values in the column, rather than separate columns. And the pivot function depends on groupBy in order to aggregate a column. Commented Jul 24, 2019 at 11:09

1 Answer

I think pivot is what you are looking for:

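import org.apache.spark.sql.functions.count
import spark.implicits._  // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  ("alpha", "A"),
  ("beta", "B"),
  ("gamma", "C"),
  ("theta", "A"),
  ("zeta", "C"),
  ("neta", "B")
).toDF("Content", "type")

// One row per Content value and one column per distinct type.
// count("type") is 1 where the (Content, type) pair exists.
val result = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)  // pivot leaves null for missing combinations; replace with 0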

Output:

+-------+---+---+---+
|Content|A  |B  |C  |
+-------+---+---+---+
|neta   |0  |1  |0  |
|beta   |0  |1  |0  |
|gamma  |0  |0  |1  |
|theta  |1  |0  |0  |
|zeta   |0  |0  |1  |
|alpha  |1  |0  |0  |
+-------+---+---+---+
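
Since the comments raise the concern that pivot depends on groupBy, here is a rough alternative sketch that avoids both by building one indicator column per distinct value with when/otherwise. This is not part of the answer above; it assumes the number of distinct types is small enough to collect to the driver, and the `type_` prefix and the local SparkSession setup are illustrative choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("one-hot-when").getOrCreate()
import spark.implicits._

val df = Seq(
  ("alpha", "A"), ("beta", "B"), ("gamma", "C"),
  ("theta", "A"), ("zeta", "C"), ("neta", "B")
).toDF("Content", "type")

// Collect the distinct categories to the driver (assumes there are only a few).
val categories = df.select("type").distinct().as[String].collect().sorted

// Add one 0/1 indicator column per category, then drop the original column.
val encoded = categories.foldLeft(df) { (acc, v) =>
  acc.withColumn(s"type_$v", when(col("type") === v, 1).otherwise(0))
}.drop("type")

encoded.show(false)
```

Unlike the pivot approach, this keeps one output row per input row, so repeated Content values are not merged.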
2 Comments

Instead of the column names "A", "B", and "C", can we add a prefix (type), as in "Type_A", using the code above? Or do we have to rename the columns manually?
Sure, you can add an alias for it, but that only works if you have more than one agg; otherwise you have to rename the columns manually.
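
One way to do the manual rename mentioned in the last comment is to fold over the pivoted columns and prefix everything except the grouping key. This is a minimal sketch, assuming the same sample data as the answer; the `type_` prefix and the local SparkSession setup are illustrative choices, not something the answer prescribes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("one-hot-pivot").getOrCreate()
import spark.implicits._

val df = Seq(
  ("alpha", "A"), ("beta", "B"), ("gamma", "C"),
  ("theta", "A"), ("zeta", "C"), ("neta", "B")
).toDF("Content", "type")

val pivoted = df.groupBy("Content")
  .pivot("type")
  .agg(count("type"))
  .na.fill(0)

// Rename every column except the grouping key, e.g. "A" becomes "type_A".
val renamed = pivoted.columns
  .filter(_ != "Content")
  .foldLeft(pivoted) { (acc, c) => acc.withColumnRenamed(c, s"type_$c") }

renamed.show(false)
```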