
I have a Spark DataFrame like this:

id1 id2 attrname attr_value attr_valuelist
1    2  test         Yes      Yes, No
2    1  test1        No       Yes, No
3    2  test2       value1    val1, Value1,value2
4    1  test3         3        0, 1, 2
5    3  test4         0        0, 1, 2
11   2  test         Yes      Yes, No
22   1  test1        No1       Yes, No
33   2  test2       value0    val1, Value1,value2
44   1  test3         11        0, 1, 2
55   3  test4         0        0, 1, 2

val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")

I want to check whether attr_value exists in attr_valuelist, and keep only the rows where it does not:

id1 id2 attrname attr_value attr_valuelist
4    1  test3         3        0, 1, 2
22   1  test1        No1       Yes, No
33   2  test2       value0    val1, Value1,value2
44   1  test3         11        0, 1, 2

2 Answers


You can simply use the contains function on your DataFrame:

import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)

You should get the following output:

+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist     |
+---+---+--------+----------+-------------------+
|3  |2  |test2   |value1    |val1, Value1,value2|
|4  |1  |test3   |3         |0, 1, 2            |
|22 |1  |test1   |No1       |Yes, No            |
|33 |2  |test2   |value0    |val1, Value1,value2|
|44 |1  |test3   |11        |0, 1, 2            |
+---+---+--------+----------+-------------------+

If you want to ignore case, you can simply use the lower function:

df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)

which should give:

+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist     |
+---+---+--------+----------+-------------------+
|4  |1  |test3   |3         |0, 1, 2            |
|22 |1  |test1   |No1       |Yes, No            |
|33 |2  |test2   |value0    |val1, Value1,value2|
|44 |1  |test3   |11        |0, 1, 2            |
+---+---+--------+----------+-------------------+
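Since the question loads the data via sqlContext.sql, the same filter can also be expressed directly in Spark SQL using instr, which returns 0 when the first string does not contain the second (a sketch, assuming dftable is the registered temp table from the question):

```scala
// Spark SQL equivalent of the case-sensitive filter above:
// instr(attr_valuelist, attr_value) = 0 means attr_value does not
// occur as a substring of attr_valuelist.
val filtered = sqlContext.sql(
  """select id1, id2, attrname, attr_value, attr_valuelist
    |from dftable
    |where instr(attr_valuelist, attr_value) = 0""".stripMargin)
filtered.show(false)
```

This may explain why testing "through SQL" gave different results: the SQL side must use instr (or a like expression), since contains is a Column method, not a SQL function name in older Spark versions.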

6 Comments

Thanks Ramesh, but the above example will also match rows like id1=44, id2=1, attrname=test3, attr_value=1, attr_valuelist="10, 11, 12". attr_valuelist is not an array, it is a string.
Did you test it? I added the above line and it was filtered out. And I am treating attr_valuelist as a string too, not as an array.
Sorry, I was testing through SQL, so it was coming back as true; I am also not getting why. Thank you very much. I am not able to click accept answer yet; I will do it in some time.
you can accept whenever you can. and remember to upvote as well when you will be eligible. ;)
sure I will do that
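As the comment about attr_value 1 matching "10, 11, 12" hints, contains does a plain substring match, not an element match. If you need exact element matching, one option (a sketch, not part of the original answers) is a UDF that splits the list on commas, trims each piece, and checks membership:

```scala
import org.apache.spark.sql.functions._

// Exact element match: split attr_valuelist on commas, trim whitespace
// from each piece, then test membership instead of a substring search.
// With this, attr_value "1" does NOT match the list "10, 11, 12".
val inList = udf { (value: String, list: String) =>
  list.split(",").map(_.trim).contains(value)
}

df.filter(!inList(col("attr_value"), col("attr_valuelist"))).show(false)
```

This changes the semantics of the filter (element equality rather than substring), so only use it if that is what your data actually requires.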

You can define a custom function (a user-defined function, or UDF, in Spark) that tests whether the value of one column is contained in the value of another column, like this:

def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))

You can tweak the contains function however you want, and then select from your DataFrame like this:

df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
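As an example of such a tweak (an assumption, not part of the original answer), a case-insensitive variant could lower-case both sides before the substring check, mirroring the lower(...) approach from the first answer:

```scala
import org.apache.spark.sql.functions.udf

// Case-insensitive variant of notContains: lower-case both column
// values before the substring check.
def notContainsIgnoreCase = udf { (attr: String, attrList: String) =>
  !attrList.toLowerCase.contains(attr.toLowerCase)
}

df.where(notContainsIgnoreCase(df("attr_value"), df("attr_valuelist"))).show(false)
```

Note that UDFs are opaque to the Catalyst optimizer, so the built-in contains/lower column functions from the first answer are generally preferable when they express the same logic.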

