
I have a Spark DataFrame like this:

id1 id2 attrname attr_value attr_valuelist
1    2  test         Yes      Yes, No
2    1  test1        No       Yes, No
3    2  test2       value1    val1, Value1,value2
4    1  test3         3        0, 1, 2
5    3  test4         0        0, 1, 2
11   2  test         Yes      Yes, No
22   1  test1        No1       Yes, No
33   2  test2       value0    val1, Value1,value2
44   1  test3         11        0, 1, 2
55   3  test4         0        0, 1, 2

val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable")

I want to check whether attr_value exists in attr_valuelist, and keep only the rows where it does not:

id1 id2 attrname attr_value attr_valuelist
4    1  test3         3        0, 1, 2
22   1  test1        No1       Yes, No
33   2  test2       value0    val1, Value1,value2
44   1  test3         11        0, 1, 2

2 Answers


You can simply use the contains function on your DataFrame:

import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)

You should get the following output:

+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist     |
+---+---+--------+----------+-------------------+
|3  |2  |test2   |value1    |val1, Value1,value2|
|4  |1  |test3   |3         |0, 1, 2            |
|22 |1  |test1   |No1       |Yes, No            |
|33 |2  |test2   |value0    |val1, Value1,value2|
|44 |1  |test3   |11        |0, 1, 2            |
+---+---+--------+----------+-------------------+

If you want to ignore case, you can simply use the lower function:

df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)

which should give:

+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist     |
+---+---+--------+----------+-------------------+
|4  |1  |test3   |3         |0, 1, 2            |
|22 |1  |test1   |No1       |Yes, No            |
|33 |2  |test2   |value0    |val1, Value1,value2|
|44 |1  |test3   |11        |0, 1, 2            |
+---+---+--------+----------+-------------------+
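Since the question loads the data via sqlContext.sql, the same filter can also be expressed directly in Spark SQL using instr, which returns 0 when the first string does not contain the second (a sketch, assuming dftable is the registered temp table from the question):

```scala
// Spark SQL equivalent of the case-sensitive filter above:
// instr(attr_valuelist, attr_value) = 0 means attr_value does not
// occur as a substring of attr_valuelist.
val filtered = sqlContext.sql(
  """select id1, id2, attrname, attr_value, attr_valuelist
    |from dftable
    |where instr(attr_valuelist, attr_value) = 0""".stripMargin)
filtered.show(false)
```

This may explain why testing "through SQL" gave different results: the SQL side must use instr (or a like expression), since contains is a Column method, not a SQL function name in older Spark versions.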

6 Comments

Thanks Ramesh, but the above example will also match rows like id1=44, id2=1, attrname=test3, attr_value=1, attr_valuelist="10, 11, 12". attr_valuelist is not an array, it is a string.
Did you test it? I added the above line and it was filtered out. And I am treating attr_valuelist as a string too, not as an array.
Sorry, I was testing through SQL, so it was coming back as true; I am also not getting why. Thank you very much. I am not able to click accept answer yet; I will do it in some time.
you can accept whenever you can. and remember to upvote as well when you will be eligible. ;)
sure I will do that
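As the comment about attr_value 1 matching "10, 11, 12" hints, contains does a plain substring match, not an element match. If you need exact element matching, one option (a sketch, not part of the original answers) is a UDF that splits the list on commas, trims each piece, and checks membership:

```scala
import org.apache.spark.sql.functions._

// Exact element match: split attr_valuelist on commas, trim whitespace
// from each piece, then test membership instead of a substring search.
// With this, attr_value "1" does NOT match the list "10, 11, 12".
val inList = udf { (value: String, list: String) =>
  list.split(",").map(_.trim).contains(value)
}

df.filter(!inList(col("attr_value"), col("attr_valuelist"))).show(false)
```

This changes the semantics of the filter (element equality rather than substring), so only use it if that is what your data actually requires.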

You can define a custom function (a user-defined function, or UDF, in Spark) that tests whether the value of one column is contained in the value of another column, like this:

def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))

You can tweak the contains function however you want, and then select from your DataFrame like this:

df.where(contains(df("attr_value"), df("attr_valuelist")))
df.where(notContains(df("attr_value"), df("attr_valuelist")))
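As an example of such a tweak (an assumption, not part of the original answer), a case-insensitive variant could lower-case both sides before the substring check, mirroring the lower(...) approach from the first answer:

```scala
import org.apache.spark.sql.functions.udf

// Case-insensitive variant of notContains: lower-case both column
// values before the substring check.
def notContainsIgnoreCase = udf { (attr: String, attrList: String) =>
  !attrList.toLowerCase.contains(attr.toLowerCase)
}

df.where(notContainsIgnoreCase(df("attr_value"), df("attr_valuelist"))).show(false)
```

Note that UDFs are opaque to the Catalyst optimizer, so the built-in contains/lower column functions from the first answer are generally preferable when they express the same logic.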

