You can flatten the (possibly nested) schema of the DataFrame and then check whether every leaf field is null:
from pyspark.sql import functions as F
from pyspark.sql import types as T
from functools import reduce
# create some test data with nested structures
data = ['{"active":null,"id":null,"uuid":null}',
        '{"active":1,"id":2,"uuid":3}',
        '{"active":4,"id":5,"uuid":6, "x": {"y":7, "z":8, "a":{"b":9,"c":10}}}',
        '{"active":4,"id":5,"uuid":6, "x": {"y":7, "z":8, "a":{"b":9,"c":null}}}']
df = spark.read.json(spark.sparkContext.parallelize(data))
# recursively yield the dotted path of every leaf field in the schema
def flatten(field, prefix=""):
    if isinstance(field, T.StructType):
        # a struct type: descend into each of its fields
        for f in field:
            yield from flatten(f, prefix)
    elif isinstance(field, T.StructField) and isinstance(field.dataType, T.StructType):
        # a field that is itself a struct: recurse, extending the prefix
        for f in field.dataType:
            yield from flatten(f, prefix + field.name + ".")
    else:
        # a leaf field: yield its fully qualified name
        yield prefix + field.name
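# sanity check (optional): for the test data above the generator should yield
# the dotted leaf paths ['active', 'id', 'uuid', 'x.a.b', 'x.a.c', 'x.y', 'x.z'],
# since spark.read.json orders inferred struct fields alphabetically
print(list(flatten(df.schema)))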
# one isNull check per leaf column, combined with logical AND
cols_are_null = [F.col(c).isNull() for c in flatten(df.schema)]
all_cols_are_null = reduce(lambda l, r: l & r, cols_are_null)
df.withColumn("all_cols_are_null", all_cols_are_null).show()
#+------+----+----+-----------------+-----------------+
#|active| id|uuid| x|all_cols_are_null|
#+------+----+----+-----------------+-----------------+
#| null|null|null| null| true|
#| 1| 2| 3| null| false|
#| 4| 5| 6| {{9, 10}, 7, 8}| false|
#| 4| 5| 6|{{9, null}, 7, 8}| false|
#+------+----+----+-----------------+-----------------+
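If you want to drop such rows instead of just flagging them, the same condition can be negated in a filter. A minimal sketch reusing all_cols_are_null from above (it keeps the last three rows of the test data, where at least one leaf field is non-null):
# keep only rows where at least one leaf field is non-null
df.filter(~all_cols_are_null).show()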