I have the following schema:

root 
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

My event_params is an array of structs. Sample Data:

{
  "event_params": [
    {
      "element": {
        "value": {
          "string_value": "LoginVC",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "firebase_screen_class"
      }
    },
    {
      "element": {
        "value": {
          "string_value": null,
          "float_value": null,
          "double_value": null,
          "int_value": 3600000
        },
        "key": "engagement_time_msec"
      }
    },
    {
      "element": {
        "value": {
          "string_value": "app_entered_background",
          "float_value": null,
          "double_value": null,
          "int_value": null
        },
        "key": "item_name"
      }
    }
  ]
}
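
For reference, a minimal sketch that reproduces this structure in PySpark (the view name temp_data_l1 is the one used in the answer below; everything else follows the schema printed above):

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, DoubleType, LongType,
                               StringType, StructField, StructType)

spark = SparkSession.builder.getOrCreate()

# Field order matches the printSchema output above
value_type = StructType([
    StructField("string_value", StringType()),
    StructField("int_value", LongType()),
    StructField("float_value", DoubleType()),
])
param_type = StructType([
    StructField("key", StringType()),
    StructField("value", value_type),
])
schema = StructType([StructField("event_params", ArrayType(param_type))])

# One row holding the three sample params from the JSON above
rows = [([
    ("firebase_screen_class", ("LoginVC", None, None)),
    ("engagement_time_msec", (None, 3600000, None)),
    ("item_name", ("app_entered_background", None, None)),
],)]

spark.createDataFrame(rows, schema).createOrReplaceTempView("temp_data_l1")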

How do I create a new column at the same row level, taking the value from value.string_value where key = "item_name"? I do not want to filter out any rows, since I want to repeat the process for two more keys.

So I want a new schema something like this:

root 
|-- item_name_string_value: string (nullable = true)
|-- firebase_screen_class_string_value: string (nullable = true)
|-- event_params: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- key: string (nullable = true) 
| | |-- value: struct (nullable = true) 
| | | |-- string_value: string (nullable = true) 
| | | |-- int_value: long (nullable = true) 
| | | |-- float_value: double (nullable = true)

I want to achieve this using PySpark.

  • What have you tried? What problem did you encounter?
  • @mck I was trying to do something similar to stackoverflow.com/questions/43756940/… but I could not convert it to PySpark since the solution was in Scala.
  • @mck I also looked at the explode option but couldn't extract the value based on a condition.
  • Your target schema doesn't make sense. You can't have both a string and a struct inside an array.
  • @mck I want event_params and item_name_string_value at the same level. I want to extract the value based on the key condition from the array of structs and bring it to the root level.

1 Answer


This seems to be working for me, adapted from pyspark: filtering and extract struct through ArrayType column:

from pyspark.sql.functions import col, expr

curDFfil = spark.sql("select event_params from temp_data_l1")

# Keep the array elements whose key is 'item_name' and take the first match
df = curDFfil.select(expr("filter(event_params, s -> s.key == 'item_name')").getItem(0).alias('item_name'))

# Pull the nested string_value out of the matched struct
newDf = df.select(col("item_name.value.string_value").alias('item_name_string_value'))
newDf.show(10, False)
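
To keep event_params and add one column per key at the row level, as the question asks, the same filter expression can be applied in a loop with withColumn. A minimal sketch, assuming the temp_data_l1 view from above (the loop and the column-name pattern are my own, not from the original answer):

from pyspark.sql.functions import expr

events = spark.table("temp_data_l1")

for key in ["item_name", "firebase_screen_class"]:
    events = events.withColumn(
        key + "_string_value",
        # filter() keeps the elements whose key matches; [0] takes the first
        # match and yields NULL when the key is absent, so no rows are dropped
        expr(f"filter(event_params, s -> s.key = '{key}')")[0]["value"]["string_value"],
    )

events.printSchema()  # new columns appear alongside the untouched event_params

Since [0] (getItem) returns NULL rather than failing on rows without a matching key, every row is preserved, which satisfies the "no filtering" requirement.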