I would like to explode a column into rows in a PySpark DataFrame backed by Hive. The DataFrame has two columns: "business_id" is a string, and "sports_info" is a struct whose single field, "sports_type", is an array of strings.
Data:
business_id  sports_info
"abc-123"    {"sports_type":
               ["{sport_name: most_recent,
                  sport_events: [{sport_id: 568, val: 10.827}, {sport_id: 171, val: 8.61}]}"
               ]
             }
I need to get a dataframe like:
business_id  sport_id
"abc-123"    568
"abc-123"    171
I defined:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

schema = StructType([
    StructField("sports_type", ArrayType(StringType()), True)  # ArrayType needs an element type
])
df = spark.createDataFrame(data=data, schema=schema) # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)
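For reference, here is a minimal sketch of how I imagine the sample row could be built for testing; the two-column schema and the exact quoting of the inner string are my assumptions, since the real data comes from Hive:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Assumed full schema: business_id plus the sports_info struct.
full_schema = StructType([
    StructField("business_id", StringType(), True),
    StructField("sports_info", StructType([
        StructField("sports_type", ArrayType(StringType()), True)
    ]), True)
])

# The inner string is written as a Python-dict literal so that eval() in
# get_ids() below can parse it; the real data may be JSON instead.
data = [
    ("abc-123",
     (["{'sport_name': 'most_recent', "
       "'sport_events': [{'sport_id': 568, 'val': 10.827}, "
       "{'sport_id': 171, 'val': 8.61}]}"],))
]

df = spark.createDataFrame(data=data, schema=full_schema)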
def get_ids(val):
    # val is the sports_info struct; sports_type[0] holds the string blob.
    # eval() assumes the string is a valid Python literal; json.loads would
    # be the safer choice if the data is actually JSON.
    sport_events = eval(val.sports_type[0])['sport_events']
    ids = [e['sport_id'] for e in sport_events]
    return ids
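As a sanity check, calling get_ids on a plain Row that mimics the struct (the sample string below is my own stand-in) should give [568, 171]:

from pyspark.sql import Row

sample = Row(sports_type=[
    "{'sport_name': 'most_recent', "
    "'sport_events': [{'sport_id': 568, 'val': 10.827}, "
    "{'sport_id': 171, 'val': 8.61}]}"
])
print(get_ids(sample))  # -> [568, 171]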
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# get_ids returns a flat list of ids, so the UDF's return type should be
# ArrayType(IntegerType()), not ArrayType(ArrayType(StringType())).
df2 = df.withColumn('sport_new',
                    F.udf(get_ids, ArrayType(IntegerType()))('sports_info'))
How could I create the df and extract/explode the inner nested elements?
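For what it's worth, the last step I have in mind (untested) is to explode the id array into one row per sport_id:

df3 = df2.select('business_id', F.explode('sport_new').alias('sport_id'))
df3.show()
# Hoped-for output:
# +-----------+--------+
# |business_id|sport_id|
# +-----------+--------+
# |    abc-123|     568|
# |    abc-123|     171|
# +-----------+--------+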