I am new to PySpark and not yet familiar with all the functions and capabilities it has to offer.
I have a PySpark DataFrame with a column that contains nested JSON values, for example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQL").getOrCreate()
rows = [['Alice', """{
"level1":{
"tag1":{
"key1":"value1",
"key2":"value2",
"key3":"value3"
}
},
"level2":{
"tag1":{
"key1":"value1"
}
},
"level3":{
"tag1":{
"key1":"value1",
"key2":"value2",
"key3":"value3"
},
"tag2":{
"key1":"value1"
}
}}"""
]]
columns = ['name', 'Levels']
df = spark.createDataFrame(rows, columns)
The number of levels, tags, and key:value pairs in each tag is not under my control and may change.
My goal is to create a new DataFrame from the original, with one row for each (level, tag, key, value) tuple in the corresponding columns. From the row in the example above, that would produce 8 new rows in the form:
(name, level, tag, key, value)
Alice, level1, tag1, key1, value1
Alice, level1, tag1, key2, value2
Alice, level1, tag1, key3, value3
Alice, level2, tag1, key1, value1
Alice, level3, tag1, key1, value1
Alice, level3, tag1, key2, value2
Alice, level3, tag1, key3, value3
Alice, level3, tag2, key1, value1