We are using the dataframe below to create a JSON file.

Input file

import pandas as pd
import numpy as np
a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
a3=[np.nan,np.nan,"StringType",np.nan,"BoolType","StringType"]
d1=pd.DataFrame(list(zip(a1,a2,a3)),columns=['data','action','datatype'])

We have to build the two structures below from the dataframe above, dynamically, so that its data fits the following formats.

For the schema, e.g.:

StructType([StructField(Column_name,Datatype,True)])

For the data, e.g.:

F.struct(F.col(column_name)).alias(json_expected_name)

1) Expected output structure for schema

StructType(
    [
        StructField("data",
            StructType(
                [
                    StructField("studentinfo",
                        StructType(
                            [
                                StructField("city",StringType(),True),
                                StructField("name",
                                    StructType(
                                        [
                                            StructField("id",
                                                StructType(
                                                    [
                                                        StructField("grant",BoolType(),True)
                                                    ]
                                                )
                                            )
                                        ]
                                    )
                                )
                            ]
                        )
                    ),
                    StructField("country",StringType(),True)
                ]
            )
        )
    ]
)
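
One possible way to get from the dotted `action` paths to a nested schema is sketched below in plain Python. It is only an illustration under some assumptions: the rows are copied from the dataframe above with `None` standing in for the NaN datatypes, rows without a datatype are treated as containers implied by their children, and the helper names `build_tree`, `to_schema_json` and `SPARK_TYPE_NAMES` are made up for this example. The result uses the dictionary layout of Spark's JSON schema format, which `StructType.fromJson` should accept when PySpark is available.

```python
# Rows from the question: (source column, dotted path, datatype or None).
rows = [
    ("DA_STinf", "data.studentinfo", None),
    ("DA_Stinf_NA", "data.studentinfo.name", None),
    ("DA_Stinf_city", "data.studentinfo.city", "StringType"),
    ("DA_Stinf_NA_ID", "data.studentinfo.name.id", None),
    ("DA_Stinf_NA_ID_GRANT", "data.studentinfo.name.id.grant", "BoolType"),
    ("DA_country", "data.country", "StringType"),
]

# Assumed mapping from the question's type labels to Spark's JSON type names.
SPARK_TYPE_NAMES = {"StringType": "string", "BoolType": "boolean"}


def build_tree(rows):
    """Nest the typed rows by splitting each dotted path on '.'."""
    tree = {}
    for _column, path, dtype in rows:
        if dtype is None:  # container rows are implied by their children
            continue
        *parents, leaf = path.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = dtype
    return tree


def to_schema_json(tree):
    """Render the nested dict in Spark's JSON schema layout."""
    fields = []
    for name, value in tree.items():
        if isinstance(value, dict):
            field_type = to_schema_json(value)  # nested struct
        else:
            field_type = SPARK_TYPE_NAMES[value]
        fields.append(
            {"name": name, "type": field_type, "nullable": True, "metadata": {}}
        )
    return {"type": "struct", "fields": fields}


schema_json = to_schema_json(build_tree(rows))
# With PySpark installed, this dict can be turned into a schema object:
#   from pyspark.sql.types import StructType
#   schema = StructType.fromJson(schema_json)
```

Building the intermediate dictionary first keeps the loop over the rows separate from the recursion that renders the schema, so the same tree could be reused for other outputs.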

2) Expected data fetch

df.select(      
    F.struct(
        F.struct(
                F.struct(F.col("DA_Stinf_city")).alias("city"),
                F.struct(
                    F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
                    ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)
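
The same path-splitting idea could be used to generate the select expression. The sketch below only renders the expression as text, so it can be inspected without a Spark session; the helper names `build_column_tree` and `render` are hypothetical, and `None` again stands in for the NaN datatypes. Note one difference from the expression above: this version emits every path segment, so the `grant` level appears explicitly inside `id`.

```python
# Rows from the question: (source column, dotted path, datatype or None).
rows = [
    ("DA_STinf", "data.studentinfo", None),
    ("DA_Stinf_NA", "data.studentinfo.name", None),
    ("DA_Stinf_city", "data.studentinfo.city", "StringType"),
    ("DA_Stinf_NA_ID", "data.studentinfo.name.id", None),
    ("DA_Stinf_NA_ID_GRANT", "data.studentinfo.name.id.grant", "BoolType"),
    ("DA_country", "data.country", "StringType"),
]


def build_column_tree(rows):
    """Nest the source column names under their dotted paths."""
    tree = {}
    for column, path, dtype in rows:
        if dtype is None:  # skip rows that only describe a container
            continue
        *parents, leaf = path.split(".")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = column
    return tree


def render(name, node):
    """Render one node as F.struct(...).alias(name), recursing into dicts."""
    if isinstance(node, dict):
        inner = ", ".join(render(child, value) for child, value in node.items())
    else:
        inner = f'F.col("{node}")'
    return f'F.struct({inner}).alias("{name}")'


tree = build_column_tree(rows)
expression = "df.select(" + ", ".join(render(k, v) for k, v in tree.items()) + ")"
print(expression)
```

In real PySpark code the recursion would return `F.struct(...)` column objects instead of strings; the text form is used here only to show the shape of the result.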

We have to use a for loop to add entries of this kind, drilling down data -> studentinfo -> name -> id for a path such as data.studentinfo.name.id, as I have already shown in the expected output structure.

  • I had a similar problem and used the solution of this Post Commented Feb 4, 2023 at 5:20
  • I am not converting a pandas dataframe to a Spark dataframe; I want to create a JSON structure. Commented Feb 4, 2023 at 5:56
  • I am OK with a Python solution as well; I'll manage it in PySpark. Commented Feb 4, 2023 at 5:58
  • Based on the "." separators in the action column, we have to drill down into the JSON structure. Commented Feb 4, 2023 at 6:09

1 Answer

Below is the JSON produced from the dataframe. You now need to reassemble it into the new hierarchical JSON structure that you want. The action column holds the hierarchy elements of your tree and the datatype column holds the type. I think you can assume that null datatypes are numeric. The datatype for name is wrong as null; it should be StringType.

import pandas as pd
import numpy as np
import json

a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
a3=["StructType","StructType","StringType","NumberType","BoolType","StringType"]
df=pd.DataFrame(list(zip(a1,a2,a3)),columns=['data','action','datatype'])

json_tree=df.to_json()


{
   "data":{
      "0":"DA_STinf",
      "1":"DA_Stinf_NA",
      "2":"DA_Stinf_city",
      "3":"DA_Stinf_NA_ID",
      "4":"DA_Stinf_NA_ID_GRANT",
      "5":"DA_country"
   },
   "action":{
      "0":"data.studentinfo",
      "1":"data.studentinfo.name",
      "2":"data.studentinfo.city",
      "3":"data.studentinfo.name.id",
      "4":"data.studentinfo.name.id.grant",
      "5":"data.country"
   },
   "datatype":{
      "0":"StructType",
      "1":"StructType",
      "2":"StringType",
      "3":"NumberType",
      "4":"BoolType",
      "5":"StringType"
   }
}

def convert_action_to_hierarchy(data):
    """Map each path segment in 'action' to a (level, datatype) tuple."""
    data = json.loads(data)
    action = data['action']
    datatype_list = data['datatype']
    result = {}
    for i in range(len(action)):
        action_list = action[str(i)].split('.')
        for j, name in enumerate(action_list):
            if j == len(action_list) - 1:
                # the last segment is the field this row describes,
                # so it takes the datatype of row i
                result[name] = (j, datatype_list[str(i)])
            else:
                # ancestors are containers unless a row already typed them
                result.setdefault(name, (j, 'StructType'))
    return result

print(convert_action_to_hierarchy(json_tree))

output:

{'data': (0, 'StructType'), 'studentinfo': (1, 'StructType'), 'name': (2, 'StructType'), 'city': (2, 'StringType'), 'id': (3, 'NumberType'), 'grant': (4, 'BoolType'), 'country': (1, 'StringType')}

The number in each tuple is the level in the hierarchy.
