Convert Nested JSON to CSV using Python script

Question

I am very new to Python and am trying to convert nested JSON to CSV. Below is the Python script I am trying, but it does not produce the desired output.

import json
import pandas as pd

# Open the file with a context manager and parse it with json.load()
with open('employee_data1.json', 'r') as file:
    # Parse the JSON data into a dictionary
    data = json.load(file)

# Flatten the parsed JSON into a DataFrame
df = pd.json_normalize(data)


# Print Result
print(df)

# output DataFrame to CSV file
df.to_csv('employee_data.csv')

I am running the code above against two JSON files and getting different output for each one.

employee_data1.json:

{
    "features": [
        {
            "candidate": {
                "first_name": "Margaret",
                "last_name": "Mcdonald",
                "skills": [
                    "skLearn",
                    "Java",
                    "R",
                    "SQL",
                    "Spark",
                    "C++"
                ],
                "state": "AL",
                "specialty": "Database",
                "experience": [
                    {
                        "company": "XYZ Corp",
                        "position": "Software Engineer",
                        "start_date": "2016-01-01",
                        "end_date": "2021-03-01"
                    },
                    {
                        "company": "ABC Inc",
                        "position": "Senior Software Engineer",
                        "start_date": "2021-04-01",
                        "end_date": null
                    }
                ],
                "relocation": "no"
            }
        },
        {
            "candidate": {
                "first_name": "Michael",
                "last_name": "Carter",
                "skills": [
                    "TensorFlow",
                    "R",
                    "Spark",
                    "MongoDB",
                    "C++",
                    "SQL"
                ],
                "state": "AR",
                "specialty": "Statistics",
                "experience": [
                    {
                        "company": "DFC Corp",
                        "position": "Software Engineer",
                        "start_date": "2016-01-01",
                        "end_date": "2021-03-01"
                    },
                    {
                        "company": "SDC Inc",
                        "position": "Senior Software Engineer",
                        "start_date": "2021-04-01",
                        "end_date": null
                    }
                ],
                "relocation": "yes"
            }
        }
    ]
}

employee_data2.json:

{
    "features": 
      {
        "candidate": {
          "first_name": "Margaret",
          "last_name": "Mcdonald",
          "skills": [
            "skLearn",
            "Java",
            "R",
            "SQL",
            "Spark",
            "C++"
          ],
          "state": "AL",
          "specialty": "Database",
          "experience": [
            {
              "company": "XYZ Corp",
              "position": "Software Engineer",
              "start_date": "2016-01-01",
              "end_date": "2021-03-01"
            },
            {
              "company": "ABC Inc",
              "position": "Senior Software Engineer",
              "start_date": "2021-04-01",
              "end_date": null
            }
          ],
          "relocation": "no"
        }
      }
  }

Below I have chosen only a few fields rather than all of them. This is the output I expect. I would be glad if someone could help me with this.

candidate.first_name, candidate.last_name, candidate.skills, candidate.state, candidate.experience.company, candidate.experience.position

Margaret, Mcdonald, "['skLearn', 'Java', 'R', 'SQL', 'Spark', 'C++']", AL, XYZ Corp, Software Engineer

Why would you do this? JSON is a much smarter way to store and transmit this data. There's no standard that allows a bracketed list in a CSV file.
– Tim Roberts, Commented Nov 5, 2023 at 5:53
@TimRoberts Is it possible to store a nested JSON array in an SQL table? Or to convert JSON to CSV and then store the CSV data in an SQL table?
– john, Commented Nov 6, 2023 at 4:37
Again, that's not a sensible path. Storing an array inside a field is not a good choice. If you need to store this long term, use something like MongoDB, which stores JSON documents natively.
– Tim Roberts, Commented Nov 6, 2023 at 4:46
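
To illustrate the point about bracketed lists in CSV, here is a minimal standard-library sketch (the variable names and the shortened skills list are illustrative, not from the original post). It suggests that a list written to a CSV cell is stored as its text representation and comes back as a plain string, not as a list:

```python
import csv
import io

skills = ["skLearn", "Java", "R"]  # shortened sample, assumed for illustration

# Write one header row and one data row to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["first_name", "skills"])
writer.writerow(["Margaret", skills])  # non-string values are written via str()

# Read the CSV back: every cell is now a string.
buffer.seek(0)
rows = list(csv.reader(buffer))
value = rows[1][1]

print(repr(value))
print(type(value).__name__)
```

Any consumer of such a file has to parse the bracketed text itself, which is why the format is not portable.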

Bushmaster · Accepted Answer · 2023-11-05 14:46:04Z

You can use json_normalize() like this:

df = pd.json_normalize(your_json_data, record_path=["features", ["candidate", "experience"]],
                       meta=[["features", "candidate", "first_name"], ["features", "candidate", "last_name"],
                             ["features", "candidate", "relocation"], ["features", "candidate", "skills"],
                             ["features", "candidate", "specialty"], ["features", "candidate", "state"]])

But it will throw this error:

ValueError: operands could not be broadcast together with shape (12,) (2,)

It is probably a bug; see the GitHub issue "BUG: json_normalize fails with empty arrays/lists". To avoid the error, convert the lists to strings, call json_normalize(), and then convert the string values back to lists:

# "features" may be a list of records or a single record (dict); handle both
features = your_json_data["features"]
if isinstance(features, dict):
    features = [features]
for i in features:
    i["candidate"]["skills"] = str(i["candidate"]["skills"])

After json_normalize (requires import ast):

df["features.candidate.skills"] = df["features.candidate.skills"].apply(ast.literal_eval)
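
The string round trip used here can be checked in isolation. The following is a minimal standard-library sketch; the skills value is copied from the question, while as_text and restored are illustrative names:

```python
import ast

skills = ["skLearn", "Java", "R", "SQL", "Spark", "C++"]

# str() renders the list as a Python literal, so the value becomes a plain
# string that can be passed through json_normalize as a scalar meta field.
as_text = str(skills)

# ast.literal_eval parses that literal back into a real list; unlike eval,
# it only accepts literals and does not execute arbitrary code.
restored = ast.literal_eval(as_text)

print(type(as_text).__name__, type(restored).__name__)
print(restored == skills)
```

A json.dumps() / json.loads() pair would likely serve the same purpose and keeps the intermediate string valid JSON.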

Out:

|    | company   | position                 | start_date   | end_date   | features.candidate.first_name   | features.candidate.last_name   | features.candidate.relocation   | features.candidate.skills                             | features.candidate.specialty   | features.candidate.state   |
|---:|:----------|:-------------------------|:-------------|:-----------|:--------------------------------|:-------------------------------|:--------------------------------|:------------------------------------------------------|:-------------------------------|:---------------------------|
|  0 | XYZ Corp  | Software Engineer        | 2016-01-01   | 2021-03-01 | Margaret                        | Mcdonald                       | no                              | ['skLearn', 'Java', 'R', 'SQL', 'Spark', 'C++']       | Database                       | AL                         |
|  1 | ABC Inc   | Senior Software Engineer | 2021-04-01   | nan        | Margaret                        | Mcdonald                       | no                              | ['skLearn', 'Java', 'R', 'SQL', 'Spark', 'C++']       | Database                       | AL                         |
|  2 | DFC Corp  | Software Engineer        | 2016-01-01   | 2021-03-01 | Michael                         | Carter                         | yes                             | ['TensorFlow', 'R', 'Spark', 'MongoDB', 'C++', 'SQL'] | Statistics                     | AR                         |
|  3 | SDC Inc   | Senior Software Engineer | 2021-04-01   | nan        | Michael                         | Carter                         | yes                             | ['TensorFlow', 'R', 'Spark', 'MongoDB', 'C++', 'SQL'] | Statistics                     | AR                         |

Full code:

import ast
import pandas as pd

# "features" may be a list of records or a single record (dict); handle both
features = your_json_data["features"]
if isinstance(features, dict):
    features = [features]
# Store the skills lists as strings for now to avoid the ValueError
for i in features:
    i["candidate"]["skills"] = str(i["candidate"]["skills"])

df = pd.json_normalize(your_json_data, record_path=["features", ["candidate", "experience"]],
                       meta=[["features", "candidate", "first_name"], ["features", "candidate", "last_name"],
                             ["features", "candidate", "relocation"], ["features", "candidate", "skills"],
                             ["features", "candidate", "specialty"], ["features", "candidate", "state"]])

# Turn the string values back into real lists
df["features.candidate.skills"] = df["features.candidate.skills"].apply(ast.literal_eval)
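
If one row per candidate is preferred, with short column names and the experience values collected into lists the way skills is, a possible sketch is shown below. It is not part of the original answer: it builds the rows by hand instead of using json_normalize(), the sample data is trimmed from the question, and the column names experience.company and experience.position are assumptions chosen for illustration.

```python
import pandas as pd

# Sample trimmed from the question; "features" may be a list or a single dict.
data = {
    "features": [
        {
            "candidate": {
                "first_name": "Margaret",
                "last_name": "Mcdonald",
                "skills": ["skLearn", "Java"],
                "state": "AL",
                "experience": [
                    {"company": "XYZ Corp", "position": "Software Engineer"},
                    {"company": "ABC Inc", "position": "Senior Software Engineer"},
                ],
            }
        }
    ]
}

# Normalize both input shapes to a list of feature records.
features = data["features"]
if isinstance(features, dict):
    features = [features]

# Build one flat row per candidate; nested experience entries become lists.
rows = []
for feature in features:
    cand = feature["candidate"]
    rows.append({
        "first_name": cand["first_name"],
        "last_name": cand["last_name"],
        "skills": cand["skills"],
        "state": cand["state"],
        "experience.company": [e["company"] for e in cand["experience"]],
        "experience.position": [e["position"] for e in cand["experience"]],
    })

df = pd.DataFrame(rows)

# With no path argument, to_csv() returns the CSV text; index=False drops
# the row-number column. Pass a filename instead to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The list cells are written as bracketed text, so a reader of the CSV still has to parse them back into lists.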

edited Nov 5, 2023 at 14:46

answered Nov 5, 2023 at 9:18

5 Comments

john Over a year ago

Hey, first of all many thanks for your code. It works for employee_data1.json but not for employee_data2.json, where I get an error.

Bushmaster Over a year ago

Okay. I edited my answer. Can you check it?

john Over a year ago

Thank you very much, it now works for both JSON files. If you don't mind, could you please make a few other changes too? The experience column should show its values the way the skills column does, instead of showing each experience of the same person in a separate record. Could you also rename features.candidate.first_name to first_name (without the features.candidate prefix), and the same for the other fields? Thanks again!

Tim Roberts Over a year ago

This is not a free consulting service. YOU need to take the initiative to clean up the suggestions that were made here.

john Over a year ago

@TimRoberts I have already tried on my end; please see the script I tried in my question. I came here for some help, and I am not asking without having tried at all. By the way, I am very new to Python.
