Using Pandas to convert JSON to CSV with specific fields

Question

I am currently trying to convert a JSON file to a CSV file using Pandas.

The code I'm using now converts the whole JSON file to a CSV file:

import pandas as pd
from pandas.io.json import json_normalize

json_data = pd.read_json("out1.json")
df = json_normalize(json_data["events"])
df.to_csv("out.csv")

This is my JSON file:

{
  "events": [
    {
      "raw": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on  by 80801234 at Area A\n\"}",
      "logtypes": [
        "json"
      ],
      "timestamp": 1537190572023,
      "unparsed": null,
      "logmsg": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on  by 80801234 at Area A\n\"}",
      "id": "c77afb4c-ba7c-11e8-8000-12b233ae723a",
      "tags": [
        "INFO"
      ],
      "event": {
        "json": {
          "message": "Disabled camera with QR scan on  by 80801234 at Area A\n",
          "level": "INFO"
        },
        "http": {
          "clientHost": "116.197.237.29",
          "contentType": "text/plain; charset=UTF-8"
        }
      }
    },
    {
      "raw": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
      "logtypes": [
        "json"
      ],
      "timestamp": 1537190528619,
      "unparsed": null,
      "logmsg": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
      "id": "ad9c0175-ba7c-11e8-803d-12b233ae723a",
      "tags": [
        "INFO"
      ],
      "event": {
        "json": {
          "message": "Employee number saved successfully.",
          "level": "INFO"
        },
        "http": {
          "clientHost": "116.197.237.29",
          "contentType": "text/plain; charset=UTF-8"
        }
      }
    }
  ]
}

But I only want some of the fields (timestamp, level, message) from the JSON file, not all of them.

I have tried a variety of ways:

df = json_normalize(json_data["timestamp"])  # gives a KeyError on 'timestamp'

df = json_normalize(json_data, 'timestamp', ['event', 'json', ['level', 'message']])  # TypeError: string indices must be integers

Where did I go wrong?

Does df = df[['timestamp', 'event.json.level', 'event.json.message']] give what you want after df = json_normalize(json_data["events"])?
– roganjosh, Commented Sep 18, 2018 at 15:29
@roganjosh It does! Thanks a lot! But do you mind explaining why my second way of normalizing the data doesn't work? It should also be able to extract what I wanted, right? Please post your reply as an answer so I can accept it.
– Kai, Commented Sep 18, 2018 at 15:40
I'm trying to find a better way of doing it. I'm not overly familiar with this method, so I'm learning as I go.
– roganjosh, Commented Sep 18, 2018 at 15:49
Wow, I have to get going now. Sorry, I'm not sure why I'm struggling with that so much. I need to spend some more time on it, but I've got to go out now. The examples in the docs are not enough for me to understand the nested structure, so I'll have to do some more research.
– roganjosh, Commented Sep 18, 2018 at 15:57

piRSquared · Accepted Answer · 2018-09-18 20:43:00Z

I don't think json_normalize is intended to work on this specific orientation. I could be wrong, but from the documentation it appears that normalization means "deal with lists within each dictionary".

Assume data is

import json
data = json.load(open('out1.json'))['events']

Look at the first entry

data[0]['timestamp']

1537190572023

To use timestamp as the record path, json_normalize wants this to be a list of records

[{'timestamp': 1537190572023}]
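To make the "lists within each dictionary" point concrete, here is a minimal sketch on made-up records. The orders, info, and items names are hypothetical and not from the question, and I'm assuming pandas 1.0 or later, where the function is exposed as pd.json_normalize. The record path points at a list inside each dictionary, and the other fields are pulled in as metadata.

```python
import pandas as pd

# Hypothetical records: each one carries a list under "items"
orders = [
    {"id": 1, "info": {"level": "INFO"}, "items": [{"sku": "a"}, {"sku": "b"}]},
    {"id": 2, "info": {"level": "WARN"}, "items": [{"sku": "c"}]},
]

# One output row per element of "items"; "id" and the nested
# "info" -> "level" value are repeated alongside each row
flat = pd.json_normalize(
    orders,
    record_path="items",
    meta=["id", ["info", "level"]],
)
print(flat)
```

In the question's data, timestamp is a plain integer rather than a list like items, which is why it can't be used as the record path directly.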

Create augmented `data2`

I don't actually recommend this approach.
If we create data2 accordingly:

data2 = [{**d, **{'timestamp': [{'timestamp': d['timestamp']}]}} for d in data]

We can use json_normalize

json_normalize(
    data2, 'timestamp',
    [['event', 'json', 'level'], ['event', 'json', 'message']]
)

       timestamp event.json.level                                 event.json.message
0  1537190572023             INFO  Disabled camera with QR scan on  by 80801234 a...
1  1537190528619             INFO                Employee number saved successfully.

Comprehension

I think it's simpler to just do

pd.DataFrame([
    (d['timestamp'],
     d['event']['json']['level'],
     d['event']['json']['message'])
    for d in data
], columns=['timestamp', 'level', 'message'])

       timestamp level                                            message
0  1537190572023  INFO  Disabled camera with QR scan on  by 80801234 a...
1  1537190528619  INFO                Employee number saved successfully.
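One caveat with the comprehension: direct indexing raises a KeyError if an event has no event.json block. Both events in the question have one, so this is only a hedged sketch for messier logs. The pick_fields helper and the second event below are my own illustrations, not part of the original data.

```python
import pandas as pd

def pick_fields(event):
    """Return (timestamp, level, message), with None for anything missing."""
    payload = event.get("event", {}).get("json", {})
    return event.get("timestamp"), payload.get("level"), payload.get("message")

events = [
    {"timestamp": 1537190528619,
     "event": {"json": {"level": "INFO",
                        "message": "Employee number saved successfully."}}},
    # Hypothetical event without a "json" block
    {"timestamp": 1537190572023,
     "event": {"http": {"clientHost": "116.197.237.29"}}},
]

df = pd.DataFrame(
    [pick_fields(e) for e in events],
    columns=["timestamp", "level", "message"],
)
print(df)
```

Rows with missing fields end up with empty values in those columns instead of stopping the conversion.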

`json_normalize`

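This uses json_normalize without the fancy arguments: flatten everything, then keep timestamp and join the columns whose names contain event.json.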

json_normalize(data).pipe(
    lambda d: d[['timestamp']].join(
        d.filter(like='event.json')
    )
)

       timestamp event.json.level                                 event.json.message
0  1537190572023             INFO  Disabled camera with QR scan on  by 80801234 a...
1  1537190528619             INFO                Employee number saved successfully.
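Since the question is ultimately about the CSV, here is a rough end-to-end sketch built on the column selection from the comments. I'm assuming pandas 1.0 or later (pd.json_normalize), and I've inlined a trimmed copy of the events instead of reading out1.json so the snippet runs on its own. The renaming and the whitespace stripping are optional choices of mine, not something the question asked for.

```python
import pandas as pd

# Trimmed stand-in for json.load(open('out1.json'))['events']
data = [
    {"timestamp": 1537190572023,
     "event": {"json": {"message": "Disabled camera with QR scan on  by 80801234 at Area A\n",
                        "level": "INFO"},
               "http": {"clientHost": "116.197.237.29"}}},
    {"timestamp": 1537190528619,
     "event": {"json": {"message": "Employee number saved successfully.",
                        "level": "INFO"},
               "http": {"clientHost": "116.197.237.29"}}},
]

# Flatten, then keep only the three fields of interest
df = pd.json_normalize(data)[["timestamp", "event.json.level", "event.json.message"]]

# Optional: shorter column names, and drop the trailing newline in the
# first message so each record stays on one CSV line
df = df.rename(columns={"event.json.level": "level", "event.json.message": "message"})
df["message"] = df["message"].str.strip()

# With no path, to_csv returns the CSV text; pass a path such as
# "out.csv" to write a file instead
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing index=False keeps the row numbers (0, 1, ...) out of the file.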
