I am currently trying to convert a JSON file to a CSV file using Pandas.

My current code converts the whole JSON file to CSV:

import pandas as pd
# On pandas >= 1.0, pd.json_normalize replaces this import.
from pandas.io.json import json_normalize

json_data = pd.read_json("out1.json")
df = json_normalize(json_data["events"])
df.to_csv("out.csv")

This is my JSON file:

{
  "events": [
    {
      "raw": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on  by 80801234 at Area A\n\"}",
      "logtypes": [
        "json"
      ],
      "timestamp": 1537190572023,
      "unparsed": null,
      "logmsg": "{\"level\": \"INFO\", \"message\": \"Disabled camera with QR scan on  by 80801234 at Area A\n\"}",
      "id": "c77afb4c-ba7c-11e8-8000-12b233ae723a",
      "tags": [
        "INFO"
      ],
      "event": {
        "json": {
          "message": "Disabled camera with QR scan on  by 80801234 at Area A\n",
          "level": "INFO"
        },
        "http": {
          "clientHost": "116.197.237.29",
          "contentType": "text/plain; charset=UTF-8"
        }
      }
    },
    {
      "raw": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
      "logtypes": [
        "json"
      ],
      "timestamp": 1537190528619,
      "unparsed": null,
      "logmsg": "{\"level\": \"INFO\", \"message\": \"Employee number saved successfully.\"}",
      "id": "ad9c0175-ba7c-11e8-803d-12b233ae723a",
      "tags": [
        "INFO"
      ],
      "event": {
        "json": {
          "message": "Employee number saved successfully.",
          "level": "INFO"
        },
        "http": {
          "clientHost": "116.197.237.29",
          "contentType": "text/plain; charset=UTF-8"
        }
      }
    }
  ]
}

But I only want some of the fields (timestamp, level, message) from the JSON file, not all of them.

I have tried a couple of approaches:

df = json_normalize(json_data["timestamp"])  # KeyError: 'timestamp'

df = json_normalize(json_data, 'timestamp', ['event', 'json', ['level', 'message']])  # TypeError: string indices must be integers

Where did I go wrong?

  • Does df = df[['timestamp', 'event.json.level', 'event.json.message']] give what you want after df = json_normalize(json_data["events"])? Commented Sep 18, 2018 at 15:29
  • @roganjosh It does! Thanks a lot! But would you mind explaining why my second way of normalizing the data doesn't work? It should also be able to extract what I wanted, right? Please post your reply as an answer so I can accept it. Commented Sep 18, 2018 at 15:40
  • I'm trying to find a better way of doing it, I'm not overly familiar with this method so I'm learning as I go Commented Sep 18, 2018 at 15:49
  • Wow, I have to get going now, I'm not sure why I'm struggling with that so much sorry. I need to spend some more time on it but got to go out now. The examples in the docs are not enough for me to understand the nested structure, I'll have to do some more research. Commented Sep 18, 2018 at 15:57

1 Answer

I don't think json_normalize is intended to work on data shaped like this. I could be wrong, but from the documentation it appears that normalization means "deal with lists within each dictionary".

Assume data is

import json

# Load the file and keep only the list of event dictionaries.
with open('out1.json') as f:
    data = json.load(f)['events']

Look at the first entry

data[0]['timestamp']

1537190572023

To use 'timestamp' as the record path, json_normalize wants this value to be a list of records

[{'timestamp': 1537190572023}]
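
As a rough illustration of why the two attempts in the question fail, here is a minimal sketch. It assumes a trimmed, hypothetical stand-in for out1.json (the `doc` dictionary below) and reads it from an in-memory buffer instead of a file. The frame returned by read_json has a single column, "events", so json_data["timestamp"] has nothing to look up, and iterating the frame yields column labels (strings) rather than event dictionaries, which is most likely where "string indices must be integers" comes from.

```python
import io
import json

import pandas as pd

# Trimmed, hypothetical stand-in for out1.json with the same nesting.
doc = {
    "events": [
        {
            "timestamp": 1537190572023,
            "event": {"json": {"level": "INFO",
                               "message": "Disabled camera with QR scan"}},
        },
        {
            "timestamp": 1537190528619,
            "event": {"json": {"level": "INFO",
                               "message": "Employee number saved successfully."}},
        },
    ]
}

# read_json turns the top-level key into the frame's only column.
json_data = pd.read_json(io.StringIO(json.dumps(doc)))
columns = list(json_data.columns)

# Iterating a DataFrame yields its column labels, not the event dicts.
iterated = list(json_data)

# The event dicts themselves live inside the "events" column.
first_event = json_data["events"].iloc[0]

print(columns)                   # the only column is "events"
print(iterated)                  # column labels again, as strings
print(first_event["timestamp"])  # a scalar, not a list of records
```

So "timestamp" only exists one level down, inside each event, and there it is a plain integer rather than a list.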

Create augmented data2

I don't actually recommend this approach.
If we create data2 accordingly:

data2 = [{**d, **{'timestamp': [{'timestamp': d['timestamp']}]}} for d in data]

We can use json_normalize

json_normalize(
    data2, 'timestamp',
    [['event', 'json', 'level'], ['event', 'json', 'message']]
)

       timestamp event.json.level                                 event.json.message
0  1537190572023             INFO  Disabled camera with QR scan on  by 80801234 a...
1  1537190528619             INFO                Employee number saved successfully.

Comprehension

I think it's simpler to just do

pd.DataFrame([
    (d['timestamp'],
     d['event']['json']['level'],
     d['event']['json']['message'])
    for d in data
], columns=['timestamp', 'level', 'message'])

       timestamp level                                            message
0  1537190572023  INFO  Disabled camera with QR scan on  by 80801234 a...
1  1537190528619  INFO                Employee number saved successfully.

json_normalize

But without the fancy arguments

json_normalize(data).pipe(
    lambda d: d[['timestamp']].join(
        d.filter(like='event.json')
    )
)

       timestamp event.json.level                                 event.json.message
0  1537190572023             INFO  Disabled camera with QR scan on  by 80801234 a...
1  1537190528619             INFO                Employee number saved successfully.
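
Putting the pieces together, the following is a sketch of one way to go from the events list to a three-column CSV. It assumes pandas 1.0 or later (where pd.json_normalize is available at the top level), uses a trimmed copy of the two events as inline data, and writes to an in-memory buffer; the `wanted` mapping and the short column names are illustrative choices, not something the question prescribes.

```python
import io

import pandas as pd

# Trimmed copy of the two events from the question (only the fields we need).
data = [
    {
        "timestamp": 1537190572023,
        "event": {"json": {
            "message": "Disabled camera with QR scan on  by 80801234 at Area A\n",
            "level": "INFO",
        }},
    },
    {
        "timestamp": 1537190528619,
        "event": {"json": {
            "message": "Employee number saved successfully.",
            "level": "INFO",
        }},
    },
]

# Flatten the nested dicts; nested keys become dotted column names.
df = pd.json_normalize(data)

# Map the flattened names we want to keep onto shorter output names.
wanted = {
    "timestamp": "timestamp",
    "event.json.level": "level",
    "event.json.message": "message",
}
out = df[list(wanted)].rename(columns=wanted)

# Optional: drop the trailing newline so each record stays on one CSV line.
out["message"] = out["message"].str.strip()

# Write to a buffer here; pass a path such as "out.csv" to write a file.
buffer = io.StringIO()
out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False keeps the row index out of the file, so the CSV contains only the three requested columns.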