
I want to import a JSON file (even though it looks more like a plain text file) where each line is a small JSON object with four key-value pairs. Every JSON object should become a row in a pandas DataFrame with four columns.

Example:

# Inside "data.json"

{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
...
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}

A daily data.json is approx. 30 MB and contains roughly 250,000 lines.

A monthly data.json is approx. 900 MB and contains roughly 7.5M lines.

The following code snippet does the job but is far too slow. Alternatives to pandas are welcome too; I am not limited to pandas, but I am inexperienced with handling huge amounts of log data.

My attempt:

import pandas as pd
import json

df = pd.DataFrame()
with open('data.json', 'r') as stacked_json_file:
    row_idx = -1

    for json_line in stacked_json_file:
        row_idx += 1
        df = df.append(pd.DataFrame(json.loads(json_line), index=[row_idx]))

Is this possibly slow because pd.DataFrame.append does not append in place?

2 Comments
  • Try reading the JSON directly into pandas with read_json: df = pd.read_json('data.json')
  • On a side note: the data format you have to work with is very inefficient. A (compressed) CSV file would save a lot of space and import time. I know it is often not in our hands, but I would consider asking the data source to change the format (a conversion sketch follows below).
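
A minimal sketch of that conversion, assuming the JSON-lines file can be loaded once as shown in the answer below (the CSV filename is illustrative):

import pandas as pd

# Load the JSON-lines file once (see the answer below for details).
df = pd.read_json('data.json', lines=True)

# Write a gzip-compressed CSV; pandas infers the compression from the .gz suffix.
df.to_csv('data.csv.gz', index=False)

# Re-reading the compressed CSV is typically faster than re-parsing JSON lines.
df = pd.read_csv('data.csv.gz')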

1 Answer


This should work:

pd.read_json('data.json', lines=True)

Out:

                       time        id  position  active_current
0  2020-07-01:14:27:16.0000  m38dk117     66277            17.1
1  2020-07-01:14:27:16.0000  m38dk118      3277             0.0
2  2020-07-30:14:27:16.0000  m38dk006     73117             0.0
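
As an aside for the ~900 MB monthly files: with lines=True, read_json also accepts a chunksize argument and then returns an iterator of DataFrames instead of one large frame, so the file never has to be parsed into memory all at once. A sketch, where the chunk size of 100_000 is an arbitrary choice:

import pandas as pd

# lines=True plus chunksize makes read_json yield DataFrames chunk by chunk.
reader = pd.read_json('data.json', lines=True, chunksize=100_000)

parts = []
for chunk in reader:
    # Process or reduce each chunk here; keeping only needed columns saves memory.
    parts.append(chunk[['time', 'id', 'active_current']])

df = pd.concat(parts, ignore_index=True)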

If this doesn't work:

Assuming your data is not one valid JSON document but a text file with a JSON object on every line, you can build a list of dicts first to avoid appending to the DataFrame in a loop:

import json

with open('data.json', 'r') as f:
    dictlist = [json.loads(line) for line in f]
pd.DataFrame(dictlist)

Out:

                       time        id position active_current
0  2020-07-01:14:27:16.0000  m38dk117    66277           17.1
1  2020-07-01:14:27:16.0000  m38dk118     3277            0.0
2  2020-07-30:14:27:16.0000  m38dk006    73117            0.0
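
Note that with this approach every column keeps the string values from the file. If you need real dtypes downstream, a conversion sketch (the to_datetime format string mirrors the 2020-07-01:14:27:16.0000 layout shown above):

import pandas as pd

df = pd.DataFrame(dictlist)

# All fields arrive as strings; convert them to proper dtypes.
df['position'] = df['position'].astype('int64')
df['active_current'] = df['active_current'].astype('float64')
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d:%H:%M:%S.%f')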

Micro-Benchmark

Comparing the iterative append, building a list of dictionaries first, and pd.read_json.

Result

[Benchmark plot: runtime vs. len(df) for appenddf, dictlist, and pdjson]

Code used for the benchmark

import pandas as pd
import json
import io

def makedata(n):
    # Repeat three sample records n times to build an in-memory JSON-lines string.
    t = '''{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}
''' * n
    return t

def pdjson(file):
    # Let pandas parse the JSON-lines data directly.
    return pd.read_json(io.StringIO(file), lines=True)

def dictlist(file):
    # Parse every line into a dict first, then build the DataFrame once.
    with io.StringIO(file) as f:
        l = [json.loads(x) for x in f]
    return pd.DataFrame(l)

def appenddf(file):
    # The original approach: append one single-row DataFrame per line.
    df = pd.DataFrame()
    with io.StringIO(file) as stacked_json_file:
        row_idx = -1
        for json_line in stacked_json_file:
            row_idx += 1
            df = df.append(pd.DataFrame(json.loads(json_line), index=[row_idx]))
    return df

import perfplot
perfplot.show(
    setup=makedata,
    kernels=[appenddf, dictlist, pdjson],
    n_range=[2**k for k in range(5, 16)],
    equality_check=None,  # outputs differ in dtypes, so skip the equality check
    xlabel='len(df)',
)

2 Comments

Thanks, works perfectly. As a side note on the dictlist approach: is replacing [] with () to create a generator more efficient, or is the difference negligible? If so, how would one check this?
I compared lists and generators with small files (~200 MB) of your test data: negligible difference in runtime. So for big files, your idea to use generators seems to be a good option if memory usage is a concern (a quick self-check is sketched below).
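
A quick way to run that comparison yourself, a sketch using timeit on generated data (note that pd.DataFrame materializes the generator into a list internally, which is consistent with the negligible runtime difference measured above):

import io
import json
import timeit

import pandas as pd

line = '{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}\n'
sample = line * 100_000

def with_list():
    with io.StringIO(sample) as f:
        return pd.DataFrame([json.loads(x) for x in f])

def with_generator():
    # The generator is consumed (and list-ified) inside the DataFrame constructor.
    with io.StringIO(sample) as f:
        return pd.DataFrame(json.loads(x) for x in f)

print(timeit.timeit(with_list, number=3))
print(timeit.timeit(with_generator, number=3))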
