
I want to import a JSON file (even though it looks more like a plain text file) where each line is a small JSON object with four key-value pairs. Every JSON object should become a row in a pandas DataFrame with four columns.

Example:

# Inside "data.json"

{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
...
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}

A daily data.json is approx. 30 MB and contains roughly 250,000 lines.

A monthly data.json is approx. 900 MB and contains roughly 7.5M lines.

The following code snippet does the job but is far too slow. Alternatives to pandas are welcome too; I am not limited to pandas, but I am inexperienced with handling huge amounts of log data.

My attempt:

import pandas as pd
import json

df = pd.DataFrame()
with open('data.json', 'r') as stacked_json_file:
    row_idx = -1

    for json_line in stacked_json_file:
        row_idx += 1
        df = df.append(pd.DataFrame(json.loads(json_line), index=[row_idx]))

Is this possibly slow because pd.DataFrame.append does not append in place?

2 Comments
  • Try reading the JSON directly into pandas with read_json: df = pd.read_json('data.json')
  • On a side note: the data format you have to work with is very inefficient. A (compressed) CSV file would save a lot of space and import time. I know it is often not in our hands, but I would consider asking the data source to change the format (a conversion sketch follows below).
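
A minimal sketch of that conversion, assuming the JSON-lines file can be loaded once as shown in the answer below (the CSV filename is illustrative):

import pandas as pd

# Load the JSON-lines file once (see the answer below for details).
df = pd.read_json('data.json', lines=True)

# Write a gzip-compressed CSV; pandas infers the compression from the .gz suffix.
df.to_csv('data.csv.gz', index=False)

# Re-reading the compressed CSV is typically faster than re-parsing JSON lines.
df = pd.read_csv('data.csv.gz')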

1 Answer


This should work:

pd.read_json('data.json', lines=True)

Out:

                       time        id  position  active_current
0  2020-07-01:14:27:16.0000  m38dk117     66277            17.1
1  2020-07-01:14:27:16.0000  m38dk118      3277             0.0
2  2020-07-30:14:27:16.0000  m38dk006     73117             0.0
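
As an aside for the ~900 MB monthly files: with lines=True, read_json also accepts a chunksize argument and then returns an iterator of DataFrames instead of one large frame, so the file never has to be parsed into memory all at once. A sketch, where the chunk size of 100_000 is an arbitrary choice:

import pandas as pd

# lines=True plus chunksize makes read_json yield DataFrames chunk by chunk.
reader = pd.read_json('data.json', lines=True, chunksize=100_000)

parts = []
for chunk in reader:
    # Process or reduce each chunk here; keeping only needed columns saves memory.
    parts.append(chunk[['time', 'id', 'active_current']])

df = pd.concat(parts, ignore_index=True)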

If this doesn't work:

Assuming your data is not one valid JSON document but a text file with a JSON object on every line, you can build a list of dicts first to avoid appending to the DataFrame in a loop:

import json

with open('data.json', 'r') as f:
    dictlist = [json.loads(line) for line in f]
pd.DataFrame(dictlist)

Out:

                       time        id position active_current
0  2020-07-01:14:27:16.0000  m38dk117    66277           17.1
1  2020-07-01:14:27:16.0000  m38dk118     3277            0.0
2  2020-07-30:14:27:16.0000  m38dk006    73117            0.0
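
Note that with this approach every column keeps the string values from the file. If you need real dtypes downstream, a conversion sketch (the to_datetime format string mirrors the 2020-07-01:14:27:16.0000 layout shown above):

import pandas as pd

df = pd.DataFrame(dictlist)

# All fields arrive as strings; convert them to proper dtypes.
df['position'] = df['position'].astype('int64')
df['active_current'] = df['active_current'].astype('float64')
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d:%H:%M:%S.%f')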

Micro-Benchmark

Comparing the iterative append, building a list of dictionaries first, and pd.read_json.

Result

[Benchmark plot: runtime vs. len(df) for appenddf, dictlist, and pdjson]

Code used for the benchmark

import pandas as pd
import json
import io

def makedata(n):
    # Repeat three sample records n times to build an in-memory JSON-lines string.
    t = '''{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}
{"time": "2020-07-01:14:27:16.0000", "id": "m38dk118", "position": "3277", "active_current": "0.0"}
{"time": "2020-07-30:14:27:16.0000", "id": "m38dk006", "position": "73117", "active_current": "0.0"}
''' * n
    return t

def pdjson(file):
    # Let pandas parse the JSON-lines data directly.
    return pd.read_json(io.StringIO(file), lines=True)

def dictlist(file):
    # Parse every line into a dict first, then build the DataFrame once.
    with io.StringIO(file) as f:
        l = [json.loads(x) for x in f]
    return pd.DataFrame(l)

def appenddf(file):
    # The original approach: append one single-row DataFrame per line.
    df = pd.DataFrame()
    with io.StringIO(file) as stacked_json_file:
        row_idx = -1
        for json_line in stacked_json_file:
            row_idx += 1
            df = df.append(pd.DataFrame(json.loads(json_line), index=[row_idx]))
    return df

import perfplot
perfplot.show(
    setup=makedata,
    kernels=[appenddf, dictlist, pdjson],
    n_range=[2**k for k in range(5, 16)],
    equality_check=None,  # outputs differ in dtypes, so skip the equality check
    xlabel='len(df)',
)

2 Comments

Thanks, works perfectly. As a side note on the dictlist approach: is replacing [] with () to create a generator more efficient, or is the difference negligible? If so, how would one check this?
I compared lists and generators with small files (~200 MB) of your test data: negligible difference in runtime. So for big files, your idea to use generators seems to be a good option if memory usage is a concern (a quick self-check is sketched below).
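
A quick way to run that comparison yourself, a sketch using timeit on generated data (note that pd.DataFrame materializes the generator into a list internally, which is consistent with the negligible runtime difference measured above):

import io
import json
import timeit

import pandas as pd

line = '{"time": "2020-07-01:14:27:16.0000", "id": "m38dk117", "position": "66277", "active_current": "17.1"}\n'
sample = line * 100_000

def with_list():
    with io.StringIO(sample) as f:
        return pd.DataFrame([json.loads(x) for x in f])

def with_generator():
    # The generator is consumed (and list-ified) inside the DataFrame constructor.
    with io.StringIO(sample) as f:
        return pd.DataFrame(json.loads(x) for x in f)

print(timeit.timeit(with_list, number=3))
print(timeit.timeit(with_generator, number=3))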
