1

I have created a function responsible for parsing JSON object, extracting useful fields and creating a Pandas data frame.

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df

The metrics is an array of JSON documents looking similar to (first two rows):

[{'agent': {'fitness': 25.2375,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 1,
   'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 2,
   'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]

Then executed as follows:

df = parse_metrics_to_df(metrics)

enter image description here

The code works as expected but it's extremely slow. Parsing array with a million objects takes nearly 1 hour.

How to do this properly?

0

2 Answers 2

1

Manipulating Series object is the bottleneck. Creating new Series from dict can be much more faster.

Setup

import pandas as pd

metrics = [{'agent': {'fitness': 25.2375,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 1,
   'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 2,
   'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]
thousand_metrics = metrics * 1000 

Original code

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df

%timeit df = parse_metrics_to_df(thousand_metrics)

# 4.06 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Modified code

def parse_metrics_to_df2(metrics):
    def extract_details(row):
        res = {}
        res['trial'] = row['agent']['trial']
        res['numerosity'] = row['agent']['numerosity']
        res['reliable'] = row['agent']['reliable']
        res['was_correct'] = row['performance']['was_correct']
        return pd.Series(res)

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.set_index('trial', inplace=True)

    return df

df = parse_metrics_to_df2(thousand_metrics)
df2 = parse_metrics_to_df2(thousand_metrics)
df.equals(df2) # True

%timeit df2 = parse_metrics_to_df2(thousand_metrics)

# 566 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, 7x faster.

Sign up to request clarification or add additional context in comments.

Comments

1

You should see a significant (~9x for me) improvement by using a simple list comprehension.

Often pd.DataFrame has overheads which can be avoided by performing manipulations before the data is put in the dataframe.

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df


def jp(metrics):

    lst = [[d['agent']['trial'], d['agent']['numerosity'], d['agent']['reliable'],
            d['performance']['was_correct']] for d in metrics]

    df = pd.DataFrame(lst, columns=['trial', 'agent', 'environment', 'performance'])
    df = df.set_index('trial')

    return df

%timeit parse_metrics_to_df(metrics)   # 14.4 ms
%timeit jp(metrics)                    # 1.6 ms

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.