I've been wrestling with this for many days now and would appreciate any help.

I'm importing an Excel file into a pandas DataFrame, which gives me the following data:

account_id    name    timestamp            value
A0001C        Fund_1  1588618800000000000  1
B0001B        Dev_2   1601578800000000000  1

I'm looking to produce nested JSON output (it will be used to submit data to an API), including a top-level "records" array and a "metrics" array for each record.

Here is the output I'm looking for:

{
  "records": [
    {
      "name": "Fund_1", 
      "account_id": "A0001C", 
      "metrics": [
        {
          "timestamp": 1588618800000000000, 
          "value": 1
        }
      ]
    },
    {
      "name": "Dev_2", 
      "account_id": "B0001B", 
      "metrics": [
        {
          "timestamp": 1601578800000000000, 
          "value": 1
        } 
      ]
    }
  ]
}

I've managed to get non-nested JSON output, but I'm not able to split out the timestamp and value into the metrics part.

import json

for record in df.to_dict(orient='records'):
    record_data = {'records': [record]}
    payload_json = json.dumps(record_data)
    print(payload_json)

I get the following output:

{"records": [{"account_id": "A0001C", "name": "Fund_1", "Date Completed": 1588618800000000000, "Count": "1"}]}
{"records": [{"account_id": "B0001B", "name": "Dev_2", "Date Completed": 1601578800000000000, "Count": "1"}]}

Any help on how I can modify my code to add the metrics label and nest the data would be appreciated.

Thanks in advance.

  • Is the data particularly large, or are there any constraints which make adding an auxiliary column / using map infeasible? Commented Jan 26, 2021 at 19:39
  • The data is 4,000 rows in the Excel sheet. I'm a real novice at coding, so I'm not sure about the feasibility of using map. I can add columns to the dataframe, but I'm still not able to nest the JSON output. Commented Jan 26, 2021 at 20:22
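
For reference, the auxiliary-column idea from the comments above can also be sketched without apply, using a plain list comprehension over the two columns (this assumes the "timestamp" and "value" column names shown in the question):

# Build a nested "metrics" column with a list comprehension instead of df.apply.
# Assumes df has the "timestamp" and "value" columns from the question.
df["metrics"] = [
    [{"timestamp": ts, "value": val}]
    for ts, val in zip(df["timestamp"], df["value"])
]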

1 Answer


One approach is to use DataFrame.apply, which applies a function along an axis of your dataframe (either column-wise or row-wise).

In your particular case, you want to apply the function row-by-row, so you have to use apply with axis=1:

records = list(df.apply(lambda row: {"name": row["name"],
                                     "account_id": row["account_id"],
                                     "metrics": [{
                                         "timestamp": row["timestamp"],
                                         "value": row["value"]}]
                                    },
                        axis=1).values)
payload = {"records": records}

Alternatively, you could introduce an auxiliary column "metrics" that stores the metrics for each row, and then convert the frame to records with to_dict (or DataFrame.to_json):

df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}], 
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}

Here's a full example applying option 2:

import io
import json

import pandas as pd

data = io.StringIO("""account_id    name    timestamp   value
A0001C  Fund_1  1588618800000000000 1
B0001B  Dev_2   1601578800000000000 1""")

df = pd.read_csv(data, sep=r"\s+")

df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}], 
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))

Output:

{
    "records": [
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        }
    ]
}


Edit: The second approach also makes grouping by accounts (in case you want to do that) rather easy. Below is a small example and output:

import io
import json

import pandas as pd

data = io.StringIO("""account_id    name    timestamp   value
A0001C  Fund_1  1588618800000000000 1
A0001C  Fund_1  1588618900000000000 2
B0001B  Dev_2   1601578800000000000 1""")

df = pd.read_csv(data, sep=r"\s+")

# adding the metrics column as above
df["metrics"] = df.apply(lambda e: {"timestamp": e.timestamp,
                                    "value": e.value}, 
                         axis=1)

# group metrics by account
df_grouped = df.groupby(by=["name", "account_id"]).metrics.agg(list).reset_index()
records = df_grouped[["account_id", "name", "metrics"]].to_dict(orient="records")
payload = {"records": records}
print(json.dumps(payload, indent=4))

Output:

{
    "records": [
        {
            "account_id": "B0001B",
            "name": "Dev_2",
            "metrics": [
                {
                    "timestamp": 1601578800000000000,
                    "value": 1
                }
            ]
        },
        {
            "account_id": "A0001C",
            "name": "Fund_1",
            "metrics": [
                {
                    "timestamp": 1588618800000000000,
                    "value": 1
                },
                {
                    "timestamp": 1588618900000000000,
                    "value": 2
                }
            ]
        }
    ]
}
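
Since the question mentions that the JSON will be submitted to an API, here is a minimal sketch of posting the payload with the requests library; the endpoint URL and any authentication are placeholders, not taken from the question:

import requests

# Hypothetical endpoint; replace with the real API URL and any required auth headers.
API_URL = "https://example.com/api/records"

# requests serializes the dict to JSON for us via the json= keyword.
response = requests.post(API_URL, json=payload)
response.raise_for_status()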

Comments

Sim, thank you for your help and solution. One slight issue still is that the second account entry has both timestamp/value entries included, and the value seems to have increased. I only need the timestamp/value data relating to that account_id. Suggestions would be appreciated.
@GrahamMcD: If you use the code before the edit, that will just produce single (timestamp, value)-entries for each record.
Thank you again. I needed to add a '[' to the metrics line ("metrics": [) and now it works great. Thank you again for your help, much appreciated.
@GrahamMcD: You are right, sorry, didn't catch that.
It would be great to know how option 2 above could be made to work so as to only include the single instance of the timestamp. Just for knowledge, if you have any thoughts.
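
For reference on the last two comments: the pre-edit version of option 2 builds a single-entry metrics list per row, so each record only carries its own timestamp/value pair. A minimal sketch of that difference, reusing the three-row sample data from the edit:

# Pre-edit option 2: one single-entry metrics list per row, no grouping.
df["metrics"] = df.apply(lambda e: [{"timestamp": e.timestamp,
                                     "value": e.value}],
                         axis=1)
records = df[["account_id", "name", "metrics"]].to_dict(orient="records")
# With the three-row sample this yields three separate records: two for
# A0001C (one metric each) and one for B0001B.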
