
I am creating a nested JSON structure and storing it in a list object. Here is my code, which produces the hierarchical JSON as intended.

Sample Data:


datasource,datasource_cnt,category,category_cnt,subcategory,subcategory_cnt
Bureau of Labor Statistics,44,Employment and wages,44,Employment and wages,44

import pandas as pd

df = pd.read_csv('queryhive16273.csv')

def split_df(df):
    # Top level: one entry per (datasource, datasource_cnt) pair
    for (vendor, count), df_vendor in df.groupby(["datasource", "datasource_cnt"]):
        yield {
            "vendor_name": vendor,
            "count": count,
            "categories": list(split_category(df_vendor)),
        }

def split_category(df_vendor):
    # Second level: categories within a datasource
    for (category, count), df_category in df_vendor.groupby(
        ["category", "category_cnt"]
    ):
        yield {
            "name": category,
            "count": count,
            "subCategories": list(split_subcategory(df_category)),
        }

def split_subcategory(df_category):
    # Third level: subcategories within a category
    for (subcategory, count), df_subcategory in df_category.groupby(
        ["subcategory", "subcategory_cnt"]
    ):
        yield {
            "count": count,
            "name": subcategory,
        }

abc = list(split_df(df))

abc contains the data shown below, which is the intended result.

[{
    'count': 44,
    'vendor_name': 'Bureau of Labor Statistics',
    'categories': [{
        'count': 44,
        'name': 'Employment and wages',
        'subCategories': [{
            'count': 44,
            'name': 'Employment and wages'
        }]
    }]
}]

Now I am trying to store it in a JSON file.

with open('your_file2.json', 'w') as f:
    for item in abc:
        f.write("%s\n" % item)
        # f.write(abc)

Here comes the issue. This writes the data in the fashion shown below, which is not valid JSON. If I try to use json.dump instead, it gives a JSON serialization error.

Could you please help me out here?

{
    'count': 44,
    'vendor_name': 'Bureau of Labor Statistics',
    'categories': [{
        'count': 44,
        'name': 'Employment and wages',
        'subCategories': [{
            'count': 44,
            'name': 'Employment and wages'
        }]
    }]
}
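For reference, the json.dump version of the write looks roughly like this (the exact call may differ slightly from what I tried); it is what raises the serialization error:

import json

with open('your_file2.json', 'w') as f:
    # raises TypeError: Object of type 'int64' is not JSON serializable
    json.dump(abc, f, indent=2)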

Expected Result:

[{
    "count": 44,
    "vendor_name": "Bureau of Labor Statistics",
    "categories": [{
        "count": 44,
        "name": "Employment and wages",
        "subCategories": [{
            "count": 44,
            "name": "Employment and wages"
        }]
    }]
}]
  • Do not write JSON by yourself (don't reinvent the wheel); it is not a good idea, use an encoder instead. In this case the output is invalid because Python's repr prints single quotes instead of the double quotes JSON requires. Commented Dec 5, 2018 at 7:35
  • Is there an answer that fits your request? If so, you should mark it as accepted. Commented Dec 5, 2018 at 10:11

2 Answers


Using your data with the standard library json module gives me:

TypeError: Object of type 'int64' is not JSON serializable

This just means a numpy object is living in your nested structure, and the json encoder does not know how to serialize it.
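As a quick check (a sketch; it assumes the abc list built by the question's split_df), the grouped counts come back as numpy scalars rather than plain Python ints:

import numpy as np

# The groupby keys carry the dtype of the *_cnt columns, so the counts are
# numpy scalars, which the standard json encoder cannot serialize.
print(type(abc[0]["count"]))                    # <class 'numpy.int64'>
print(isinstance(abc[0]["count"], np.integer))  # True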

Forcing the encoder to fall back to string conversion for objects it cannot handle itself is enough to make your code work:

import io
import json
import pandas as pd

# Rebuild the sample data from the question
d = io.StringIO(
    "datasource,datasource_cnt,category,category_cnt,subcategory,subcategory_cnt\n"
    "Bureau of Labor Statistics,44,Employment and wages,44,Employment and wages,44"
)
df = pd.read_csv(d)

abc = list(split_df(df))  # split_df as defined in the question

# default=str stringifies anything the encoder cannot handle itself
json.dumps(abc, default=str)

It returns valid JSON (but with the ints converted to strings):

'[{"vendor_name": "Bureau of Labor Statistics", "count": "44", "categories": [{"name": "Employment and wages", "count": "44", "subCategories": [{"count": "44", "name": "Employment and wages"}]}]}]'

If that does not suit your needs, use a dedicated encoder:

import numpy as np
class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.int64):
            return int(obj)
        return json.JSONEncoder.default(self, obj)

json.dumps(abc, cls=MyEncoder)

This returns the requested JSON:

'[{"vendor_name": "Bureau of Labor Statistics", "count": 44, "categories": [{"name": "Employment and wages", "count": 44, "subCategories": [{"count": 44, "name": "Employment and wages"}]}]}]'

Another option is to directly convert your data before encoding:

def split_category(df_vendor):
    for (category, count), df_category in df_vendor.groupby(
        ["category", "category_cnt"]
    ):
        yield {
            "name": category,
            "count": int(count),  # cast here, before encoding
            "subCategories": list(split_subcategory(df_category)),
        }
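If you want the result in a file rather than a string, json.dump takes the same arguments (a sketch reusing the MyEncoder class above and the file name from the question):

with open('your_file2.json', 'w') as f:
    # json.dump accepts the same keyword arguments as json.dumps
    json.dump(abc, f, cls=MyEncoder, indent=2)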

2 Comments

How do I write this to a text file? Could you please add that json.dump line to your answer?
@ShankarPanda Just use dump instead of dumps, as @Buran did in their answer.
import json

data = [{
    'count': 44,
    'vendor_name': 'Bureau of Labor Statistics',
    'categories': [{
        'count': 44,
        'name': 'Employment and wages',
        'subCategories': [{
            'count': 44,
            'name': 'Employment and wages'
        }]
    }]
}]

with open('your_file2.json', 'w') as f:
    json.dump(data, f, indent=2)

produces a valid JSON file:

[
  {
    "count": 44,
    "vendor_name": "Bureau of Labor Statistics",
    "categories": [
      {
        "count": 44,
        "name": "Employment and wages",
        "subCategories": [
          {
            "count": 44,
            "name": "Employment and wages"
          }
        ]
      }
    ]
  }
]

5 Comments

This will not work with the actual dataset, because it contains numpy.int64 rather than int within its structure. You skipped that part by writing the data as a plain Python structure instead of reading it from the CSV.
@jlandercy I am using what the OP provided as the abc value in their post. They say "abc contains the data shown below, which is the intended result." I don't see where you get any other dataset. Clearly their problem is that they iterate over the list and write each element as text to a plain text file, producing invalid JSON.
Look at my answer: I did find the relevant data in the OP. Your solution will not work with their dataset; copy-paste the StringIO snippet and you will be able to reproduce the issue. This does not mean your answer is wrong, it just will not solve the OP's issue.
@jlandercy, I see it now: it's because they use pandas to read the CSV into a DataFrame. They can convert count to int when yielding, which should solve the issue.
Yes, that is where the numpy.int64 comes from, as already suggested. Have a good day.
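For completeness, a sketch of that last suggestion (a hypothetical tweak to the question's generators, not code from either answer): cast the numpy counts to plain int while yielding, and the plain json.dump(data, f, indent=2) call from this answer then works unchanged.

def split_df(df):
    for (vendor, count), df_vendor in df.groupby(["datasource", "datasource_cnt"]):
        yield {
            "vendor_name": vendor,
            "count": int(count),  # plain int, so the standard json encoder can handle it
            "categories": list(split_category(df_vendor)),
        }

# apply the same int(count) cast in split_category and split_subcategory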
