
I am trying to build a DataFrame using pandas, but I am not able to handle the case where the JSON chunks I receive vary in size.

e.g., 1st chunk:

{'ad': 0,
 'country': 'US',
 'ver': '1.0',
 'adIdType': 2,
 'adValue': '5',
 'data': {'eventId': 99,
  'clickId': '',
  'eventType': 'PURCHASEMADE',
  'tms': '2019-12-25T09:57:04+0000',
  'productDetails': {'currency': 'DLR',
   'productList': [
    {'segment': 'Girls',
     'vertical': 'Fashion Jewellery',
     'brickname': 'Traditional Jewellery',
     'price': 8,
     'quantity': 10}]},
  'transactionId': '1254'},
 'appName': 'xer.tt',
 'appId': 'XR',
 'sdkVer': '1.0.0',
 'language': 'en',
 'tms': '2022-04-25T09:57:04+0000',
 'tid': '124'}

2nd chunk:

{'ad': 0,
 'country': 'US',
 'ver': '1.0',
 'adIdType': 2,
 'adValue': '78',
 'data': {'eventId': 7,
  'clickId': '',
  'eventType': 'PURCHASEMADE',
  'tms': '20219-02-25T09:57:04+0000',
  'productDetails': {'currency': 'DLR',
   'productList': [{'segment': 'Boys',
     'vertical': 'Fashion',
     'brickname': 'Casuals',
     'price': 10,
     'quantity': 5},
    {'segment': 'Girls',
     'vertical': 'Fashion Jewellery',
     'brickname': 'Traditional Jewellery',
     'price': 8,
     'quantity': 10}]},
  'transactionId': '3258'},
 'appName': 'xer.tt',
 'appId': 'XR',
 'sdkVer': '1.0.0',
 'language': 'en',
 'tms': '2029-02-25T09:57:04+0000',
 'tid': '124'}

Now, in productDetails the number of products changes: the first chunk lists only one product and its details, but the second chunk lists two products and their details, and further chunks can contain ANY number of products. (i.e. chunks ~ records)

I tried writing some Python scripts to do this but was not able to come up with a good solution.

PS: If any further detail is required please let me know in the comments.

Thanks!

  • What does your desired dataframe look like? Commented Apr 29, 2022 at 6:16
  • @Joe I am still thinking about it, because if people have this type of data, how do they manage it and do further preprocessing? I am thinking of something like: [other_cols] [product_1] [product_2] [product_3] and so on, but this may be a bad approach because most of the values would likely end up NaN. Joe, is there some other library or utility in Python that can handle this type of data in another way? I am totally unfamiliar with this type of data. PS: I need to build a recommendation engine using ML, and for that I need to parse this. Commented Apr 29, 2022 at 6:28

1 Answer


What you can do is use pd.json_normalize with the innermost list as your record_path and all other data you are interested in as your meta. Here is an in-depth example of how you could construct that: pandas.io.json.json_normalize with very nested json

In your case, that would for example be (for a single object):

df = pd.json_normalize(obj,
                       record_path=["data", "productDetails", "productList"],
                       meta=[
                           ["data", "productDetails", "currency"],
                           ["data", "transactionId"],
                           ["data", "clickId"],
                           ["data", "eventType"],
                           ["data", "tms"],
                           "ad",
                           "country",
                       ])
 
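Since each chunk is just a record, you can also pass the whole list of chunks to pd.json_normalize in a single call and get one row per product, regardless of how many products each chunk contains. A minimal runnable sketch (the chunks below are abbreviated versions of the objects in the question):

```python
import pandas as pd

# Abbreviated chunks, keeping only the fields relevant to the flattening.
chunks = [
    {'ad': 0, 'country': 'US',
     'data': {'transactionId': '1254',
              'productDetails': {'currency': 'DLR',
                                 'productList': [
                                     {'segment': 'Girls', 'price': 8, 'quantity': 10}]}}},
    {'ad': 0, 'country': 'US',
     'data': {'transactionId': '3258',
              'productDetails': {'currency': 'DLR',
                                 'productList': [
                                     {'segment': 'Boys', 'price': 10, 'quantity': 5},
                                     {'segment': 'Girls', 'price': 8, 'quantity': 10}]}}},
]

# json_normalize accepts a list of objects, so all chunks are flattened
# in one call; each entry of productList becomes its own row, with the
# meta fields repeated on every row of its chunk.
df = pd.json_normalize(
    chunks,
    record_path=["data", "productDetails", "productList"],
    meta=[
        ["data", "productDetails", "currency"],
        ["data", "transactionId"],
        "ad",
        "country",
    ],
)

print(df)  # one row per product (3 rows for the two chunks above)
```

Note that meta paths given as lists become dotted column names, e.g. "data.transactionId", which you can rename afterwards if needed.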

1 Comment

Thanks a lot, @braml1, looking into it. Thanks for your time.
