
I am trying to build a DataFrame using pandas, but I am not able to handle the case where the JSON chunks I receive vary in size.

e.g., 1st chunk:

{'ad': 0,
 'country': 'US',
 'ver': '1.0',
 'adIdType': 2,
 'adValue': '5',
 'data': {'eventId': 99,
  'clickId': '',
  'eventType': 'PURCHASEMADE',
  'tms': '2019-12-25T09:57:04+0000',
  'productDetails': {'currency': 'DLR',
   'productList': [
    {'segment': 'Girls',
     'vertical': 'Fashion Jewellery',
     'brickname': 'Traditional Jewellery',
     'price': 8,
     'quantity': 10}]},
  'transactionId': '1254'},
 'appName': 'xer.tt',
 'appId': 'XR',
 'sdkVer': '1.0.0',
 'language': 'en',
 'tms': '2022-04-25T09:57:04+0000',
 'tid': '124'}

2nd chunk:

{'ad': 0,
 'country': 'US',
 'ver': '1.0',
 'adIdType': 2,
 'adValue': '78',
 'data': {'eventId': 7,
  'clickId': '',
  'eventType': 'PURCHASEMADE',
  'tms': '20219-02-25T09:57:04+0000',
  'productDetails': {'currency': 'DLR',
   'productList': [{'segment': 'Boys',
     'vertical': 'Fashion',
     'brickname': 'Casuals',
     'price': 10,
     'quantity': 5},
    {'segment': 'Girls',
     'vertical': 'Fashion Jewellery',
     'brickname': 'Traditional Jewellery',
     'price': 8,
     'quantity': 10}]},
  'transactionId': '3258'},
 'appName': 'xer.tt',
 'appId': 'XR',
 'sdkVer': '1.0.0',
 'language': 'en',
 'tms': '2029-02-25T09:57:04+0000',
 'tid': '124'}

Now, in productDetails the number of products changes: the first chunk lists only one product and its details, but the second chunk lists two products and their details, and further chunks can contain ANY number of products. (i.e. chunks ~ records)

I tried writing some Python scripts to do this but was not able to come up with a good solution.

PS: If any further detail is required please let me know in the comments.

Thanks!

  • What does your desired dataframe look like? Commented Apr 29, 2022 at 6:16
  • @Joe I am still thinking about it, because if people have this type of data, how do they manage it and do further preprocessing? I am thinking of something like: [other_cols] [product_1] [product_2] [product_3] and so on, but this may be a bad approach because most of the values would likely end up NaN. Joe, is there some other library or utility in Python that can handle this type of data in another way? I am totally unfamiliar with this type of data. PS: I need to build a recommendation engine using ML, and for that I need to parse this. Commented Apr 29, 2022 at 6:28

1 Answer


What you can do is use pd.json_normalize with the innermost list as your record_path and all other data you are interested in as your meta. Here is an in-depth example of how you could construct that: pandas.io.json.json_normalize with very nested json

In your case, that would for example be (for a single object):

df = pd.json_normalize(obj,
                       record_path=["data", "productDetails", "productList"],
                       meta=[
                           ["data", "productDetails", "currency"],
                           ["data", "transactionId"],
                           ["data", "clickId"],
                           ["data", "eventType"],
                           ["data", "tms"],
                           "ad",
                           "country",
                       ])
 
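Since each chunk is just a record, you can also pass the whole list of chunks to pd.json_normalize in a single call and get one row per product, regardless of how many products each chunk contains. A minimal runnable sketch (the chunks below are abbreviated versions of the objects in the question):

```python
import pandas as pd

# Abbreviated chunks, keeping only the fields relevant to the flattening.
chunks = [
    {'ad': 0, 'country': 'US',
     'data': {'transactionId': '1254',
              'productDetails': {'currency': 'DLR',
                                 'productList': [
                                     {'segment': 'Girls', 'price': 8, 'quantity': 10}]}}},
    {'ad': 0, 'country': 'US',
     'data': {'transactionId': '3258',
              'productDetails': {'currency': 'DLR',
                                 'productList': [
                                     {'segment': 'Boys', 'price': 10, 'quantity': 5},
                                     {'segment': 'Girls', 'price': 8, 'quantity': 10}]}}},
]

# json_normalize accepts a list of objects, so all chunks are flattened
# in one call; each entry of productList becomes its own row, with the
# meta fields repeated on every row of its chunk.
df = pd.json_normalize(
    chunks,
    record_path=["data", "productDetails", "productList"],
    meta=[
        ["data", "productDetails", "currency"],
        ["data", "transactionId"],
        "ad",
        "country",
    ],
)

print(df)  # one row per product (3 rows for the two chunks above)
```

Note that meta paths given as lists become dotted column names, e.g. "data.transactionId", which you can rename afterwards if needed.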

1 Comment

Thanks a lot, @braml1, looking into it. Thanks for your time.
