Flattening a nested JSON file to a DataFrame using pandas json_normalize

Question

I have a big JSON data file and I want to convert it into tabular form. I am trying to flatten the data into a dataframe using json_normalize. So far I have this:

code so far

I want to further flatten the submissions and products data into columns. I tried this:

submission_data = pd.json_normalize(data=rawData['results'], record_path=rawData['results']['submissions'], meta=['application_number', 'sponsor_name'], errors='ignore')
submission_data.head(3)

But I am getting error saying: TypeError: list indices must be integers or slices, not str

Any input on this would be helpful.

Lourenço Monteiro Rodrigues · Accepted Answer · 2023-11-03 15:59:15Z


As submissions and products are lists (and not objects with a regular structure), json_normalize will leave them untouched. Also, given that they are lists, can you make sure that they always have the same number of elements in each record? If not, distributing them across columns makes no sense. If submissions and products are pairs (i.e. if every submission corresponds to one product), you can consider distributing them along rows instead (a melting-dataframe strategy).
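To illustrate the row-wise option, here is a minimal sketch using DataFrame.explode. The records and the inner field names (type, brand) are made up for illustration, and it assumes the two lists have the same length in every record, which multi-column explode (pandas 1.3 or later) requires.

```python
import pandas as pd

# Hypothetical records: each application has paired submissions and products.
records = [
    {"application_number": "A1",
     "submissions": [{"type": "ORIG"}, {"type": "SUPPL"}],
     "products": [{"brand": "X"}, {"brand": "Y"}]},
    {"application_number": "A2",
     "submissions": [{"type": "ORIG"}],
     "products": [{"brand": "Z"}]},
]

df = pd.DataFrame(records)

# One output row per (submission, product) pair; the parent key is repeated.
# Both list columns must have equal lengths within each row.
long_df = df.explode(["submissions", "products"], ignore_index=True)

print(long_df)
```

Each cell in the exploded columns still holds a single dictionary, so those columns can be flattened afterwards, for example with pd.json_normalize on the column converted to a list.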

Finally, regarding the error: raw_data seems to be a list of objects that each contain a 'results' field. That means you cannot retrieve raw_data['results'] directly; you need raw_data[0]['results'] to get the results from the first object.
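The error can be reproduced in isolation. This is only a sketch: the raw_data below is a hypothetical stand-in that assumes the structure described above (a list whose first element holds 'results'), not the actual file.

```python
# Hypothetical stand-in for the parsed JSON: a list with one object holding 'results'.
raw_data = [
    {"results": [{"application_number": "A1", "sponsor_name": "S1"}]}
]

try:
    raw_data["results"]  # a list cannot be indexed with a string key
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Indexing the list by position first reaches the object, then the field.
first_results = raw_data[0]["results"]
print(first_results[0]["application_number"])
```

The same TypeError appears whenever any list in the chain is indexed with a string, so it is worth checking type(...) at each level of the real data.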

Adding a solution proposition

Given your data structure, what I would do is the following:

1. Normalize the raw_data as you do in the notebook.
2. For each line of the resulting dataframe:
   a. Normalize the JSON in the 'submissions' field.
   b. Rename the columns of that resulting dataframe to 'submissions.<column_name>'.
   c. Add a column whose value is the application number of the line you are evaluating.
   d. Append that resulting dataframe to a list, collecting all such dataframes.
3. Concatenate those dataframes.
4. Merge the original dataframe with the concatenated one using 'application_number' as the key, and drop the submissions column.

Repeat the process for the 'products'; however, unless you know the relationship between submissions and products, there is no clear way of merging the dataframes you get:

- If they have no relationship except for being under the same application number, you basically get separate datasets for each.
- If there is a one-to-one relationship, you can just merge them by index (concatenate each line).

In code:

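import pandas as pd

# Top-level normalization, as in the question's notebook
df = pd.json_normalize(raw_data)

submissions = []
products = []

for _, line in df.iterrows():
    # Flatten this record's submissions and tag them with the parent key
    temp_df_sub = pd.json_normalize(line['submissions'])
    temp_df_sub.columns = [f'submissions.{col}' for col in temp_df_sub.columns]
    temp_df_sub['application_number'] = line['application_number']
    submissions.append(temp_df_sub)

    # Same for this record's products
    temp_df_prod = pd.json_normalize(line['products'])
    temp_df_prod.columns = [f'products.{col}' for col in temp_df_prod.columns]
    temp_df_prod['application_number'] = line['application_number']
    products.append(temp_df_prod)

# ignore_index gives both frames a clean 0..n-1 index so they align row by row
submissions_df = pd.concat(submissions, ignore_index=True)
products_df = pd.concat(products, ignore_index=True)

# The nested lists are flattened now, so drop them from the top-level frame
df = df.drop(columns=['submissions', 'products'])


# if one-to-one relationship between submissions and products
# (drop the duplicate key column before concatenating side by side)
sub_prod_df = pd.concat([submissions_df, products_df.drop(columns='application_number')], axis=1)
final_df = df.merge(sub_prod_df, on='application_number')


# if no relationship
final_sub_df = submissions_df.merge(df, on='application_number')
final_prod_df = products_df.merge(df, on='application_number')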
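
As a possible alternative to the explicit loop, json_normalize can do the per-record flattening itself through its record_path, meta and record_prefix parameters. The sketch below is not the code from the question: the results list and the inner field names (submission_type, brand_name) are hypothetical, and it assumes every record contains both the 'submissions' and 'products' keys.

```python
import pandas as pd

# Hypothetical records; only application_number, sponsor_name, submissions
# and products come from the question, the inner fields are made up.
results = [
    {"application_number": "A1", "sponsor_name": "S1",
     "submissions": [{"submission_type": "ORIG"}, {"submission_type": "SUPPL"}],
     "products": [{"brand_name": "X"}]},
    {"application_number": "A2", "sponsor_name": "S2",
     "submissions": [{"submission_type": "ORIG"}],
     "products": [{"brand_name": "Y"}, {"brand_name": "Z"}]},
]

meta = ["application_number", "sponsor_name"]

# record_path is the key name of the nested list, not the nested data itself.
submissions_df = pd.json_normalize(
    results, record_path="submissions", meta=meta, record_prefix="submissions."
)
products_df = pd.json_normalize(
    results, record_path="products", meta=meta, record_prefix="products."
)

print(submissions_df)
print(products_df)
```

This yields one row per submission (or product) with the parent fields repeated, which corresponds to the "no relationship" case: two separate tables that share application_number as a key.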

edited Nov 3, 2023 at 15:59

answered Nov 3, 2023 at 14:24


3 Comments

Nano Over a year ago

Thanks. Yes, the lists always have the same number of elements. These lists contain dictionaries that I want to extract into table columns alongside the other data. The structure of the submissions is as follows:

Lourenço Monteiro Rodrigues Over a year ago

And are submissions and products related in a one-to-one fashion?

Lourenço Monteiro Rodrigues Over a year ago

But anyway, you would then have the same problem with application_docs, which is also a list of objects; is this one also guaranteed to have the same number of elements in every instance?
