1

I am relatively new to JSON and after going through several tutorials I have decided to reach out here for additional advice. I am trying to convert the following JSON dictionary via Jupyternotebook/Pandas into a dataframe and later into a csv file (data is extracted from a public API).

Example JSON dictionary without indentation:

json_search_result = 
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

With indentation for better visibility:

json_search_result = 
{'inactive_count': 0, 'links': {'self': '/company/08892216/officers'},
 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 
'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': 
[{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': 
'/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 
'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'},
 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 
'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

I have tried among others json_normalize but I either get errors given this is a dictionary:

df2 =  pd.json_normalize(json_search_result)

Thank you for any advice on this issues!

2 Answers 2

1

I assume that you will also want to flatten your dataframe since it contains additional nested information. So start by transforming you json string to a dataframe:

json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}
A = pd.DataFrame.from_dict(json_search_result, orient='index').T

which gives:

inactive_count                                   links  \
0              0  {'self': '/company/08892216/officers'}   

                                       etag          kind resigned_count  \
0  1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8  officer-list              0   

  total_results active_count items_per_page  \
0             1            1             35   

                                               items start_index  
0  [{'date_of_birth': {'year': 1964, 'month': 2},...           0  
​

Now, this function will flatten ANY dataframe with nested json strings:


def flatten_nested_json_df(df):
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()

    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            exploded.index = df.index
            df = pd.concat([df, exploded], axis=1).drop(columns=[col])
            new_columns.extend(exploded.columns)  # inplace

        for col in list_columns:
            # print(f"exploding: {col}")
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

Applying it:

flatten_nested_json_df(A)

will result in:

  index inactive_count                                      etag  \
0      0              0  1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8   

           kind resigned_count total_results active_count items_per_page  \
0  officer-list              0             1            1             35   

  start_index                  links.self  ...  items.occupation  \
0           0  /company/08892216/officers  ...  Company Director   

  items.date_of_birth.year items.date_of_birth.month  \
0                     1964                         2   

                                    items.links.self  \
0  /company/08892216/appointments/Ao1-pIsFQHy0FVS...   

                    items.links.officer.appointments items.address.premises  \
0  /officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointm...                     19   

   items.address.locality  items.address.country items.address.address_line_1  \
0                  London         United Kingdom                 Norcott Road   

  items.address.postal_code  
0                   N16 7EJ 

As you can see, the nested fields have been made to columns:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 25 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   index                             1 non-null      int64 
 1   inactive_count                    1 non-null      object
 2   etag                              1 non-null      object
 3   kind                              1 non-null      object
 4   resigned_count                    1 non-null      object
 5   total_results                     1 non-null      object
 6   active_count                      1 non-null      object
 7   items_per_page                    1 non-null      object
 8   start_index                       1 non-null      object
 9   links.self                        1 non-null      object
 10  items.officer_role                1 non-null      object
 11  items.appointed_on                1 non-null      object
 12  items.nationality                 1 non-null      object
 13  items.country_of_residence        1 non-null      object
 14  items.name                        1 non-null      object
 15  items.occupation                  1 non-null      object
 16  items.date_of_birth.year          1 non-null      int64 
 17  items.date_of_birth.month         1 non-null      int64 
 18  items.links.self                  1 non-null      object
 19  items.links.officer.appointments  1 non-null      object
 20  items.address.premises            1 non-null      object
 21  items.address.locality            1 non-null      object
 22  items.address.country             1 non-null      object
 23  items.address.address_line_1      1 non-null      object
 24  items.address.postal_code         1 non-null      object
dtypes: int64(3), object(22)
memory usage: 328.0+ bytes
None
Sign up to request clarification or add additional context in comments.

Comments

0

Please try this if this helps your problem. Good reference for further improvements: Convert list of dictionaries to a pandas DataFrame

json_search_result = {'inactive_count': 0, 'links': {'self': '/company/08892216/officers'}, 'etag': '1ccc4c74454bc359a9b9ef686d11e70cc03b6cb8', 'kind': 'officer-list', 'resigned_count': 0, 'total_results': 1, 'active_count': 1, 'items_per_page': 35, 'items': [{'date_of_birth': {'year': 1964, 'month': 2}, 'links': {'self': '/company/08892216/appointments/Ao1-pIsFQHy0FVSv26y6ep4Qhyw', 'officer': {'appointments': '/officers/JpJgO2pKVmGq5VHD7j7lluLW-jo/appointments'}}, 'officer_role': 'director', 'address': {'premises': '19', 'locality': 'London', 'country': 'United Kingdom', 'address_line_1': 'Norcott Road', 'postal_code': 'N16 7EJ'}, 'appointed_on': '2014-02-12', 'nationality': 'Irish', 'country_of_residence': 'England', 'name': 'FLOOD, Eileen', 'occupation': 'Company Director'}], 'start_index': 0}

pprint(json_search_result)

out = pd.DataFrame.from_dict(json_search_result, orient='index').T
print(out)
print(out.columns)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.