Use json_normalize to normalize json with nested arrays

Question

I would like to normalize the following JSON:

[
    {
        "studentId": 1,
        "studentName": "James",
        "schools": [
            {
                "schoolId": 1,
                "classRooms": [
                    {
                        "classRoomId": {
                            "id": 1,
                            "floor": 2
                        }
                    },
                    {
                        "classRoomId": {
                            "id": 3
                        }
                    }
                ],
                "teachers": [
                    {
                        "teacherId": 1,
                        "teacherName": "Tom"
                    },
                    {
                        "teacherId": 2,
                        "teacherName": "Sarah"
                    }
                ]
            },
            {
                "schoolId": 2,
                "classRooms": [
                    {
                        "classRoomId": {
                            "id": 4
                        }
                    }
                ],
                "teachers": [
                    {
                        "teacherId": 1,
                        "teacherName": "Tom"
                    },
                    {
                        "teacherId": 2,
                        "teacherName": "Sarah"
                    },
                    {
                        "teacherId": 3,
                        "teacherName": "Tara"
                    }
                ]
            }
        ]
    }
]

And I would like to get the following table in Python (tabular form):

studentId  studentName  schoolId  classRoomId.id  classRoomId.floor  teacherId  teacherName
1          James        1         1               2                  1          Tom
1          James        1         1               2                  2          Sarah
1          James        1         3                                  1          Tom
1          James        1         3                                  2          Sarah
1          James        2         4                                  1          Tom
1          James        2         4                                  2          Sarah
1          James        2         4                                  3          Tara

I've tried to use Pandas json_normalize function like this:

df1 = json_normalize(test1, ["schools","teachers"], ["studentId", "studentName",["schools","teachers"]])
df2 = json_normalize(test1, ["schools","classRooms"], ["studentId", "studentName",["schools","classRooms"]])
df = pd.concat([df1,df2],axis=1)

But that doesn't give me the structure I need.

It doesn't have to be in pandas; any other Python library or code would do. Any help is appreciated. Thank you.

Does this question help? stackoverflow.com/questions/53643406/…
– alexbclay, Commented Aug 10, 2019 at 1:36

Code Different · Accepted Answer · 2019-08-12 14:14:01Z


Because classRooms and teachers form two different subtrees of the JSON, you will have to parse them twice:

import pandas as pd

# json_data is the parsed JSON from the question (a list of student dicts).
# pd.json_normalize is the top-level name; pd.io.json.json_normalize is deprecated since pandas 1.0.
classrooms = pd.json_normalize(json_data, ['schools', 'classRooms'], meta=[
    'studentId',
    'studentName',
    ['schools', 'schoolId']
])

teachers = pd.json_normalize(json_data, ['schools', 'teachers'], meta=[
    'studentId',
    ['schools', 'schoolId']
])

# Merge and rearrange the columns in the order of your sample output
result = classrooms.merge(teachers, on=['schools.schoolId', 'studentId'])[
    ['studentId', 'studentName', 'schools.schoolId', 'classRoomId.id', 'classRoomId.floor', 'teacherId', 'teacherName']
]
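
To see why the merge yields one row per classroom/teacher pair, here is a minimal self-contained sketch on a trimmed copy of the question's data (one school, two classrooms, two teachers). The variable names `meta` and `paired` are only illustrative.

```python
import pandas as pd

# Trimmed-down version of the question's data: one student, one school,
# two classrooms and two teachers.
json_data = [
    {
        "studentId": 1,
        "studentName": "James",
        "schools": [
            {
                "schoolId": 1,
                "classRooms": [
                    {"classRoomId": {"id": 1, "floor": 2}},
                    {"classRoomId": {"id": 3}},
                ],
                "teachers": [
                    {"teacherId": 1, "teacherName": "Tom"},
                    {"teacherId": 2, "teacherName": "Sarah"},
                ],
            }
        ],
    }
]

# The same key columns are carried along on both passes.
meta = ["studentId", ["schools", "schoolId"]]
classrooms = pd.json_normalize(json_data, ["schools", "classRooms"], meta=meta)
teachers = pd.json_normalize(json_data, ["schools", "teachers"], meta=meta)

# Every classroom row pairs with every teacher row that shares the same keys.
paired = classrooms.merge(teachers, on=["studentId", "schools.schoolId"])
print(len(classrooms), len(teachers), len(paired))  # 2 2 4
```

Because the merge keys are identical within a school, the join behaves like a per-school Cartesian product, which is what the desired table in the question looks like.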

edited Aug 12, 2019 at 14:14

answered Aug 10, 2019 at 2:28


2 Comments

Esfandiar Over a year ago

Perfect, that gave me the exact result. Thank you very much. Now I just need to come up with an algorithm to do it more generically. It seems like there should be something out there that would do this in a generic way.

ghukill Over a year ago

@Esfandiar, did you ever come up with a solution? I would agree, seems like this must be well established territory.
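
On the generic approach asked about in these comments: one possible direction, offered as a sketch and not as an established pandas feature, is a small recursive function that walks the structure and takes the Cartesian product of sibling arrays. `flatten_rows` is a hypothetical helper. It assumes that arrays contain only objects, and that array keys (`schools`, `teachers`) are dropped from column names while nested object keys (`classRoomId`) are kept as dotted prefixes. Columns with the same name in different arrays would overwrite each other.

```python
from itertools import product

import pandas as pd


def flatten_rows(obj, prefix=""):
    """Return a list of flat dict rows for a nested dict/list structure."""
    if isinstance(obj, list):
        # Items of one array become separate rows, one after another.
        rows = [row for item in obj for row in flatten_rows(item, prefix)]
        return rows or [{}]  # an empty array keeps the parent row
    if isinstance(obj, dict):
        rows = [{}]
        for key, value in obj.items():
            if isinstance(value, list):
                sub = flatten_rows(value, prefix)
            elif isinstance(value, dict):
                sub = flatten_rows(value, f"{prefix}{key}.")
            else:
                sub = [{f"{prefix}{key}": value}]
            # Sibling keys are combined; sibling arrays multiply the rows.
            rows = [{**left, **right} for left, right in product(rows, sub)]
        return rows
    raise TypeError(f"arrays of scalars are not handled in this sketch: {obj!r}")


data = [{
    "studentId": 1,
    "studentName": "James",
    "schools": [
        {
            "schoolId": 1,
            "classRooms": [{"classRoomId": {"id": 1, "floor": 2}}, {"classRoomId": {"id": 3}}],
            "teachers": [{"teacherId": 1, "teacherName": "Tom"}, {"teacherId": 2, "teacherName": "Sarah"}],
        },
        {
            "schoolId": 2,
            "classRooms": [{"classRoomId": {"id": 4}}],
            "teachers": [
                {"teacherId": 1, "teacherName": "Tom"},
                {"teacherId": 2, "teacherName": "Sarah"},
                {"teacherId": 3, "teacherName": "Tara"},
            ],
        },
    ],
}]

df = pd.DataFrame(flatten_rows(data))
print(df.shape)  # (7, 7)
```

The row count grows multiplicatively with every extra sibling array, so this only makes sense when the cross product is really what you want.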

Jeyes Unterwegs · Accepted Answer · 2025-05-07 07:19:41Z

Here is a generalized solution that applies json_normalize to the JSON arrays left in DataFrame cells after a first pd.json_normalize pass:

from typing import Optional
import pandas as pd

def explode_nested_json(
        first_level_df: pd.DataFrame,
        type2column: dict[str, str],
        type_column_name: Optional[str] = None,
        **kwargs
) -> pd.DataFrame:
    """ Explodes columns containing JSON arrays, joins them onto the other existing columns,
    and concatenates the resulting dataframes.

    Args:
        first_level_df:
            A DataFrame resulting from pd.json_normalize which has at least one column containing JSON arrays.
            The index needs to be unique.
        type2column:
            A mapping of an arbitrary type name to the column to be exploded using pd.json_normalize.
            The type names are only relevant when type_column_name is specified.
        type_column_name:
            Name of the indicator column that specifies from which exploded column a record comes.
        kwargs:
            Keyword arguments passed to pd.json_normalize when exploding any array.

    Returns:
        A Dataframe with one row per item in the JSON arrays in the specified columns.
        If type_column_name is specified, it comes with an indicator column of that name.
    """
    assert all(col in first_level_df.columns for col in type2column.values()), f"Not all columns specified in type2column are present: {first_level_df.columns}"
    assert first_level_df.index.is_unique, "Dataframe index needs to be unique. Please de-duplicate/reset."

    def explode_arrays(df: pd.DataFrame) -> Optional[pd.DataFrame]:
        """Takes a single-row DataFrame and normalizes each configured array column into rows."""
        dfs = {}
        row = df.iloc[0]
        for event_type, column in type2column.items():
            try:
                dfs[event_type] = pd.json_normalize(row[column], **kwargs)
            except Exception:
                # Skip cells that cannot be normalized (e.g. a missing value instead of an array)
                continue
        if not dfs:
            return None
        if type_column_name:
            return pd.concat(dfs, names=[type_column_name]).droplevel(-1)
        return pd.concat(dfs).droplevel(-1)

    drop_type_column = not bool(type_column_name)
    second_level_df = (
        first_level_df
        .groupby(level=0)
        .apply(explode_arrays)
        .reset_index(level=-1, drop=drop_type_column)
    )
    return (
        first_level_df
        .drop(columns=type2column.values())
        .join(second_level_df)
        .reset_index(drop=True)
    )

first = explode_nested_json(pd.json_normalize(test1), dict(school="schools"))
explode_nested_json(first, dict(class_room="classRooms", teacher="teachers"), type_column_name="record_type")

Output:

   studentId  studentName  schoolId  record_type  classRoomId.id  classRoomId.floor  teacherId  teacherName
0  1          James        1         class_room   1               2                  nan        nan
1  1          James        1         class_room   3               nan                nan        nan
2  1          James        1         teacher      nan             nan                1          Tom
3  1          James        1         teacher      nan             nan                2          Sarah
4  1          James        2         class_room   4               nan                nan        nan
5  1          James        2         teacher      nan             nan                1          Tom
6  1          James        2         teacher      nan             nan                2          Sarah
7  1          James        2         teacher      nan             nan                3          Tara
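
Note that this output is stacked: classrooms and teachers sit in separate rows, whereas the question asks for one row per classroom/teacher pair. A possible way to get from one to the other, sketched under the assumption that the columns are named as in the output above, is to split by record_type and merge the two halves on the shared keys. The `stacked` frame below is typed in by hand to mirror that output, and `keys` and `wide` are illustrative names.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hand-typed copy of the stacked output shown above.
stacked = pd.DataFrame(
    [
        (1, "James", 1, "class_room", 1, 2, nan, nan),
        (1, "James", 1, "class_room", 3, nan, nan, nan),
        (1, "James", 1, "teacher", nan, nan, 1, "Tom"),
        (1, "James", 1, "teacher", nan, nan, 2, "Sarah"),
        (1, "James", 2, "class_room", 4, nan, nan, nan),
        (1, "James", 2, "teacher", nan, nan, 1, "Tom"),
        (1, "James", 2, "teacher", nan, nan, 2, "Sarah"),
        (1, "James", 2, "teacher", nan, nan, 3, "Tara"),
    ],
    columns=[
        "studentId", "studentName", "schoolId", "record_type",
        "classRoomId.id", "classRoomId.floor", "teacherId", "teacherName",
    ],
)

keys = ["studentId", "studentName", "schoolId"]

# Keep only the columns that belong to each record type.
class_rooms = stacked.loc[
    stacked["record_type"] == "class_room",
    keys + ["classRoomId.id", "classRoomId.floor"],
]
teachers = stacked.loc[
    stacked["record_type"] == "teacher",
    keys + ["teacherId", "teacherName"],
]

# Pair every classroom with every teacher of the same student and school.
wide = class_rooms.merge(teachers, on=keys)
print(len(wide))  # 7
```

The id columns come out as floats here because they shared a column with missing values in the stacked frame; cast them back with astype if integer ids matter.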
