I would like to normalize the following JSON:

[
    {
        "studentId": 1,
        "studentName": "James",
        "schools": [
            {
                "schoolId": 1,
                "classRooms": [
                    {
                        "classRoomId": {
                            "id": 1,
                            "floor": 2
                        }
                    },
                    {
                        "classRoomId": {
                            "id": 3
                        }
                    }
                ],
                "teachers": [
                    {
                        "teacherId": 1,
                        "teacherName": "Tom"
                    },
                    {
                        "teacherId": 2,
                        "teacherName": "Sarah"
                    }
                ]
            },
            {
                "schoolId": 2,
                "classRooms": [
                    {
                        "classRoomId": {
                            "id": 4
                        }
                    }
                ],
                "teachers": [
                    {
                        "teacherId": 1,
                        "teacherName": "Tom"
                    },
                    {
                        "teacherId": 2,
                        "teacherName": "Sarah"
                    },
                    {
                        "teacherId": 3,
                        "teacherName": "Tara"
                    }
                ]
            }
        ]
    }
]

And I would like to get the following table in Python (tabular form):

studentId  studentName  schoolId  classRoomId.id  classRoomId.floor  teacherId  teacherName
1          James        1         1               2                  1          Tom
1          James        1         1               2                  2          Sarah
1          James        1         3                                  1          Tom
1          James        1         3                                  2          Sarah
1          James        2         4                                  1          Tom
1          James        2         4                                  2          Sarah
1          James        2         4                                  3          Tara

I've tried to use Pandas json_normalize function like this:

df1 = json_normalize(test1, ["schools","teachers"], ["studentId", "studentName",["schools","teachers"]])
df2 = json_normalize(test1, ["schools","classRooms"], ["studentId", "studentName",["schools","classRooms"]])
df = pd.concat([df1,df2],axis=1)

But that doesn't give me the structure I need.

It doesn't have to be in Pandas; any other library or code in Python would do. Any help is appreciated. Thank you.

2 Answers

Because classRooms and teachers form two different subtrees of the JSON, you will have to parse them twice:

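import pandas as pd

# json_data is the parsed list from the question (called test1 there).
# pd.json_normalize is the top-level name since pandas 1.0; the older
# pd.io.json.json_normalize spelling is deprecated.
classrooms = pd.json_normalize(json_data, ['schools', 'classRooms'], meta=[
    'studentId',
    'studentName',
    ['schools', 'schoolId']
])

teachers = pd.json_normalize(json_data, ['schools', 'teachers'], meta=[
    'studentId',
    ['schools', 'schoolId']
])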

# Merge and rearrange the columns in the order of your sample output
classrooms.merge(teachers, on=['schools.schoolId', 'studentId']) \
    [['studentId', 'studentName', 'schools.schoolId', 'classRoomId.id', 'classRoomId.floor', 'teacherId', 'teacherName']]
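
To check the approach end to end, here is a minimal self-contained sketch that runs it on the sample data from the question. It assumes pandas 1.0 or later (for `pd.json_normalize`); the `result` name and the final rename of `schools.schoolId` to `schoolId` are my own additions to match the requested header. The merge on the shared keys is a many-to-many join, which is what pairs every classroom with every teacher of the same school.

```python
import pandas as pd

# Sample data from the question, written compactly.
json_data = [{
    "studentId": 1,
    "studentName": "James",
    "schools": [
        {
            "schoolId": 1,
            "classRooms": [{"classRoomId": {"id": 1, "floor": 2}},
                           {"classRoomId": {"id": 3}}],
            "teachers": [{"teacherId": 1, "teacherName": "Tom"},
                         {"teacherId": 2, "teacherName": "Sarah"}],
        },
        {
            "schoolId": 2,
            "classRooms": [{"classRoomId": {"id": 4}}],
            "teachers": [{"teacherId": 1, "teacherName": "Tom"},
                         {"teacherId": 2, "teacherName": "Sarah"},
                         {"teacherId": 3, "teacherName": "Tara"}],
        },
    ],
}]

# One pass per subtree; the meta columns carry the parent keys down.
classrooms = pd.json_normalize(
    json_data, ["schools", "classRooms"],
    meta=["studentId", "studentName", ["schools", "schoolId"]],
)
teachers = pd.json_normalize(
    json_data, ["schools", "teachers"],
    meta=["studentId", ["schools", "schoolId"]],
)

# Many-to-many join on the shared parent keys, then reorder and rename
# (the rename is an illustrative addition, not part of the answer above).
result = (
    classrooms.merge(teachers, on=["schools.schoolId", "studentId"])
    [["studentId", "studentName", "schools.schoolId", "classRoomId.id",
      "classRoomId.floor", "teacherId", "teacherName"]]
    .rename(columns={"schools.schoolId": "schoolId"})
)
print(result)
```

Classrooms without a `floor` get NaN in `classRoomId.floor`, which corresponds to the blank cells in the requested table.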

2 Comments

Perfect, that gave me the exact result. Thank you very much. Now I just need to come up with an algorithm to do it more generically. It seems like there should be something out there that would do this in a generic way.
@Esfandiar, did you ever come up with a solution? I would agree, seems like this must be well established territory.

Here is a generalized solution to json_normalize the JSON arrays present in dataframe cells after applying pd.json_normalize:

from typing import Optional
import pandas as pd

def explode_nested_json(
        first_level_df: pd.DataFrame,
        type2column: dict[str, str],
        type_column_name: Optional[str] = None,
        **kwargs
) -> pd.DataFrame:
    """ Explodes columns containing JSON arrays, joins them onto the other existing columns,
    and concatenates the resulting dataframes.

    Args:
        first_level_df:
            A DataFrame resulting from pd.json_normalize which has at least one column containing JSON arrays.
            The index needs to be unique.
        type2column:
            A mapping of an arbitrary type name to the column to be exploded using pd.json_normalize.
            The type names are only relevant when type_column_name is specified.
        type_column_name:
            Name of the indicator column that specifies from which exploded column a record comes.
        kwargs:
            Keyword arguments passed to pd.json_normalize when exploding any array.

    Returns:
        A Dataframe with one row per item in the JSON arrays in the specified columns.
        If type_column_name is specified, it comes with an indicator column of that name.
    """
    assert all(col in first_level_df.columns for col in type2column.values()), f"Not all columns specified in type2column are present: {first_level_df.columns}"
    assert first_level_df.index.is_unique, "Dataframe index needs to be unique. Please de-duplicate/reset."

    def explode_arrays(df: pd.DataFrame) -> Optional[pd.DataFrame]:
        """Takes a single-row DataFrame and returns its exploded arrays, or None if none could be normalized."""
        dfs = {}
        row = df.iloc[0]
        for event_type, column in type2column.items():
            try:
                dfs[event_type] = pd.json_normalize(row[column], **kwargs)
            except Exception:
                continue
        if not dfs:
            return
        if type_column_name:
            return pd.concat(dfs, names=[type_column_name]).droplevel(-1)
        return pd.concat(dfs).droplevel(-1)

    drop_type_column = not bool(type_column_name)
    second_level_df = (
        first_level_df
        .groupby(level=0)
        .apply(explode_arrays)
        .reset_index(level=-1, drop=drop_type_column)
    )
    return (
        first_level_df
        .drop(columns=type2column.values())
        .join(second_level_df)
        .reset_index(drop=True)
    )

first = explode_nested_json(pd.json_normalize(test1), dict(school="schools"))
explode_nested_json(first, dict(class_room="classRooms", teacher="teachers"), type_column_name="record_type")

Output:

   studentId  studentName  schoolId  record_type  classRoomId.id  classRoomId.floor  teacherId  teacherName
0  1          James        1         class_room   1               2                  nan        nan
1  1          James        1         class_room   3               nan                nan        nan
2  1          James        1         teacher      nan             nan                1          Tom
3  1          James        1         teacher      nan             nan                2          Sarah
4  1          James        2         class_room   4               nan                nan        nan
5  1          James        2         teacher      nan             nan                1          Tom
6  1          James        2         teacher      nan             nan                2          Sarah
7  1          James        2         teacher      nan             nan                3          Tara
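
Note that this output stacks classroom and teacher records in separate rows, whereas the table in the question pairs every classroom with every teacher. If the paired form is what you need for arbitrary nesting, one option is a small recursive flattener in plain Python. The following is only a sketch under my own assumptions: `flatten_rows` is a hypothetical helper, not a pandas function; keys holding lists add no prefix to their children while keys holding dicts add `key.`; and sibling lists are assumed not to reuse the same field names, since colliding names would overwrite each other.

```python
from itertools import product


def flatten_rows(node, prefix="", key=""):
    """Return a list of flat dicts; sibling lists are cross-joined."""
    if isinstance(node, list):
        # A list contributes one or more rows per item; an empty list
        # contributes a single empty row so the parent is not dropped.
        rows = [r for item in node for r in flatten_rows(item, prefix, key)]
        return rows or [{}]
    if isinstance(node, dict):
        parts = []
        for k, v in node.items():
            # Nested dicts extend the column prefix; lists and scalars do not.
            child_prefix = f"{prefix}{k}." if isinstance(v, dict) else prefix
            parts.append(flatten_rows(v, child_prefix, k))
        rows = []
        # The cross product of all children yields one row per combination.
        for combo in product(*parts):
            merged = {}
            for piece in combo:
                merged.update(piece)
            rows.append(merged)
        return rows
    # Scalar leaf: a single one-column row.
    return [{f"{prefix}{key}": node}]


# Sample data from the question, written compactly.
test1 = [{
    "studentId": 1,
    "studentName": "James",
    "schools": [
        {
            "schoolId": 1,
            "classRooms": [{"classRoomId": {"id": 1, "floor": 2}},
                           {"classRoomId": {"id": 3}}],
            "teachers": [{"teacherId": 1, "teacherName": "Tom"},
                         {"teacherId": 2, "teacherName": "Sarah"}],
        },
        {
            "schoolId": 2,
            "classRooms": [{"classRoomId": {"id": 4}}],
            "teachers": [{"teacherId": 1, "teacherName": "Tom"},
                         {"teacherId": 2, "teacherName": "Sarah"},
                         {"teacherId": 3, "teacherName": "Tara"}],
        },
    ],
}]

rows = flatten_rows(test1)
print(len(rows))  # 7
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` gives a table in which missing keys such as `classRoomId.floor` become NaN.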
