
I am working with multiple parquet datasets that were written with nested structs (sometimes multiple levels deep). I need to output a flattened (no-struct) schema. Right now the only way I can think to do that is to use for loops to iterate through the columns. Here is a simplified example of the looping.

while len([x.name for x in df if x.dtype == pl.Struct]) > 0:
    for col in df:
        if col.dtype == pl.Struct:
            df = df.unnest(col.name)

This works, maybe that is the only way to do it, and if so it would be helpful to know that. But Polars is pretty neat and I'm wondering if there is a more functional way to do this without all the looping and reassigning the df to itself.

  • Can you add a minimal example? It is hard to give an answer if we have to come up with the data and the expected result as well. Commented Dec 1, 2022 at 9:39
  • @ritchie46 I'm guessing they're after something like in my answer. One question for you: is unnest internally parallelized when giving it a list of columns? Commented Dec 1, 2022 at 16:17
  • No, an unnest operation is free. :) Wrapping columns in and out of structs is always free if the chunks are equal. Commented Dec 1, 2022 at 16:45

5 Answers


Example setup

If you have a df like this:

df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")

Basic non-nested multi-column unnest

You can unnest the ab and cd columns at the same time by just doing:

df.unnest('ab', 'cd')

If you don't know in advance what your column names and types are, then you can use a list comprehension/generator like this:

df.unnest(col_name for col_name, dtype in df.schema.items() if dtype == pl.Struct)

Nested Example

If you have structs inside structs like:

df = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
    .select(pl.struct("ab", "cd").alias("abcd"), "e")
)

then I don't think you can get away from some kind of while loop, but this might be more concise:

Repeated manipulation of df

This method is the most concise and intuitive way to unnest nested structs: it repeatedly unnests the frame and checks for remaining struct columns until there are none left.

while any(dtype == pl.Struct for dtype in df.dtypes):
    df = df.unnest(col_name for col_name, dtype in df.schema.items() if dtype == pl.Struct)

Alternative method that works for lazyframes

The previous method requires repeatedly reassigning df. While that does NOT mean it is repeatedly copying any of the underlying data, it does mean it wouldn't work on lazyframes without materializing. Because the schema contains the entire nested structure, we can use the schema to generate expressions that are lazyframe-friendly. There is also something satisfying in not having to reassign the df, even knowing that it doesn't produce copies in the process.

def unnest_all(df):
    exprs = []

    def parse_struct_field(struct_dtype, base_expr, base_name):
        # Recursively build one expression per leaf field, prefixing the
        # parent names onto each alias to avoid name collisions.
        ret_list = []
        for field in struct_dtype.fields:
            if field.dtype == pl.Struct:
                ret_list.extend(
                    parse_struct_field(
                        field.dtype,
                        base_expr.struct.field(field.name),
                        f"{base_name}_{field.name}",
                    )
                )
            else:
                ret_list.append(
                    base_expr.struct.field(field.name)
                    .alias(f"{base_name}_{field.name}")
                )
        return ret_list

    for col in df.columns:
        if df.schema[col] == pl.Struct:
            exprs.extend(parse_struct_field(df.schema[col], pl.col(col), col))
        else:
            exprs.append(col)
    return df.select(*exprs)

This function works by looping through the schema and creating a list of expressions, one set per outer column. If an outer column is a struct, it calls the parse_struct_field function, which loops through the inner fields of the struct, creating expressions that extract each inner field as a standalone column. When one of those fields is itself a struct, the function recurses to handle any level of nesting. To avoid name conflicts, it keeps track of the outer column names and prepends them to each inner column, so you end up with column names like "abcd_ab_a".

This method will fail for lazyframes that produce a struct opaquely. By that I mean methods like .list.to_struct or .str.extract_groups, where it can't produce a schema without materializing.




This is a minor addition. If you're concerned about constantly re-looping through a large number of columns, you can create a recursive formula to address only structs (and nested structs).

def unnest_all(self: pl.DataFrame):
    cols = []
    for next_col in self:
        if next_col.dtype != pl.Struct:
            cols.append(next_col)
        else:
            cols.extend(next_col.struct.unnest().unnest_all().get_columns())

    return pl.DataFrame(cols)


pl.DataFrame.unnest_all = unnest_all

So, using the second example by @Dean MacGregor above:

df = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
    .select(pl.struct("ab", "cd").alias("abcd"), "e")
)
>>> df
shape: (3, 2)
┌───────────────┬─────┐
│ abcd          ┆ e   │
│ ---           ┆ --- │
│ struct[2]     ┆ i64 │
╞═══════════════╪═════╡
│ {{1,2},{3,4}} ┆ 5   │
│ {{2,3},{4,5}} ┆ 6   │
│ {{3,4},{5,6}} ┆ 7   │
└───────────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   ┆ e   │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   ┆ 4   ┆ 5   │
│ 2   ┆ 3   ┆ 4   ┆ 5   ┆ 6   │
│ 3   ┆ 4   ┆ 5   ┆ 6   ┆ 7   │
└─────┴─────┴─────┴─────┴─────┘

And using the first example:

df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
>>> df
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ ab        ┆ cd        ┆ e   │
│ ---       ┆ ---       ┆ --- │
│ struct[2] ┆ struct[2] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,2}     ┆ {3,4}     ┆ 5   │
│ {2,3}     ┆ {4,5}     ┆ 6   │
│ {3,4}     ┆ {5,6}     ┆ 7   │
└───────────┴───────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a   ┆ b   ┆ c   ┆ d   ┆ e   │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   ┆ 4   ┆ 5   │
│ 2   ┆ 3   ┆ 4   ┆ 5   ┆ 6   │
│ 3   ┆ 4   ┆ 5   ┆ 6   ┆ 7   │
└─────┴─────┴─────┴─────┴─────┘

In the end, I'm not sure that this saves you much wall-clock time (or RAM).



This type of transformation is closely related to: a Polars equivalent of pandas.json_normalize()

It has come up a few times but nothing seems to have been added as of yet.

The approach is similar to your answer in that it walks the schema to generate the expressions.

def _unnest_all(schema, separator):
    def _unnest(schema, path=[]):
        for name, dtype in schema.items():
            base_type = dtype.base_type()
            
            if base_type == pl.Struct:
                yield from _unnest(dtype.to_schema(), path + [name])
            else:
                yield path + [name], dtype
                
    for (col, *fields), dtype in _unnest(schema):
        expr = pl.col(col)
        
        for field in fields:
            expr = expr.struct[field]
            
        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col] + fields)
        
        yield expr.alias(name)
        
def unnest_all(df, separator="."):
    return df.select(_unnest_all(df.schema, separator))

>>> df.pipe(unnest_all)
shape: (3, 5)
┌───────────┬───────────┬───────────┬───────────┬─────┐
│ abcd.ab.a ┆ abcd.ab.b ┆ abcd.cd.c ┆ abcd.cd.d ┆ e   │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆ --- │
│ i64       ┆ i64       ┆ i64       ┆ i64       ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╡
│ 1         ┆ 2         ┆ 3         ┆ 4         ┆ 5   │
│ 2         ┆ 3         ┆ 4         ┆ 5         ┆ 6   │
│ 3         ┆ 4         ┆ 5         ┆ 6         ┆ 7   │
└───────────┴───────────┴───────────┴───────────┴─────┘



The following code will un-nest all columns recursively. As it uses only column references, it is not compute intensive.

import polars as pl
test = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
    .select(pl.struct("ab", "cd").alias("abcd"), "e")
)

def unnest(column, type_):
    if column.dtype == pl.Struct:
        cols = []
        for field in type_.fields:
            col = column.struct.field(field.name)
            normalized = unnest(col, field.dtype)
            cols.extend(
                [v.alias(f"{column.name}.{v.name}") for v in normalized]
            )
        return cols
    else:
        return [column]

def unnest_df(df):
    schema = df.schema

    cols = []

    for next_col in df:
        cols.extend(
            unnest(next_col, schema[next_col.name])
        )

    return pl.DataFrame(cols)

unnest_df(test)
shape: (3, 5)
┌───────────┬───────────┬───────────┬───────────┬─────┐
│ abcd.ab.a ┆ abcd.ab.b ┆ abcd.cd.c ┆ abcd.cd.d ┆ e   │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆ --- │
│ i64       ┆ i64       ┆ i64       ┆ i64       ┆ i64 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═════╡
│ 1         ┆ 2         ┆ 3         ┆ 4         ┆ 5   │
│ 2         ┆ 3         ┆ 4         ┆ 5         ┆ 6   │
│ 3         ┆ 4         ┆ 5         ┆ 6         ┆ 7   │
└───────────┴───────────┴───────────┴───────────┴─────┘



The other answers taught me a lot. I encountered a new situation where I wanted an easy way to get each column labeled with all the structs it came from, i.e. for

pl.col("my").struct.field("test").struct.field("thing")

I wanted to recover

my.test.thing

as a string which I could easily use when reading a subset of columns with pyarrow via

pq.ParquetDataset(path).read(columns = ["my.test.thing"])

Since there are many hundreds of columns and the nesting can go quite deep, I wrote functions to do a depth-first search on the schema and extract the column paths in that pyarrow-friendly format; I can then use those paths to select every column, unnested, in one go.

First, I worked with the pyarrow schema because I couldn't figure out how to drill into the structs in the polars schema:

schema = df.to_arrow().schema

Navigating structs in that schema is quirky; the top level behaves differently from the deeper levels. I ended up writing two functions: the first navigates the top-level structure and the second continues the search below it:

import pyarrow as pa

# Both functions append the dotted paths they find to a module-level
# `column_paths` list, which must be initialized before the search.

def schema_top_level_DFS(pa_schema):
    top_level_stack = list(range(len(pa_schema)))
    while top_level_stack:
        working_top_level_index = top_level_stack.pop()
        working_element_name = pa_schema.names[working_top_level_index]
        if isinstance(pa_schema.types[working_top_level_index], pa.StructType):
            second_level_stack = list(range(len(pa_schema.types[working_top_level_index])))
            while second_level_stack:
                working_second_level_index = second_level_stack.pop()
                schema_DFS(
                    pa_schema.types[working_top_level_index][working_second_level_index],
                    working_element_name,
                )
        else:
            column_paths.append(working_element_name)


def schema_DFS(incoming_element, upstream_names):
    current_name = incoming_element.name
    combined_names = ".".join([upstream_names, current_name])
    if isinstance(incoming_element.type, pa.StructType):
        stack = list(range(len(incoming_element.type)))
        while stack:
            working_index = stack.pop()
            working_element = incoming_element.type[working_index]
            schema_DFS(working_element, combined_names)
    else:
        column_paths.append(combined_names)

So that running

column_paths = []
schema_top_level_DFS(schema)

gives me column paths like

['struct_name_1.inner_struct_name_2.thing1', 'struct_name_1.inner_struct_name_2.thing2']

To actually do the unnesting, I wasn't sure how to do better than a function with a case statement:

def return_pl_formatting(col_string):
    col_list = col_string.split(".")
    match len(col_list):
        case 1:
            return pl.col(col_list[0]).alias(col_string)
        case 2:
            return pl.col(col_list[0]).struct.field(col_list[1]).alias(col_string)
        case 3:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).alias(col_string)
        case 4:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).alias(col_string)
        case 5:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).alias(col_string)
        case 6:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).struct.field(col_list[5]).alias(col_string)

Then get my unnested and nicely named df with:

df.select(return_pl_formatting(x) for x in column_paths)

To show the output on the example from @Dean MacGregor:

test = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
    .select(pl.struct("ab", "cd").alias("abcd"), "e")
)
column_paths = []
schema_top_level_DFS(test.to_arrow().schema)
print(test.select(return_pl_formatting(x) for x in column_paths))
shape: (3, 5)
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ e   ┆ abcd.cd.d ┆ abcd.cd.c ┆ abcd.ab.b ┆ abcd.ab.a │
│ --- ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ i64 ┆ i64       ┆ i64       ┆ i64       ┆ i64       │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 5   ┆ 4         ┆ 3         ┆ 2         ┆ 1         │
│ 6   ┆ 5         ┆ 4         ┆ 3         ┆ 2         │
│ 7   ┆ 6         ┆ 5         ┆ 4         ┆ 3         │
└─────┴───────────┴───────────┴───────────┴───────────┘

1 Comment

I'm a novice in python, polars, and all things so I don't know if this was a good idea. But I'll leave what I did here in case anyone ever considers doing the same and wants to build off what I did. Or even better if anyone can offer improvements.
