Example setup
If you have a df like this:
df = pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
Basic non-nested multi-column unnest
You can unnest the ab and cd at the same time by just doing
df.unnest('ab','cd')
If you don't know in advance what your column names and types are in advance then you can just use a list comprehension/generator like this:
df.unnest(col_name for col_name, dtype in df.schema.items() if dtype == pl.Struct)
Nested Example
If you have structs inside structs like:
df = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select(pl.struct("a", "b").alias("ab"), pl.struct("c", "d").alias("cd"), "e")
.select(pl.struct("ab", "cd").alias("abcd"), "e")
)
then I don't think you can get away from some kind of while loop but this might be more concise:
Repeated manipulation of df
This method is the most concise and intuitive means to unnest nested structs as it repeatedly unnests the frame and then checks if there are structs until there aren't any more structs
while any(x==pl.Struct for x in df.dtypes):
df=df.unnest(col_name for col_name,dtype in df.schema.items() if dtype==pl.Struct)
Alternative method that works for lazyframes
The previous method requires repeatedly setting df. While that does NOT mean it is repeatedly copying any of the underlying data it does mean it wouldn't work on lazyframes without materializing. Because the schema contains the entire nested structure we can use the schema to generate expressions that are lazyframe friendly. There is also something satisfying in not having to reassign the df even knowing that it doesn't produce copies in the process.
def unnest_all(df):
exprs = []
def parse_struct_field(struct_field, base_expr, base_name):
ret_list = []
for field in struct_field.fields:
if field.dtype == pl.Struct:
ret_list.extend(
parse_struct_field(
field.dtype,
base_expr.struct.field(field.name),
f"{base_name}_{field.name}")
)
else:
ret_list.append(
base_expr.struct.field(field.name)
.alias(f"{base_name}_{field.name}"))
return ret_list
# first_level_structs = [x for x, y in df.schema.items() if y == pl.Struct]
for col in df.columns:
if df.schema[col]==pl.Struct:
exprs.extend(parse_struct_field(df.schema[col], pl.col(col), col))
else:
exprs.append(col)
return df.select(*exprs)
This function works by looping through the schema and creating a list of expressions for each outer column. If the outer column is a schema then it calls the parse_struct_field function which itself loops through the inner fields of the struct creating expressions that extract the inner field as a standalone column. When one of those fields is also a struct it just calls itself using recursion to deal with any level of nesting. To deal with name conflicts it keeps track of all the outer column names and prepends that to each inner column so you end up with column names like "abcd_ab_a".
This method will fail for lazyframes that produce a struct opaquely. By that I mean methods like .list.to_struct or str.extract_groups where it can't produce a schema without materializing.
unnestinternally parallelized when giving it a list of columns?