3

I have this example dataframe in polars:

import polars as pl

df_example = pl.DataFrame(
    {
        "DATE": ["2024-11-11", "2024-11-11", "2024-11-12", "2024-11-12", "2024-11-13"],
        "A": [None, None, "option1", "option2", None],
        "B": [None, None, "YES", "YES", "NO"],
    }
)

Which looks like this:

shape: (5, 3)
┌────────────┬─────────┬──────┐
│ DATE       ┆ A       ┆ B    │
│ ---        ┆ ---     ┆ ---  │
│ str        ┆ str     ┆ str  │
╞════════════╪═════════╪══════╡
│ 2024-11-11 ┆ null    ┆ null │
│ 2024-11-11 ┆ null    ┆ null │
│ 2024-11-12 ┆ option1 ┆ YES  │
│ 2024-11-12 ┆ option2 ┆ YES  │
│ 2024-11-13 ┆ null    ┆ NO   │
└────────────┴─────────┴──────┘

As you can see this is a long format dataframe. I want to have it in a wide format, meaning that I want the DATE to be unique per row and for each other column several columns have to be created. What I want to achieve is:

shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE       ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ ---        ┆ ---       ┆ ---       ┆ ---   ┆ ---  │
│ str        ┆ bool      ┆ bool      ┆ bool  ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null      ┆ null      ┆ null  ┆ null │
│ 2024-11-12 ┆ true      ┆ true      ┆ true  ┆ null │
│ 2024-11-13 ┆ null      ┆ null      ┆ null  ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘

I have tried doing the following:

df_example.pivot(
    index="DATE", on=["A", "B"], values=["A", "B"], aggregate_function="first"
)

However, I get this error:

# DuplicateError: column with name 'null' has more than one occurrence

Which is logical, as it tries to create a column for the Null values in columns A, and a column for the Null values in column B.

I am looking for a clean solution to this problem. I know I can impute the nulls per column with something unique and then do the pivot. Or by pivoting per column and then dropping the Null columns. However, this will create unnecessary columns. I want something more elegant.

2 Answers 2

1

I ended up with:

(
    df_example.pipe(
        lambda df: df.group_by("DATE").agg(
            [
                pl.col(col).eq(val).any().alias(f"{col}_{val}")
                for col in df.select(pl.exclude("DATE")).columns
                for val in df.get_column(col).unique().drop_nulls()
            ]
        )
    ).sort("DATE")
)
Sign up to request clarification or add additional context in comments.

1 Comment

Hey Oliver, is this the best version of this solution so far? It still seems a bit clunky tbh.
1

Do this for each column separately and then

pl.concat([
    df_example
    .pivot(
        index="DATE", values=col, on=col, 
        aggregate_function=pl.element().drop_nulls().is_not_null().first()
    )
    .drop('null')
    .select(
        pl.col("DATE"),
        pl.exclude("DATE").name.prefix(f"{col}_")
    )
    for col in df_example.columns if col != "DATE"
], how="align")
shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE       ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ ---        ┆ ---       ┆ ---       ┆ ---   ┆ ---  │
│ str        ┆ bool      ┆ bool      ┆ bool  ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null      ┆ null      ┆ null  ┆ null │
│ 2024-11-12 ┆ true      ┆ true      ┆ true  ┆ null │
│ 2024-11-13 ┆ null      ┆ null      ┆ null  ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘

alternatively

(
    df_example
    .unpivot(index="DATE")
    .select(pl.col.DATE, value = pl.col.variable + "_" + pl.col.value)
    .unique(maintain_order=True)
    .pivot(
        index="DATE", on="value", values="value",
        aggregate_function=pl.element().is_not_null().any()
    )
    .drop("null")
)
shape: (3, 5)
┌────────────┬───────────┬───────────┬───────┬──────┐
│ DATE       ┆ A_option1 ┆ A_option2 ┆ B_YES ┆ B_NO │
│ ---        ┆ ---       ┆ ---       ┆ ---   ┆ ---  │
│ str        ┆ bool      ┆ bool      ┆ bool  ┆ bool │
╞════════════╪═══════════╪═══════════╪═══════╪══════╡
│ 2024-11-11 ┆ null      ┆ null      ┆ null  ┆ null │
│ 2024-11-12 ┆ true      ┆ true      ┆ true  ┆ null │
│ 2024-11-13 ┆ null      ┆ null      ┆ null  ┆ true │
└────────────┴───────────┴───────────┴───────┴──────┘

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.