I am migrating from PySpark to Polars. In PySpark I often use aliases on dataframes so I can clearly see which columns come from which side of a join. I'd like to get similarly readable code in Polars. Conceptually, I want something like this (non-working) code:

import polars as pl

df1 = pl.DataFrame({
    "building_id": [1, 2, 3],
    "height": [10, 20, 30],
    "location": ["A", "B", "C"],
})

df2 = pl.DataFrame({
    "building_id": [2, 3, 4],
    "depth": [25, 35, 45],
    "year_built": [2000, 2010, 2020],
})


df1.alias("a").join(df2.alias("b"), on="building_id", how="left") \
   .select(
        "a.building_id", 
        "a.height", 
        "a.location", 
        "b.year_built"
    )

Does anybody know good options for this? My motivation is that it becomes hard to track which columns come from which dataframe when there are many columns, or when a dataframe is itself the result of earlier transformations.

I tried the following options:

  1. Add suffixes (i.e. tag all non-key columns from df2 with _df2). I don't like this because it clutters the code.
  2. Put columns in structs, but that becomes even messier.

1 Answer


Probably the least intrusive way to do it would be to simply use SQL.

Alternative 1 (SQL)

pl.sql("""
       SELECT a.building_id, a.height, a.location, b.year_built
       FROM df1 a
       LEFT JOIN df2 b ON a.building_id = b.building_id""").collect()

shape: (3, 4)
┌─────────────┬────────┬──────────┬────────────┐
│ building_id ┆ height ┆ location ┆ year_built │
│ ---         ┆ ---    ┆ ---      ┆ ---        │
│ i64         ┆ i64    ┆ str      ┆ i64        │
╞═════════════╪════════╪══════════╪════════════╡
│ 1           ┆ 10     ┆ A        ┆ null       │
│ 2           ┆ 20     ┆ B        ┆ 2000       │
│ 3           ┆ 30     ┆ C        ┆ 2010       │
└─────────────┴────────┴──────────┴────────────┘

Alternative 2 (hacky stuff that makes the non-working code into working code)

def alias(df: pl.DataFrame, prepend: str) -> pl.DataFrame:
    return df.with_columns(**{f"{prepend}.{old}": old for old in df.columns})


orig_select = pl.DataFrame.select


def my_select(df: pl.DataFrame, *exprs, **more_exprs) -> pl.DataFrame:
    res = orig_select(df, *exprs, **more_exprs)
    res.columns = [col.split(".", maxsplit=1)[-1] for col in res.columns]
    return res


pl.DataFrame.alias = alias
pl.DataFrame.select = my_select

df1.alias("a").join(df2.alias("b"), on="building_id", how="left").select(
    "a.building_id", "a.height", "a.location", "b.year_built"
)

shape: (3, 4)
┌─────────────┬────────┬──────────┬────────────┐
│ building_id ┆ height ┆ location ┆ year_built │
│ ---         ┆ ---    ┆ ---      ┆ ---        │
│ i64         ┆ i64    ┆ str      ┆ i64        │
╞═════════════╪════════╪══════════╪════════════╡
│ 1           ┆ 10     ┆ A        ┆ null       │
│ 2           ┆ 20     ┆ B        ┆ 2000       │
│ 3           ┆ 30     ┆ C        ┆ 2010       │
└─────────────┴────────┴──────────┴────────────┘

We create a function called alias which adds a copy of every column with your alias prepended to its name. My first instinct was to rename the columns, but your join uses the original names, not the new ones, so I needed to keep the original names available. Of course, we monkey patch that function onto pl.DataFrame so you can call it from your df. BUT WAIT, there's more: we also monkey patch our own select method onto pl.DataFrame. It calls the real select and then renames the resulting columns to strip the "a." and "b." (or any "*.") prefix. Keep in mind that the patched select affects every DataFrame in the process, so any column name containing a "." gets truncated, whether or not it came from alias.

Alternative 3 (not SQL, not too hacky)

Assuming you don't like SQL and don't like hacks, another idea would be to write your own join wrapper that does all of the above steps in one function, something like this:

from typing import Sequence
from polars._typing import JoinStrategy, JoinValidation, MaintainOrderJoin


def cust_join(
    df1_alias: tuple[pl.DataFrame, str],
    df2_alias: tuple[pl.DataFrame, str],
    *,
    on: str | pl.Expr | Sequence[str | pl.Expr] | None = None,
    how: JoinStrategy = "inner",
    left_on: str | pl.Expr | Sequence[str | pl.Expr] | None = None,
    right_on: str | pl.Expr | Sequence[str | pl.Expr] | None = None,
    suffix: str = "_right",
    validate: JoinValidation = "m:m",
    nulls_equal: bool = False,
    coalesce: bool | None = None,
    maintain_order: MaintainOrderJoin | None = None,
    select: list[pl.Expr | str],
):
    df1 = df1_alias[0].with_columns(
        **{f"{df1_alias[1]}.{old}": old for old in df1_alias[0].columns}
    )
    df2 = df2_alias[0].with_columns(
        **{f"{df2_alias[1]}.{old}": old for old in df2_alias[0].columns}
    )
    joined = df1.join(
        df2,
        on=on,
        how=how,
        left_on=left_on,
        right_on=right_on,
        suffix=suffix,
        validate=validate,
        nulls_equal=nulls_equal,
        coalesce=coalesce,
        maintain_order=maintain_order
    )
    res = joined.select(*select)
    res.columns = [col.split(".", maxsplit=1)[-1] for col in res.columns]
    return res



cust_join(
    (df1,"a"),
    (df2,"b"),
    on="building_id",
    how="left",
    select=["a.building_id", "a.height", "a.location", "b.year_built"],
)

shape: (3, 4)
┌─────────────┬────────┬──────────┬────────────┐
│ building_id ┆ height ┆ location ┆ year_built │
│ ---         ┆ ---    ┆ ---      ┆ ---        │
│ i64         ┆ i64    ┆ str      ┆ i64        │
╞═════════════╪════════╪══════════╪════════════╡
│ 1           ┆ 10     ┆ A        ┆ null       │
│ 2           ┆ 20     ┆ B        ┆ 2000       │
│ 3           ┆ 30     ┆ C        ┆ 2010       │
└─────────────┴────────┴──────────┴────────────┘

Most of the signature is just copy-paste from the real join source code. I made the df1 and df2 inputs a tuple with the alias so that the autoformatter doesn't put the alias on its own line.
