Probably the least intrusive way to do this is to simply use SQL.
Alternative 1 (SQL)
pl.sql("""
    SELECT a.building_id, a.height, a.location, b.year_built
    FROM df1 a
    LEFT JOIN df2 b ON a.building_id = b.building_id
""").collect()
shape: (3, 4)
┌─────────────┬────────┬──────────┬────────────┐
│ building_id ┆ height ┆ location ┆ year_built │
│ ---         ┆ ---    ┆ ---      ┆ ---        │
│ i64         ┆ i64    ┆ str      ┆ i64        │
╞═════════════╪════════╪══════════╪════════════╡
│ 1           ┆ 10     ┆ A        ┆ null       │
│ 2           ┆ 20     ┆ B        ┆ 2000       │
│ 3           ┆ 30     ┆ C        ┆ 2010       │
└─────────────┴────────┴──────────┴────────────┘
Alternative 2 (hacky stuff that makes the non-working code into working code)
def alias(df: pl.DataFrame, prepend: str) -> pl.DataFrame:
    return df.with_columns(**{f"{prepend}.{old}": old for old in df.columns})

orig_select = pl.DataFrame.select

def my_select(df: pl.DataFrame, *exprs, **more_exprs) -> pl.DataFrame:
    res = orig_select(df, *exprs, **more_exprs)
    res.columns = [col.split(".", maxsplit=1)[-1] for col in res.columns]
    return res

pl.DataFrame.alias = alias
pl.DataFrame.select = my_select
df1.alias("a").join(df2.alias("b"), on="building_id", how="left").select(
    "a.building_id", "a.height", "a.location", "b.year_built"
)
shape: (3, 4)
┌─────────────┬────────┬──────────┬────────────┐
│ building_id ┆ height ┆ location ┆ year_built │
│ ---         ┆ ---    ┆ ---      ┆ ---        │
│ i64         ┆ i64    ┆ str      ┆ i64        │
╞═════════════╪════════╪══════════╪════════════╡
│ 1           ┆ 10     ┆ A        ┆ null       │
│ 2           ┆ 20     ┆ B        ┆ 2000       │
│ 3           ┆ 30     ┆ C        ┆ 2010       │
└─────────────┴────────┴──────────┴────────────┘
The alias function adds, for every column, a copy whose name has your alias prepended (a.building_id, a.height, and so on). My first instinct was to rename the columns, but your join uses the original names rather than the new ones, so the original names have to stay available. alias is then monkey patched onto pl.DataFrame so you can call it straight from your df. But wait, there's more: we also monkey patch our own select method into pl.DataFrame. It calls the real select and then renames the resulting columns to strip the "a." and "b." prefixes (or any "*." prefix).
Alternative 3 (not sql, not too hacky)
Assuming you don't like SQL and don't like hacks, another idea is to write your own join wrapper that does all of the above steps in one function, something like this:
from typing import Sequence
from polars._typing import JoinStrategy, JoinValidation, MaintainOrderJoin
def cust_join(
    df1_alias: tuple[pl.DataFrame, str],
    df2_alias: tuple[pl.DataFrame, str],
    *,
    on: str | pl.Expr | Sequence[str | pl.Expr] | None = None,
    how: JoinStrategy = "inner",
    left_on: str | pl.Expr | Sequence[str | pl.Expr] | None = None,
    right_on: str | pl.Expr | Sequence[str | pl.Expr] | None = None,
    suffix: str = "_right",
    validate: JoinValidation = "m:m",
    nulls_equal: bool = False,
    coalesce: bool | None = None,
    maintain_order: MaintainOrderJoin | None = None,
    select: list[pl.Expr | str],
):
    df1 = df1_alias[0].with_columns(
        **{f"{df1_alias[1]}.{old}": old for old in df1_alias[0].columns}
    )
    df2 = df2_alias[0].with_columns(
        **{f"{df2_alias[1]}.{old}": old for old in df2_alias[0].columns}
    )
    joined = df1.join(
        df2,
        on=on,
        how=how,
        left_on=left_on,
        right_on=right_on,
        suffix=suffix,
        validate=validate,
        nulls_equal=nulls_equal,
        coalesce=coalesce,
        maintain_order=maintain_order,
    )
    res = joined.select(*select)
    res.columns = [col.split(".", maxsplit=1)[-1] for col in res.columns]
    return res

cust_join(
    (df1, "a"),
    (df2, "b"),
    on="building_id",
    how="left",
    select=["a.building_id", "a.height", "a.location", "b.year_built"],
)
shape: (3, 4)
┌─────────────┬────────┬──────────┬────────────┐
│ building_id ┆ height ┆ location ┆ year_built │
│ ---         ┆ ---    ┆ ---      ┆ ---        │
│ i64         ┆ i64    ┆ str      ┆ i64        │
╞═════════════╪════════╪══════════╪════════════╡
│ 1           ┆ 10     ┆ A        ┆ null       │
│ 2           ┆ 20     ┆ B        ┆ 2000       │
│ 3           ┆ 30     ┆ C        ┆ 2010       │
└─────────────┴────────┴──────────┴────────────┘
Most of the signature is copied straight from the real join source code. I made the df1 and df2 inputs tuples of (frame, alias) so that the autoformatter doesn't put the alias on its own line.
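One thing to watch in the prefix-stripping step: selecting both "a.building_id" and "b.building_id" would leave two columns with the same name once the alias is removed. A rough sketch of a guard you could run on the selected names first is below; strip_alias_prefixes is a hypothetical helper, not something Polars provides, and it is pure Python so it works on any list of column names.

```python
def strip_alias_prefixes(names: list[str]) -> list[str]:
    """Hypothetical helper: drop the 'alias.' prefix and refuse name collisions."""
    # Same split as in the wrapper: everything after the first dot is the name.
    stripped = [name.split(".", maxsplit=1)[-1] for name in names]
    duplicates = sorted({name for name in stripped if stripped.count(name) > 1})
    if duplicates:
        raise ValueError(f"columns collide after stripping aliases: {duplicates}")
    return stripped


print(strip_alias_prefixes(["a.building_id", "a.height", "b.year_built"]))
# ['building_id', 'height', 'year_built']
```

If you want both key columns in the result, give one of them an explicit new name in the select instead of relying on the prefix being stripped.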