Select all columns where column name starts with string

Question

Given the following dataframe, is there a way to select only the columns whose names start with a given prefix? I know I could do e.g. [pl.col(column) for column in df.columns if column.startswith("prefix_")], but I'm wondering whether it can be done as part of a single expression.

df = pl.DataFrame(
    {"prefix_a": [1, 2, 3], "prefix_b": [1, 2, 3], "some_column": [3, 2, 1]}
)
df.select(pl.all().<column_name_starts_with>("prefix_"))

Would this be possible to do lazily?

jqurious · Accepted Answer · 2024-09-16 18:53:31Z

Starting from Polars 0.18.1, you can use selectors (polars.selectors.starts_with), which provide a more intuitive way to select columns from DataFrame or LazyFrame objects based on their name, dtype, or other properties.

>>> import polars as pl
>>> import polars.selectors as cs
>>> 
>>> df = pl.DataFrame(
...     {"prefix_a": [1, 2, 3], "prefix_b": [1, 2, 3], "some_column": [3, 2, 1]} 
... )
>>> df
shape: (3, 3)
┌──────────┬──────────┬─────────────┐
│ prefix_a ┆ prefix_b ┆ some_column │
│ ---      ┆ ---      ┆ ---         │
│ i64      ┆ i64      ┆ i64         │
╞══════════╪══════════╪═════════════╡
│ 1        ┆ 1        ┆ 3           │
│ 2        ┆ 2        ┆ 2           │
│ 3        ┆ 3        ┆ 1           │
└──────────┴──────────┴─────────────┘
>>> # print(df.lazy().select(cs.starts_with("prefix_")).collect()) # for LazyFrame
>>> print(df.select(cs.starts_with("prefix_"))) # For DataFrame
shape: (3, 2)
┌──────────┬──────────┐
│ prefix_a ┆ prefix_b │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 1        │
│ 2        ┆ 2        │
│ 3        ┆ 3        │
└──────────┴──────────┘

jqurious · Accepted Answer · 2024-09-16 18:53:28Z

From the documentation for polars.col, the expression can take one of the following arguments:

a single column by name

all columns by using a wildcard “*”

column by regular expression if the regex starts with ^ and ends with $

So in this case, we can use a regular expression to select the columns by prefix. This also works in lazy mode.

(
    df
    .lazy()
    .select(pl.col('^prefix_.*$'))
    .collect()
)

shape: (3, 2)
┌──────────┬──────────┐
│ prefix_a ┆ prefix_b │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 1        │
│ 2        ┆ 2        │
│ 3        ┆ 3        │
└──────────┴──────────┘

Note: polars.exclude also accepts regular expressions:

(
    df
    .lazy()
    .select(pl.exclude('^prefix_.*$'))
    .collect()
)

shape: (3, 1)
┌─────────────┐
│ some_column │
│ ---         │
│ i64         │
╞═════════════╡
│ 3           │
│ 2           │
│ 1           │
└─────────────┘

user459872 · Accepted Answer · 2024-05-03 08:58:46Z

You can also use polars.selectors.matches with the pattern ^prefix_.

>>> import polars as pl
>>> import polars.selectors as cs
>>> 
>>> df = pl.DataFrame(
...     {"prefix_a": [1, 2, 3], "prefix_b": [1, 2, 3], "some_column": [3, 2, 1]}
... )
>>> 
>>> df
shape: (3, 3)
┌──────────┬──────────┬─────────────┐
│ prefix_a ┆ prefix_b ┆ some_column │
│ ---      ┆ ---      ┆ ---         │
│ i64      ┆ i64      ┆ i64         │
╞══════════╪══════════╪═════════════╡
│ 1        ┆ 1        ┆ 3           │
│ 2        ┆ 2        ┆ 2           │
│ 3        ┆ 3        ┆ 1           │
└──────────┴──────────┴─────────────┘
>>> 
>>> df.lazy().select(cs.matches("^prefix_")).collect()
shape: (3, 2)
┌──────────┬──────────┐
│ prefix_a ┆ prefix_b │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 1        ┆ 1        │
│ 2        ┆ 2        │
│ 3        ┆ 3        │
└──────────┴──────────┘
