17

I have a polars DataFrame with a number of Series that look like:

pl.Series(['cow', 'cat', '', 'lobster', ''])
# Series: '' [str]
# [
#   "cow"
#   "cat"
#   ""
#   "lobster"
#   ""
# ]

and I'd like them to be

pl.Series(['cow', 'cat', None, 'lobster', None])
# Series: '' [str]
# [
#   "cow"
#   "cat"
#   null
#   "lobster"
#   null
# ]

A simple string replacement gives me an error:

pl.Series(['cow', 'cat', '', 'lobster', '']).str.replace('', None)

ComputeError: value cannot be 'null' in 'replace' expression

What's the idiomatic way of doing this for a Series/DataFrame in polars?

1
  • 1
    Note that pl.Null is a DataType. If you want to set a null value, you should use pythons builtin None. Commented May 23, 2022 at 6:46

2 Answers 2

24

Series

For a single Series, you can use the set method.

s = pl.Series(["cow", "cat", "", "lobster", ""])
s.set(s.str.len_chars() == 0, None)
shape: (5,)
Series: '' [str]
[
        "cow"
        "cat"
        null
        "lobster"
        null
]

DataFrame

For DataFrames, I would suggest using when/then/otherwise. For example, with this data:

df = pl.DataFrame({
    "str1": ["cow", "dog", "", "lobster", ""],
    "str2": ["", "apple", "orange", "", "kiwi"],
    "str3": ["house", "", "apartment", "condo", ""],
})
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1    ┆ str2   ┆ str3      │
│ ---     ┆ ---    ┆ ---       │
│ str     ┆ str    ┆ str       │
╞═════════╪════════╪═══════════╡
│ cow     ┆        ┆ house     │
│ dog     ┆ apple  ┆           │
│         ┆ orange ┆ apartment │
│ lobster ┆        ┆ condo     │
│         ┆ kiwi   ┆           │
└─────────┴────────┴───────────┘

We can run a replacement on all string columns as follows:

df.with_columns(
    pl.when(pl.col(pl.String).str.len_chars() == 0)
    .then(None)
    .otherwise(pl.col(pl.String))
    .name.keep()
)
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1    ┆ str2   ┆ str3      │
│ ---     ┆ ---    ┆ ---       │
│ str     ┆ str    ┆ str       │
╞═════════╪════════╪═══════════╡
│ cow     ┆ null   ┆ house     │
│ dog     ┆ apple  ┆ null      │
│ null    ┆ orange ┆ apartment │
│ lobster ┆ null   ┆ condo     │
│ null    ┆ kiwi   ┆ null      │
└─────────┴────────┴───────────┘

The above should be fairly performant.

If you only want to replace empty strings with null on certain columns, you can provide a list:

only_these = ["str1", "str2"]
df.with_columns(
    pl.when(pl.col(only_these).str.len_chars() == 0)
    .then(None)
    .otherwise(pl.col(only_these))
    .name.keep()
)
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1    ┆ str2   ┆ str3      │
│ ---     ┆ ---    ┆ ---       │
│ str     ┆ str    ┆ str       │
╞═════════╪════════╪═══════════╡
│ cow     ┆ null   ┆ house     │
│ dog     ┆ apple  ┆           │
│ null    ┆ orange ┆ apartment │
│ lobster ┆ null   ┆ condo     │
│ null    ┆ kiwi   ┆           │
└─────────┴────────┴───────────┘
Sign up to request clarification or add additional context in comments.

Comments

8

To replace empty strings in a string series.

series.replace("", None)

To replace empty strings in a specific dataframe column "A".

df.with_columns(pl.col("A").replace("", None))

To replace empty strings in all string columns of a dataframe.

df.with_columns(pl.col(pl.String).replace("", None))

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.