If I have a DataFrame, I can create a column with a single value like this:

import polars as pl

df = pl.DataFrame([[1, 2, 3]])
df.with_columns(pl.lit("ok").alias("metadata"))
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ metadata │
│ ---      ┆ ---      │
│ i64      ┆ str      │
╞══════════╪══════════╡
│ 1        ┆ ok       │
│ 2        ┆ ok       │
│ 3        ┆ ok       │
└──────────┴──────────┘

But with a pl.Object column, the same approach fails:

df = pl.DataFrame([[1, 2, 3]])
df.with_columns(pl.lit("ok", dtype=pl.Object).alias("metadata"))
# InvalidOperationError: casting from Utf8View to FixedSizeBinary(8) not supported

Using a one-element pl.Series does not work either:

df.with_columns(pl.Series(["ok"], dtype=pl.Object).alias("metadata"))
# InvalidOperationError: Series metadata, length 1 doesn't 
# match the DataFrame height of 3
# If you want expression: Series[metadata] to be broadcasted, 
# ensure it is a scalar (for instance by adding '.first()').

It seems that I need either to create a pl.Series of the correct length manually (like pl.Series(["ok"] * df.height, dtype=pl.Object)), or to do a cross join like this:

df.join(pl.Series(["ok"], dtype=pl.Object).to_frame("metadata"), how="cross")

It works, but is not very elegant. Are there any better solutions?

NB: I used a string only as an example. I really need a pl.Object column to store various heterogeneous data, not strings, and I cannot use, say, pl.Struct instead.

2 Answers

I'm not sure why Polars specifically rejects String (Utf8View) here, but this works fine with other Object values that are already scalars:

>>> df = pl.DataFrame([[1, 2, 3]])
... df.with_columns(pl.lit(None, dtype=pl.Object).alias("metadata"))
...
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ metadata │
│ ---      ┆ ---      │
│ i64      ┆ object   │
╞══════════╪══════════╡
│ 1        ┆ null     │
│ 2        ┆ null     │
│ 3        ┆ null     │
└──────────┴──────────┘
>>> df.with_columns(pl.lit(None).cast(pl.Object).alias("metadata"))
...
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ metadata │
│ ---      ┆ ---      │
│ i64      ┆ object   │
╞══════════╪══════════╡
│ 1        ┆ null     │
│ 2        ┆ null     │
│ 3        ┆ null     │
└──────────┴──────────┘

Or follow the error message's suggestion and use .first() on a custom object, forcing it to be a scalar:

>>> class Foo: pass
...
>>> df = pl.DataFrame([[1, 2, 3]])
... df.with_columns(pl.lit(Foo(), allow_object=True).first().alias("metadata"))
...
shape: (3, 2)
┌──────────┬─────────────────────────────────┐
│ column_0 ┆ metadata                        │
│ ---      ┆ ---                             │
│ i64      ┆ object                          │
╞══════════╪═════════════════════════════════╡
│ 1        ┆ <__main__.Foo object at 0xffff… │
│ 2        ┆ <__main__.Foo object at 0xffff… │
│ 3        ┆ <__main__.Foo object at 0xffff… │
└──────────┴─────────────────────────────────┘

Interestingly, this appears to reveal a bug: successive calls without allow_object=True don't raise. Neither the original df instance nor the library-level pl.lit feels like it should have been modified by .with_columns(), and to be Pythonic, fresh instances should always raise in the same way.

>>> df = pl.DataFrame([[1, 2, 3]])
... df.with_columns(pl.lit(Foo()).first().alias("metadata"))
...
shape: (3, 2)
┌──────────┬─────────────────────────────────┐
│ column_0 ┆ metadata                        │
│ ---      ┆ ---                             │
│ i64      ┆ object                          │
╞══════════╪═════════════════════════════════╡
│ 1        ┆ <__main__.Foo object at 0xffff… │
│ 2        ┆ <__main__.Foo object at 0xffff… │
│ 3        ┆ <__main__.Foo object at 0xffff… │
└──────────┴─────────────────────────────────┘
>>> pl.__version__
'1.34.0'

3 Comments

Thanks! Interesting: Foo() works, but df.with_columns(pl.lit({"a": "b"}, dtype=pl.Object).first().alias("metadata")) raises the same error: casting from Utf8View to FixedSizeBinary(8) not supported
Re: a bug: the docs say that it will raise "If type is unknown", but I don't understand what that means specifically. Is the type of Foo unknown in this case?
I suppose I'm less convinced it's a bug, but it's surprising to me that successive calls with the same args may or may not raise. This almost certainly means pl.lit() maintains a collection of acceptable object types, though I don't see that as a feature when it isn't done explicitly!

You can wrap the Series in pl.lit() and use .first() on that.

pl.lit(pl.Series(..., dtype=pl.Object)).first()

Tested with both your string and dict examples:

for data in ["ok", {"a": "b"}]:
    out = df.with_columns(
        pl.lit(pl.Series([data], dtype=pl.Object)).first().alias("metadata")
    )
    print(out)
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ metadata │
│ ---      ┆ ---      │
│ i64      ┆ object   │
╞══════════╪══════════╡
│ 1        ┆ ok       │
│ 2        ┆ ok       │
│ 3        ┆ ok       │
└──────────┴──────────┘
shape: (3, 2)
┌──────────┬────────────┐
│ column_0 ┆ metadata   │
│ ---      ┆ ---        │
│ i64      ┆ object     │
╞══════════╪════════════╡
│ 1        ┆ {'a': 'b'} │
│ 2        ┆ {'a': 'b'} │
│ 3        ┆ {'a': 'b'} │
└──────────┴────────────┘
