Python Polars - how to replace strings in a df column with lists with values from dictionary?

Question

This is a follow-up to a question that was previously answered.

I have a large dataframe df that looks like this (the 'SKU' column holds lists):

| SKU                                                                  | Count | Percent     |
|----------------------------------------------------------------------|-------|-------------|
| "('000000009100000749',)"                                            | 110   | 0.029633621 |
| "('000000009100000749', '000000009100000776')"                       | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000776', '000000009100002260')" | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002260')" | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002530')" | 1     | 0.000269397 |

I need to replace the values in the 'SKU' column with the corresponding values from a dictionary df_unique that looks like this (please ignore the format below, it is a dict):

| skus (str)         | code (i64) |
|--------------------|------------|
| 000000009100000749 | 1          |
| 000000009100000785 | 2          |
| 000000009100002088 | 3          |

I have tried this code:

replacements = pl.col("SKU")

for old, new in df_unique.items():
    replacements = replacements.str.replace_all(old, new)
df = df.select(replacements)

I get this error: SchemaError: Series of dtype: List(Utf8) != Utf8

I have tried converting the column values to strings, although I think that is redundant, but I get the same error:

df= df.with_column(
    pl.col('SKU').apply(lambda row: [str(x) for x in row])
    )

Any guidance on what I am doing wrong?

jqurious · Accepted Answer · 2024-09-24 03:39:19Z

It would help if you showed the actual dtype of the column.

It looks like you have "stringified" tuples, but it's not entirely clear. The example below assumes a real list[str] column:

import polars as pl

df = pl.DataFrame({
   "SKU": [["000000009100000749"], ["000000009100000749", "000000009100000776"]]
})

sku_to_code = {
    "000000009100000749": 1,
    "000000009100000785": 2,
    "000000009100002088": 3
}

>>> df
shape: (2, 1)
┌─────────────────────────────────────┐
│ SKU                                 │
│ ---                                 │
│ list[str]                           │
╞═════════════════════════════════════╡
│ ["000000009100000749"]              │
│ ["000000009100000749", "00000000... │
└─────────────────────────────────────┘

.list.eval() lets you run an expression against each list.

Inside list.eval, pl.element() refers to the individual elements of the list.

# Start from the list element and chain one literal replacement per SKU.
replace_sku = pl.element()
for old, new in sku_to_code.items():
    replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)

df.select(pl.col("SKU").list.eval(replace_sku))

shape: (2, 1)
┌─────────────────────────────┐
│ SKU                         │
│ ---                         │
│ list[str]                   │
╞═════════════════════════════╡
│ ["1"]                       │
│ ["1", "000000009100000776"] │
└─────────────────────────────┘

DBOak · Answer · last edited by jqurious 2024-09-24 03:39:48Z

1

Both solutions from jqurious and glebcom above work perfectly for the asked question.

I had not realized that df_unique is a list of dictionaries rather than a dict, so I had to tweak the solution accordingly. Here is the slightly modified solution from jqurious (the loop now iterates over the elements of the df_unique list of dicts):

replace_sku = pl.element()
for item in df_unique:
    old = item['SKU']
    new = item['code']
    replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)

df = df.select(pl.col("SKU").list.eval(replace_sku, parallel=True))

answered Jan 9, 2023 at 5:04 by DBOak · edited Sep 24, 2024 at 3:39 by jqurious

1 Comment

ecoe Over a year ago

If df_unique has many rows, this can overflow the stack, because chaining one replace_all per entry builds a very deep expression tree.

glebcom · Answer · last edited by jqurious 2024-09-24 03:41:25Z

1

Column SKU has the list[str] dtype, but you then call the .str attribute (here: replacements.str.replace_all(old, new)), which is for string columns. For columns with a list dtype, use the .list attribute and its corresponding methods.

You can use the solution below with .map_elements(), or the solution by jqurious, which is much faster because .list.eval() runs native Polars expressions (optionally in parallel) instead of calling a Python function for every row.

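import polars as pl

# Mapping from SKU string to integer code.
d = {"000000009100000749": 1, "000000009100000776": 2}
df = pl.DataFrame({
    "SKU": [["000000009100000749", "000000009100000776"]]
})

# Look up every SKU in each row's list; a SKU missing from d raises KeyError.
df = df.with_columns(
    pl.col("SKU").map_elements(
        lambda row: [d[i] for i in row]
    ).alias("SKU_replaced")
)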

answered Jan 7, 2023 at 20:47 by glebcom · edited Sep 24, 2024 at 3:41 by jqurious

3 Comments

DBOak Over a year ago

Thanks. This would work but I am stuck with the original col("SKU") in format list[str]. I am sure it is something trivial but unable to convert to string dtype. Last hurdle and help appreciated in advance.

glebcom Over a year ago

Okay, I fixed the code. You get the error because you call the .str attribute instead of .arr (your column has a list type; .arr has since been renamed .list).

DBOak Over a year ago

Thank you both! Took me a while to realize that df_unique is actually a list of dictionaries, instead of a dictionary. Needed slight tweak and I have added to the answer.
