
This is a follow-up to a question that was previously answered.

Have a large dataframe df that looks like this (list in column 'SKU')

| SKU                                                                  | Count | Percent     |
|----------------------------------------------------------------------|-------|-------------|
| "('000000009100000749',)"                                            | 110   | 0.029633621 |
| "('000000009100000749', '000000009100000776')"                       | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000776', '000000009100002260')" | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002260')" | 1     | 0.000269397 |
| "('000000009100000749', '000000009100000777', '000000009100002530')" | 1     | 0.000269397 |

Need to replace the values in the 'SKU' column with corresponding values from a dictionary df_unique that looks like this (please ignore format below, it is a dict):

skus str code i64
000000009100000749 1
000000009100000785 2
000000009100002088 3

I have tried this code:

replacements = pl.col("SKU")

for old, new in df_unique.items():
    replacements = replacements.str.replace_all(old, new)
df = df.select(replacements)

Get this error: SchemaError: Series of dtype: List(Utf8) != Utf8

I have tried to change the column values to string, although I think it is redundant, but I get the same error:

df= df.with_column(
    pl.col('SKU').apply(lambda row: [str(x) for x in row])
    )

Any guidance on what I am doing wrong?

3 Answers

It would help if you showed the actual list type of the column:

It looks like you have "stringified" tuples but it's not entirely clear.
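If the column really does hold stringified tuples such as "('000000009100000749',)", they would need to be parsed into real lists before any list operation applies. A minimal sketch in plain Python, assuming every value is a valid Python tuple literal; parse_sku_tuple is a hypothetical helper name, not part of Polars:

```python
import ast

def parse_sku_tuple(text):
    """Parse a stringified tuple of SKUs into a list of strings."""
    # literal_eval only evaluates Python literals, so it is safe on untrusted text
    value = ast.literal_eval(text)
    return [str(item) for item in value]

raw = [
    "('000000009100000749',)",
    "('000000009100000749', '000000009100000776')",
]
parsed = [parse_sku_tuple(s) for s in raw]
```

The parsed lists could then be used to build the list[str] column shown below.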

import polars as pl

df = pl.DataFrame({
   "SKU": [["000000009100000749"], ["000000009100000749", "000000009100000776"]]
})

sku_to_code = {
    "000000009100000749": 1,
    "000000009100000785": 2,
    "000000009100002088": 3
}
>>> df
shape: (2, 1)
┌─────────────────────────────────────┐
│ SKU                                 │
│ ---                                 │
│ list[str]                           │
╞═════════════════════════════════════╡
│ ["000000009100000749"]              │
│ ["000000009100000749", "00000000... │
└─────────────────────────────────────┘

.list.eval() allows you to run expressions on lists.

pl.element() can be used to refer to each element of the list inside .list.eval().

replace_sku = pl.element()
for old, new in sku_to_code.items():
    replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)
df.select(pl.col("SKU").list.eval(replace_sku))
shape: (2, 1)
┌─────────────────────────────┐
│ SKU                         │
│ ---                         │
│ list[str]                   │
╞═════════════════════════════╡
│ ["1"]                       │
│ ["1", "000000009100000776"] │
└─────────────────────────────┘

1

Both solutions from jqurious and glebcom above work perfectly for the asked question.

I had not realized that df_unique is a list of dictionaries rather than a dict, so I had to tweak the solution accordingly. Here is the slightly modified solution from jqurious (the loop now iterates over the elements of the df_unique list of dicts):

replace_sku = pl.element()
for item in df_unique:
    old = item['SKU']
    new = item['code']
    replace_sku = replace_sku.str.replace_all(old, str(new), literal=True)

df = df.select(pl.col("SKU").list.eval(replace_sku, parallel=True))

1 Comment

If df_unique has many rows, it is possible to overflow the stack though due to too large of an expression tree.
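One way to sidestep a long chain of replace_all calls is to build a single lookup dict from the records and resolve each SKU in one pass. The following is only a sketch in plain Python, assuming df_unique is a list of dicts with 'SKU' and 'code' keys as in the loop above; build_lookup and replace_skus are hypothetical helper names. Note that it matches whole SKUs exactly rather than substrings, and leaves unmatched SKUs unchanged.

```python
def build_lookup(records):
    """Turn a list of {'SKU': ..., 'code': ...} records into one dict."""
    return {item["SKU"]: str(item["code"]) for item in records}

def replace_skus(skus, lookup):
    """Replace each SKU with its code; keep SKUs that have no code."""
    return [lookup.get(sku, sku) for sku in skus]

# Sample records (assumed shape, values taken from the question)
df_unique = [
    {"SKU": "000000009100000749", "code": 1},
    {"SKU": "000000009100000785", "code": 2},
]
lookup = build_lookup(df_unique)
result = replace_skus(["000000009100000749", "000000009100000776"], lookup)
```

A function like replace_skus could presumably be applied per row with .map_elements(), at the cost of running Python code for every row.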
1

Column SKU has the list[str] dtype, but you then call the .str attribute (here: replacements.str.replace_all(old, new)), which is for strings. For columns with a list dtype, use the .list attribute and its corresponding methods.

You can use the solution below with .map_elements(), or the solution by jqurious, which is much faster (because .list.eval() lets the expressions run in parallel).

import polars as pl

# Mapping from SKU string to integer code
d = {"000000009100000749": 1, "000000009100000776": 2}
df = pl.DataFrame({
    "SKU": [["000000009100000749", "000000009100000776"]]
})

# Look up every SKU in each row's list in the dict
df = df.with_columns(
    pl.col("SKU").map_elements(
        lambda row: [d[i] for i in row]
    ).alias("SKU_replaced")
)
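Note that d[i] raises a KeyError for any SKU that is missing from the dict, which the sample data in the question suggests can happen. A small sketch of the same per-row lookup using dict.get instead; codes_for is a hypothetical helper name, and returning None for unknown SKUs is an assumption about the desired behaviour:

```python
d = {"000000009100000749": 1, "000000009100000776": 2}

def codes_for(row, mapping):
    """Look up each SKU; unknown SKUs become None instead of raising."""
    return [mapping.get(sku) for sku in row]

row = ["000000009100000749", "000000009100002260"]
codes = codes_for(row, d)

# The strict lookup used in the lambda above fails on the unknown SKU
try:
    strict = [d[sku] for sku in row]
except KeyError as exc:
    missing = exc.args[0]
```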

3 Comments

Thanks. This would work but I am stuck with the original col("SKU") in format list[str]. I am sure it is something trivial but unable to convert to string dtype. Last hurdle and help appreciated in advance.
Okay, I fixed the code. You get the error because you call the .str attribute instead of .arr (renamed to .list in newer Polars versions), since your column has a list type.
Thank you both! Took me a while to realize that df_unique is actually a list of dictionaries, instead of a dictionary. Needed slight tweak and I have added to the answer.
