I have a large three-column dataframe of this form:

Ref    Colourref      Shaperef      
5      red 12         square 15
9      14 blue        (circle14,2)  
10     6 orange 12    18 square
12     pink1,7        [oval] [40]
14     [green]        (rectsq#12,6)
...

And a long list with entries like this:

li = [
    'oval 60 [oval] [40]', 
    '(circle14,2) circ', 
    'square 20', 
    '126 18 square 921#',
]

I want to replace the entries in the Shaperef column of the df with a value from the list if the full Shaperef string matches any part of any list item. If there is no match, the entry is not changed.

Desired output:

Ref    Colourref      Shaperef      
5      red 12         square 15
9      14 blue        (circle14,2) circ  
10     6 orange 12    126 18 square 921#
12     pink1,7        oval 60 [oval] [40]
14     [green]        (rectsq#12,6)
...

So refs 9, 10 and 12 are updated because each has a partial match with a list item. Refs 5 and 14 stay as they are.

1 Answer

If the Shaperef values and all the entries in li are strings, you can write a function and apply it over Shaperef to replace them:

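def f(row_val, seq):
    """Return the first item in seq that contains row_val, else row_val unchanged."""
    for item in seq:
        # Plain substring test, so brackets and other symbols need no escaping
        if row_val in item:
            return item
    return row_val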

Then:

# read in your example
import pandas as pd
from io import StringIO

s = """Ref    Colourref      Shaperef      
5      red 12         square 15
9      14 blue        (circle14,2)  
10     6 orange 12    18 square
12     pink1,7        [oval] [40]
14     [green]        (rectsq#12,6)
"""
li = [
    "oval 60 [oval] [40]",
    "(circle14,2) circ",
    "square 20",
    "126 18 square 921#",
]
df = pd.read_csv(StringIO(s), sep=r"\s\s+", engine="python")

# Apply the function here:
df["Shaperef"] = df["Shaperef"].apply(lambda v: f(v, li))
#    Ref    Colourref             Shaperef
# 0    5       red 12            square 15
# 1    9      14 blue    (circle14,2) circ
# 2   10  6 orange 12   126 18 square 921#
# 3   12      pink1,7  oval 60 [oval] [40]
# 4   14      [green]        (rectsq#12,6)

This may not be very fast: each Shaperef value is checked against the list items until one matches, so the worst case performs len(df) * len(li) substring checks.
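If Shaperef contains many repeated values, one way to cut down the work is to run the lookup once per distinct value and then map the results back. This is only a sketch under that assumption; the small df and the name lookup are made up for illustration, and it does not help when every value is unique.

```python
import pandas as pd

def f(row_val, seq):
    # Same function as above: first list item containing row_val, else row_val
    for item in seq:
        if row_val in item:
            return item
    return row_val

# Illustrative data with a repeated Shaperef value (not the real df)
df = pd.DataFrame(
    {"Shaperef": ["square 15", "(circle14,2)", "18 square", "(circle14,2)"]}
)
li = [
    "oval 60 [oval] [40]",
    "(circle14,2) circ",
    "square 20",
    "126 18 square 921#",
]

# Build the replacement once per distinct value, then map it onto the column
lookup = {v: f(v, li) for v in df["Shaperef"].unique()}
df["Shaperef"] = df["Shaperef"].map(lookup)
```

The number of substring checks then scales with the number of distinct Shaperef values rather than the number of rows. Like f itself, this assumes every value is a string.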


6 Comments

Thanks. This works great with my example data, but when applied to the actual df I get this: "initial_value must be str or None, not DataFrame". Any idea what might be causing that?
That's a StringIO error. What is the line that's causing this error?
df2 = pd.read_csv(StringIO(df), sep=r'\s\s+', engine='python') where df is the original df table
You can't create a new DataFrame like that. That part of the example was only for reading in your example data. The only lines you should need are df["Shaperef"] = df["Shaperef"].apply(lambda v: f(v, li)) and the function f
The final line of the function f, return row_val just says if there wasn't a match, don't change it. You can change that to be return row_val + ", !!NO MATCH!!" and that should do it.
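Putting the last two comments together, applying the function to an existing DataFrame and flagging rows with no match might look roughly like this. It is a sketch only: f_flag, its marker parameter and the two-row df are hypothetical stand-ins, not part of the original answer.

```python
import pandas as pd

def f_flag(row_val, seq, marker=", !!NO MATCH!!"):
    # Hypothetical variant of f: append a marker when no list item contains row_val
    for item in seq:
        if row_val in item:
            return item
    return row_val + marker

# Stand-in for a DataFrame that already exists; no StringIO or read_csv is needed
df = pd.DataFrame({"Ref": [5, 9], "Shaperef": ["square 15", "(circle14,2)"]})
li = ["(circle14,2) circ", "square 20"]

df["Shaperef"] = df["Shaperef"].apply(lambda v: f_flag(v, li))
```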