Removing Strings from a Pandas DataFrame Column

Question

I have a pandas dataframe as shown below.

DF1 =

sid   path
 1    '["rome","is","in","province","lazio"]'
 1    "['rome', 'is', 'in', 'province', 'naples']"
 1    ['N']
 1    "['rome', 'is', 'in', 'province', 'in', 'campania']"
 ....

I want to remove all the unnecessary characters from the path column, so that the result looks like this:

DF2 =

    sid                  path
     1         rome is in province lazio
     1         rome is in province naples
     1                    N
     1         rome is in province in campania
 ....

I tried replacing all the unnecessary characters like this:

 DF1["path"].replace("[","").replace("]","").replace('"',"").replace(","," ").replace("'","")

But it didn't work. I suppose it's due to entries like ['N'].

How can I do this? Any help is appreciated!

Why is ['N'] not quoted? Is it a list containing a string, or is it supposed to be "['N']"?
– hilberts_drinking_problem, commented Jun 18, 2018 at 15:02
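
A note on the attempt in the question: Series.replace without regex=True matches whole cell values, not substrings, so replace("[", "") only changes cells that are exactly "[". The sketch below shows that, and shows a substring-based alternative with Series.str.replace. It assumes every entry is a string; the two sample rows and the regular expressions are illustrative choices, and a genuine list such as ['N'] is not handled by this approach.

```python
import pandas as pd

# Sample data modelled on the question; every entry here is a string.
df = pd.DataFrame({
    "sid": [1, 1],
    "path": ['["rome","is","in","province","lazio"]',
             "['rome', 'is', 'in', 'province', 'naples']"],
})

# Series.replace compares against the whole cell value, so nothing changes.
unchanged = df["path"].replace("[", "")

# Series.str.replace operates on substrings of each cell.
cleaned = (
    df["path"]
    .str.replace(r"[\[\]'\"]", "", regex=True)  # drop brackets and quotes
    .str.replace(r"\s*,\s*", " ", regex=True)   # comma (plus spaces) -> one space
)
print(cleaned.tolist())
```

Note also that the result has to be assigned back (for example df["path"] = cleaned), since these methods return a new Series.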

Rakesh · Accepted Answer · 2018-06-18 15:10:40Z

1

Using ast.literal_eval & str.join

Demo:

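import ast

import pandas as pd

df = pd.DataFrame({"path": ['["rome","is","in","province","lazio"]', "['rome', 'is', 'in', 'province', 'naples']", ['N']]})
# astype(str) turns the genuine list ['N'] into the string "['N']", so every cell can be parsed.
# literal_eval then rebuilds each list, and str.join flattens it into a single string.
df['path'] = df['path'].astype(str).apply(ast.literal_eval).apply(lambda x: " ".join(x))
print(df)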

Output:

                         path
0   rome is in province lazio
1  rome is in province naples
2                           N

edited Jun 18, 2018 at 15:10

answered Jun 18, 2018 at 15:06

Rakesh


1 Comment

jpp Over a year ago

Yup that might work, though a bit of a roundabout way since you apply list -> str -> list.

jpp · Accepted Answer · 2018-06-18 15:05:27Z

1

You can use ast.literal_eval to safely parse lists that have been stored as strings. One way to account for genuine lists is to catch the ValueError that literal_eval raises when it is given something other than a string.

Note that, if at all possible, you should try to fix these issues upstream, before the data reaches your dataframe.

from ast import literal_eval

import pandas as pd

df = pd.DataFrame({'sid': [1, 1, 1, 1],
                   'path': ['["rome","is","in","province","lazio"]',
                            "['rome', 'is', 'in', 'province', 'naples']",
                            ['N'],
                            "['rome', 'is', 'in', 'province', 'in', 'campania']"]})

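def converter(x):
    try:
        # x is a string such as "['a', 'b']": parse it into a list, then join
        return ' '.join(literal_eval(x))
    except ValueError:
        # x is already a list, which literal_eval rejects: join it directly
        return ' '.join(x)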

df['path'] = df['path'].apply(converter)

print(df)

                              path  sid
0        rome is in province lazio    1
1       rome is in province naples    1
2                                N    1
3  rome is in province in campania    1
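
As a possible variant of the converter above, the type can be checked explicitly instead of relying on the exception. This is only a sketch under the assumption that each cell is either a list of strings or a string that spells out such a list; the name to_sentence is hypothetical.

```python
from ast import literal_eval


def to_sentence(value):
    """Join the words of a list, or of a string that spells out a list."""
    if isinstance(value, str):
        value = literal_eval(value)  # parse "['a', 'b']" into ['a', 'b']
    return " ".join(value)


# One string-encoded list and one genuine list, as in the question.
samples = ['["rome","is","in","province","lazio"]', ['N']]
print([to_sentence(v) for v in samples])
```

With the dataframe above it would be applied in the same way as converter, that is df['path'].apply(to_sentence). Unlike the try/except version, a malformed string would raise here rather than fall through to the join branch.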

edited Jun 18, 2018 at 15:05

answered Jun 18, 2018 at 15:04

jpp


2 Comments

Bubble Bubble Bubble Gut Over a year ago

Is there any difference between ast.literal_eval and plain eval?

jpp Over a year ago

@BubbleBubbleBubbleGut, Yes, eval is unsafe and not recommended.
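
To illustrate the difference mentioned in this comment, here is a minimal sketch: literal_eval only accepts Python literals (strings, numbers, lists, dicts and so on) and rejects anything that would require running code, whereas eval would execute it. The expression string is a made-up example.

```python
from ast import literal_eval

# A literal list is parsed without executing anything.
parsed = literal_eval("['rome', 'is', 'in']")
print(parsed)

# An expression containing a function call is rejected with ValueError;
# passing the same string to eval() would run the call.
expression = "__import__('os').getcwd()"
try:
    literal_eval(expression)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```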
