Remove a list of strings from a series of strings

Question

Goal: remove every string in my list, strings_2_remove, from each entry of a series. I have a list of strings like so:

strings_2_remove = [
"dogs are so cool",
"cats have cute toe beans"
]

I also have a series of strings that looks like this:

df.Sentences.head()

0    dogs are so cool because they are nice and funny 
1    many people love cats because cats have cute toe beans
2    hamsters are very small and furry creatures
3    i got a dog because i know dogs are so cool because they are nice and funny
4    birds are funny when they dance to music, they bop up and down
Name: Summary, dtype: object

The outcome after removing the strings in the list from the series should look like this:

    0    because they are nice and funny 
    1    many people love cats because 
    2    hamsters are very small and furry creatures
    3    i got a dog because i know because they are nice and funny
    4    birds are funny when they dance to music, they bop up and down
    Name: Summary, dtype: object

I tried the following to get the output I want:

mask_1 = (df.Sentences == strings_2_remove)
df.loc[mask_1, 'df.Sentences'] = " "

However, it is not achieving my goal.

Any suggestions?

advance512 · Accepted Answer · 2019-04-17 16:26:14Z

Try:

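result = df.Sentences
for string_to_remove in strings_2_remove:
    # .str.replace removes the literal substring from each entry;
    # Series.replace with regex=False would only match whole values
    result = result.str.replace(string_to_remove, '', regex=False)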

There are more performant solutions that combine all the strings into a single regular expression.
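
For reference, here is a small sketch contrasting the two literal-replacement calls in pandas on sample data from the question. The variable names `sentences` and `whole_value` are made up for illustration. `Series.replace` with `regex=False` compares whole values, while `Series.str.replace` with `regex=False` removes a literal substring.

```python
import pandas as pd

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]
sentences = pd.Series([
    "dogs are so cool because they are nice and funny",
    "many people love cats because cats have cute toe beans",
    "hamsters are very small and furry creatures",
])

# Series.replace with regex=False only matches entries equal to the whole
# string, so no entry changes here
whole_value = sentences.replace(strings_2_remove[0], "", regex=False)

# Series.str.replace with regex=False removes the literal substring
result = sentences
for string_to_remove in strings_2_remove:
    result = result.str.replace(string_to_remove, "", regex=False)

print(result.tolist())
```

Note that the surrounding spaces are left in place, so some entries start or end with a space.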

Chris Adams · Accepted Answer · 2019-04-17 16:36:30Z

Use Series.replace:

df.Sentences.replace('|'.join(strings_2_remove), '', regex=True)

0                      because they are nice and funny
1                       many people love cats because 
2          hamsters are very small and furry creatures
3    i got a dog because i know  because they are n...
4    birds are funny when they dance to music, they...
Name: Sentences, dtype: object
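
One caveat with joining the strings using '|': the result is interpreted as a regular expression, so any character such as `(`, `+` or `.` in a string to remove would be treated as regex syntax. A minimal sketch, assuming the strings should always be matched literally, escapes each one with `re.escape` first. The names `sentences` and `pattern` are illustrative, not from the answer.

```python
import re
import pandas as pd

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]
sentences = pd.Series([
    "dogs are so cool because they are nice and funny",
    "many people love cats because cats have cute toe beans",
    "hamsters are very small and furry creatures",
    "i got a dog because i know dogs are so cool because they are nice and funny",
    "birds are funny when they dance to music, they bop up and down",
])

# Escape each string so it is matched literally, then join into one alternation
pattern = "|".join(re.escape(s) for s in strings_2_remove)
result = sentences.replace(pattern, "", regex=True)

print(result.tolist())
```

As in the output above, removing a phrase from the middle of a sentence leaves two adjacent spaces behind.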

5 Comments

iamklaus Over a year ago

Hey, do you know what causes these misaligned strings? I got the same output.

Chris Adams Over a year ago

It's just the pandas display settings: the default is right-aligned. No whitespace padding is added to the strings themselves.

iamklaus Over a year ago

I knew about the whitespace. How can I change this for a better display?

Chris Adams Over a year ago

stackoverflow.com/questions/17232013/…

Chris Adams Over a year ago

back at you buddy
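
The display point made in the comments can be checked with a short sketch. It prints the raw values to confirm that they carry no leading padding, then pads each string on the right with `Series.str.ljust` so that the printed series looks left-aligned. This is only one possible approach and not necessarily the one in the linked question. The names `result`, `width` and `padded` are illustrative.

```python
import pandas as pd

result = pd.Series([
    "hamsters are very small and furry creatures",
    "birds are funny when they dance to music, they bop up and down",
])

# The stored values have no leading spaces; the alignment is display-only
print([repr(v) for v in result])

# Pad every string on the right to a common width for left-aligned printing
width = int(result.str.len().max())
padded = result.str.ljust(width)
print(padded.to_string())
```

The padded copy is only meant for printing; the original series is left unchanged.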

iamklaus · Accepted Answer · 2019-04-17 16:37:32Z

import re
df.Sentences.apply(lambda x: re.sub('|'.join(strings_2_remove), '', x))
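
The one-liner above rebuilds the joined pattern for every row. A minimal sketch of a variant that compiles the pattern once and reuses it is shown below. The names `sentences` and `pattern` are illustrative, and the use of `re.escape` is an added assumption that the strings should be matched literally.

```python
import re
import pandas as pd

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]
sentences = pd.Series([
    "dogs are so cool because they are nice and funny",
    "many people love cats because cats have cute toe beans",
])

# Build and compile the alternation once instead of once per row
pattern = re.compile("|".join(map(re.escape, strings_2_remove)))
result = sentences.apply(lambda x: pattern.sub("", x))

print(result.tolist())
```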

edited Apr 17, 2019 at 16:37

answered Apr 17, 2019 at 16:32

Valdi_Bo · Accepted Answer · 2019-04-17 17:01:16Z

I created the test DataFrame as follows:

df = pd.DataFrame({ 'Summary':[
    'dogs are so cool because they are nice and funny',
    'many people love cats because cats have cute toe beans',
    'hamsters are very small and furry creatures',
    'i got a dog because i know dogs are so cool because they are nice and funny',
    'birds are funny when they dance to music, they bop up and down']})

The first step is to convert strings_2_remove to a list of compiled patterns (this requires importing re):

import re
pats = [re.compile(s + ' *') for s in strings_2_remove]

Note that each pattern is supplemented with ' *', which matches any spaces that follow the phrase. Otherwise the result string could contain two adjacent spaces. As far as I can see, the other solutions missed this detail.

Then define a function to be applied:

def fn(txt):
    # Apply every pattern in turn so that all listed phrases are removed
    for pat in pats:
        txt = pat.sub('', txt)
    return txt

For each pattern, the function replaces every match in the source string with an empty string. A string that contains none of the phrases is returned unchanged.

The only thing left is to apply this function:

df.Summary.apply(fn)
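
Putting the steps together, here is a runnable sketch on the test DataFrame. The function name `remove_phrases` is hypothetical, and `re.escape` is an addition not in the answer, assuming the phrases should be matched literally.

```python
import re
import pandas as pd

strings_2_remove = ["dogs are so cool", "cats have cute toe beans"]
df = pd.DataFrame({"Summary": [
    "dogs are so cool because they are nice and funny",
    "many people love cats because cats have cute toe beans",
    "hamsters are very small and furry creatures",
    "i got a dog because i know dogs are so cool because they are nice and funny",
    "birds are funny when they dance to music, they bop up and down",
]})

# Each pattern matches the literal phrase plus any spaces that follow it
pats = [re.compile(re.escape(s) + " *") for s in strings_2_remove]

def remove_phrases(txt):
    """Remove every listed phrase (and its trailing spaces) from txt."""
    for pat in pats:
        txt = pat.sub("", txt)
    return txt

result = df.Summary.apply(remove_phrases)
print(result.tolist())
```

A phrase at the very end of a sentence still leaves the space that preceded it, so a final `result.str.strip()` may be useful if trailing whitespace matters.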
