I have fairly messy data, and I am trying to replace rows that contain only one word (a single string with no whitespace) with '' (an empty string).

Here is the original data:

import pandas as pd

df = pd.DataFrame({'some_text': [
        'I enjoy read Mark Twain\'s Books',
        'Library is very useful',
        '/',
        '\\',
        '/ /',
        '',
        'I enjoy read Mark Twain\'s Books',
        'an',
        'the',
        'Books are interesting'
]})

I tried this, but it drops the rows. I don't want to drop the rows, just replace their contents:

count = df['some_text'].str.split().str.len()
df[~(count==1)]

Final output needed:

I enjoy read Mark Twain's Books
Library is very useful


/ /

I enjoy read Mark Twain's Books


Books are interesting

3 Answers

You can use a simple regex here:

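# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['new_text'] = df['some_text'].str.replace(r'^\S+$', '', regex=True)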
>>> df
                         some_text                         new_text
0  I enjoy read Mark Twain's Books  I enjoy read Mark Twain's Books
1           Library is very useful           Library is very useful
2                                /                                 
3                                \                                 
4                              / /                              / /
5                                                                  
6  I enjoy read Mark Twain's Books  I enjoy read Mark Twain's Books
7                               an                                 
8                              the                                 
9            Books are interesting            Books are interesting

1 Comment

Note that this regex will not replace strings that have only one word but which also have leading or trailing whitespace, though it could be modified to do so if desired.
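
One possible modification along those lines, as a minimal sketch: allow optional whitespace on either side of the single word. The sample frame and the pattern `^\s*\S+\s*$` are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

# Small illustrative frame (assumed data), including a padded single word
df = pd.DataFrame({'some_text': ['  an ', 'the', 'Books are interesting', '/ /', '']})

# ^\s*\S+\s*$ matches one run of non-whitespace characters,
# optionally surrounded by leading/trailing whitespace
df['new_text'] = df['some_text'].str.replace(r'^\s*\S+\s*$', '', regex=True)
print(df['new_text'].tolist())
```

Strings with two or more words, and strings that are already empty, do not match the pattern and are left unchanged.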

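With the implementation you already have, instead of dropping the rows, assign a new value to them like this:

count = df['some_text'].str.split().str.len()
# Use .loc with the column name so only 'some_text' is overwritten, not every column
df.loc[count == 1, 'some_text'] = ''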
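
If the original column should stay untouched, the same word count can feed `Series.mask`, which replaces values where a condition is true and keeps the rest. This is a hedged sketch on assumed sample data; the `new_text` column name is only for illustration.

```python
import pandas as pd

# Small illustrative frame (assumed data)
df = pd.DataFrame({'some_text': ['Library is very useful', '/', '/ /', '', 'an']})

# Number of whitespace-separated words per row (0 for the empty string)
count = df['some_text'].str.split().str.len()

# Write the result to a new column; rows with exactly one word become ''
df['new_text'] = df['some_text'].mask(count == 1, '')
print(df)
```

Because the result goes to a separate column, the two columns can be compared side by side before overwriting anything.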


You can apply the transformation to the column without a mask:

df['replaced_text'] = df['some_text'].apply(lambda x: '' if len(x.strip().split()) == 1 else x)
print(df.to_string())

                         some_text                    replaced_text
0  I enjoy read Mark Twain's Books  I enjoy read Mark Twain's Books
1           Library is very useful           Library is very useful
2                                /                                 
3                                \                                 
4                              / /                              / /
5                                                                  
6  I enjoy read Mark Twain's Books  I enjoy read Mark Twain's Books
7                               an                                 
8                              the                                 
9            Books are interesting            Books are interesting

Much like your own attempt, the lambda function strips surrounding whitespace from each string, splits it into words, and replaces the string with '' when exactly one word remains.
