Background

I have the following df

import pandas as pd

df = pd.DataFrame({
    'Text': ['But the here is \nBase ID: 666666    \nDate is Here 123456 ',
             '999998 For \nBase ID: 123456    \nDate  there',
             'So so \nBase ID: 939393    \nDate hey the 123455 '],
    'ID': [1, 2, 3],
    'P_ID': ['A', 'B', 'C'],
})

Output

    ID  P_ID    Text
0   1   A   But the here is \nBase ID: 666666 \nDate is Here 123456
1   2   B   999998 For \nBase ID: 123456 \nDate there
2   3   C   So so \nBase ID: 939393 \nDate hey the 123455

Tried

I tried the following to replace the 6 digits between \nBase ID: and \nDate with **BLOCK**:

df['New_Text'] = df['Text'].str.replace('ID:(.+?)','ID:**BLOCK**')

And I get the following

  ID P_ID Text New_Text
0               But the here is \nBase ID:**BLOCK**666666 \nDate is Here 123456
1               999998 For \nBase ID:**BLOCK**123456 \nDate there
2               So so \nBase ID:**BLOCK**939393 \nDate hey the 123455

But I don't quite get what I want

Desired Output

  ID P_ID Text New_Text
0               But the here is \nBase ID:**BLOCK** \nDate is Here 123456
1               999998 For \nBase ID:**BLOCK** \nDate there
2               So so \nBase ID:**BLOCK** \nDate hey the 123455

Question

How do I tweak str.replace('ID:(.+?)','ID:**BLOCK**') part of my code to get my desired output?

Comment (Aug 18, 2019 at 22:11): Try ID:\s*(\S+)

3 Answers

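# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['New_Text'] = df['Text'].str.replace(r'ID: *\d+ *', 'ID:**BLOCK** ', regex=True)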

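Pattern breakdown: 'ID:' matches the literal label, ' *' matches zero or more spaces, '\d+' matches one or more digits, and the trailing ' *' consumes any spaces after the number. The replacement then puts a single space back before the newline.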
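As a quick check of the pattern 'ID: *\d+ *', here is a minimal sketch that applies it with the standard-library re module to the three sample strings from the question. It assumes that Series.str.replace(..., regex=True) performs the same per-row substitution; the names texts, pattern and masked are illustrative only.

```python
import re

# The three 'Text' values from the question's DataFrame.
texts = [
    "But the here is \nBase ID: 666666    \nDate is Here 123456 ",
    "999998 For \nBase ID: 123456    \nDate  there",
    "So so \nBase ID: 939393    \nDate hey the 123455 ",
]

# 'ID:' + optional spaces + digits + optional trailing spaces.
# A plain space (not \s) is used, so the newline after the ID is never consumed.
pattern = re.compile(r"ID: *\d+ *")

# Replace each match and restore exactly one space before the newline.
masked = [pattern.sub("ID:**BLOCK** ", t) for t in texts]

for line in masked:
    print(repr(line))
```

Only the digits that follow 'ID:' are replaced; the other 6-digit numbers in the text (999998, 123456, 123455) are left alone because they are not preceded by 'ID:'.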


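Try anchoring the pattern on the newline:

df['New_Text'] = df['Text'].str.replace('ID:(.+?)\n', 'ID:**BLOCK**\n', regex=True)

A lazy quantifier such as '.+?' matches the shortest possible string. With nothing after it in your pattern, that is just the single space following 'ID:', so only the space gets replaced and the digits remain. Ending the pattern with '\n' forces the match to extend to the end of the line.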
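The shortest-match behaviour can be seen directly with the standard-library re module. This is a small sketch on the first sample string from the question; the variable names are illustrative.

```python
import re

text = "But the here is \nBase ID: 666666    \nDate is Here 123456 "

# Lazy '.+?' with nothing after it: the engine stops after one character,
# which here is the space that follows 'ID:'.
lazy_match = re.search(r"ID:(.+?)", text).group(0)

# With '\n' at the end of the pattern, the lazy group has to keep growing
# until it reaches the newline, so the digits and trailing spaces are included.
anchored = re.sub(r"ID:(.+?)\n", "ID:**BLOCK**\n", text)

print(repr(lazy_match))
print(repr(anchored))
```

Note that this variant also removes the spaces between the digits and the newline, so the result has no space before '\nDate'.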


You can try the code below to get your desired output:

df['New_Text'] = df['Text'].str.replace(r'ID:\s+[0-9]+', 'ID:**BLOCK**', regex=True)

Output:

0    But the here is \nBase ID:**BLOCK**    \nDate is Here 123456 
1    999998 For \nBase ID:**BLOCK**    \nDate  there
2    So so \nBase ID:**BLOCK**    \nDate hey the 123455 

Regex Breakdown:

'\s+' - one or more whitespace characters (spaces, tabs, newlines)

'[0-9]+' - one or more digits
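One detail worth checking with this pattern is what happens to the spaces after the number. The sketch below, using the standard-library re module on the third sample string, shows that they are left in place, and that appending '\s*' to remove them would also consume the newline, because '\s' matches '\n'. The variable names are illustrative.

```python
import re

text = "So so \nBase ID: 939393    \nDate hey the 123455 "

# Pattern from the answer: the four spaces after the digits are not matched,
# so they remain in the result.
kept = re.sub(r"ID:\s+[0-9]+", "ID:**BLOCK**", text)

# Adding '\s*' removes the trailing spaces, but '\s' also matches '\n',
# so the line break before 'Date' disappears as well.
swallowed = re.sub(r"ID:\s+[0-9]+\s*", "ID:**BLOCK**", text)

print(repr(kept))
print(repr(swallowed))
```

If the trailing spaces should go but the newline should stay, matching a literal space (' *') instead of '\s*' avoids the problem.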
