Using regex to alter digits in pandas

Question

Background

I have the following df

import pandas as pd

df = pd.DataFrame({
    'Text': ['But the here is \nBase ID: 666666    \nDate is Here 123456 ',
             '999998 For \nBase ID: 123456    \nDate  there',
             'So so \nBase ID: 939393    \nDate hey the 123455 '],
    'ID': [1, 2, 3],
    'P_ID': ['A', 'B', 'C'],
})

Output

    ID  P_ID    Text
0   1   A   But the here is \nBase ID: 666666 \nDate is Here 123456
1   2   B   999998 For \nBase ID: 123456 \nDate there
2   3   C   So so \nBase ID: 939393 \nDate hey the 123455

Tried

I have tried the following to **BLOCK** the 6 digits in between \nBase ID: and \nDate

df['New_Text'] = df['Text'].str.replace('ID:(.+?)','ID:**BLOCK**')

And I get the following

  ID P_ID Text New_Text
0               But the here is \nBase ID:**BLOCK**666666 \nDate is Here 123456
1               999998 For \nBase ID:**BLOCK**123456 \nDate there
2               So so \nBase ID:**BLOCK**939393 \nDate hey the 123455

But I don't quite get what I want

Desired Output

  ID P_ID Text New_Text
0               But the here is \nBase ID:**BLOCK** \nDate is Here 123456
1               999998 For \nBase ID:**BLOCK** \nDate there
2               So so \nBase ID:**BLOCK** \nDate hey the 123455

Question

How do I tweak str.replace('ID:(.+?)','ID:**BLOCK**') part of my code to get my desired output?

Try ID:\s*(\S+)

– user557597, Commented Aug 18, 2019 at 22:11

Xukrao · Accepted Answer · 2019-08-18 22:11:51Z

1

df['New_Text'] = df['Text'].str.replace(r'ID: *\d+ *', 'ID:**BLOCK** ', regex=True)

See here for a detailed break-down of the used regex pattern.
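As a rough sketch of what the pattern in this answer does, the snippet below spells it out piece by piece using Python's standard `re` module in verbose mode. The sample strings are copied from the question; the names `texts`, `pattern`, and `blocked` are only illustrative. With `regex=True`, `Series.str.replace` applies the same kind of substitution to each element of the column.

```python
import re

# Sample strings copied from the question's DataFrame.
texts = [
    'But the here is \nBase ID: 666666    \nDate is Here 123456 ',
    '999998 For \nBase ID: 123456    \nDate  there',
    'So so \nBase ID: 939393    \nDate hey the 123455 ',
]

# The answer's pattern, written in verbose mode so each piece can be annotated.
# In verbose mode a literal space has to be written as [ ].
pattern = re.compile(r"""
    ID:     # the literal text ID:
    [ ]*    # zero or more spaces before the digits
    \d+     # one or more digits (the ID itself)
    [ ]*    # zero or more trailing spaces; the newline is not a space, so it stays
""", re.VERBOSE)

blocked = [pattern.sub('ID:**BLOCK** ', t) for t in texts]
for line in blocked:
    print(repr(line))
```

The digits at the start of the second string and at the end of the first and third are untouched because they are not preceded by `ID:`.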


Moshel · Accepted Answer · 2019-08-18 22:12:38Z

1

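Try:

df['New_Text'] = df['Text'].str.replace(r'ID:(.+?)\n', 'ID:**BLOCK**\n', regex=True)

A lazy quantifier such as .+? matches the shortest possible string. In your pattern nothing follows the group, so it matches a single character, which in your case is the space after "ID:". Adding \n after the group forces it to cover everything up to the newline.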
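The difference can be checked on a single string with the standard `re` module. This is only a small illustration on the first row of the question's data; the variable names are made up for the example.

```python
import re

# First row of the question's 'Text' column.
text = 'But the here is \nBase ID: 666666    \nDate is Here 123456 '

# Lazy ".+?" with nothing after it stops after one character:
# the space that follows "ID:".
lazy = re.search(r'ID:(.+?)', text)
print(repr(lazy.group(1)))

# With "\n" after the group, the lazy match has to extend to the first newline.
anchored = re.search(r'ID:(.+?)\n', text)
print(repr(anchored.group(1)))

# Replacing the whole match also removes the spaces in front of the newline.
result = re.sub(r'ID:(.+?)\n', 'ID:**BLOCK**\n', text)
print(repr(result))
```

Note that this variant leaves no space between `**BLOCK**` and the newline, which differs slightly from the spacing shown in the desired output.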


noufel13 · Accepted Answer · 2019-08-18 22:17:35Z

1

You can try the code below to get your desired output:

df['New_Text'] = df['Text'].str.replace(r'ID:\s+[0-9]+', 'ID:**BLOCK**', regex=True)

Output:

0    But the here is \nBase ID:**BLOCK**    \nDate is Here 123456 
1    999998 For \nBase ID:**BLOCK**    \nDate  there
2    So so \nBase ID:**BLOCK**    \nDate hey the 123455 

Regex Breakdown:

'\s+' - one or more whitespace characters (spaces, tabs, newlines)

'[0-9]+' - one or more digits
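One consequence of `\s+` is that at least one whitespace character must follow `ID:`. The sketch below uses the standard `re` module to show this; the second input, with no space after `ID:`, is a hypothetical case that does not appear in the question's data, and the variable names are only illustrative.

```python
import re

pattern = re.compile(r'ID:\s+[0-9]+')

# One space between "ID:" and the digits: the match is replaced and the
# trailing spaces before the newline are left in place.
with_space = pattern.sub('ID:**BLOCK**', 'Base ID: 666666    \nDate')
print(repr(with_space))

# Hypothetical input with no whitespace after "ID:". "\s+" needs at least
# one whitespace character, so nothing is replaced.
no_space = pattern.sub('ID:**BLOCK**', 'Base ID:666666\nDate')
print(repr(no_space))

# Using "\s*" makes the whitespace optional, so the same input is blocked.
optional_space = re.sub(r'ID:\s*[0-9]+', 'ID:**BLOCK**', 'Base ID:666666\nDate')
print(repr(optional_space))
```

If the data might contain IDs written without a space, `\s*` is the more forgiving choice.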
