Good morning. I have a question: using web scraping, I extract information as a string like this:

"Issued May2018No expiration date"

I want to split this string into two strings using a regular expression. My idea is: wherever 4 digits are followed by "No", insert a separator to create the following string:

"Issued May2018 - No expiration date".

That way, I can apply the "split" method on "-" to get two strings:

  • Issued May2018
  • No expiration date

I was thinking of using the regex

\d\d\d\dNo

and it should match 2018No, but I don't know how to proceed so that I can replace it with

May2018 - No expiration date 

and set the stage for using the split function.

Any suggestions? Other approaches are welcome.

3 Answers

You can use a capture group to capture 4 digits, followed by a literal No and a space.

In the replacement, use the value of capture group 1 followed by - No.

import re

s = "Issued May2018No expiration date"
pattern = r"(\d{4})No "
print(re.sub(pattern, r"\1 - No ", s))

Output

Issued May2018 - No expiration date
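To finish the step the question describes, the substituted string can then be split on the inserted separator. This is only a minimal sketch: it assumes the separator " - " does not otherwise occur in the text, and the names joined, issued and expiry are illustrative.

```python
import re

s = "Issued May2018No expiration date"

# Insert " - " between the four-digit year and the following "No ".
joined = re.sub(r"(\d{4})No ", r"\1 - No ", s)

# Split once on the inserted separator to get the two parts.
issued, expiry = joined.split(" - ", maxsplit=1)

print(issued)  # Issued May2018
print(expiry)  # No expiration date
```

Limiting the split with maxsplit=1 keeps the second part intact even if it happens to contain another " - ".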



5 Comments

+ But I think the OP could use re.findall() instead. The intention is to first adjust the string and then split it on the hyphen, but it can be done in a single step, no?
Thanks for your reply. I have another question: is there a way to write an if statement so that, if that regex does not match, another regex is tried? For instance, the extracted information can sometimes look like "Issued May2018 • No expiration date", so the regex does not match in that case.
@Iacopo_Biondini you don't really need 2 patterns for that; you can use, for example, (\d{4})\W*No regex101.com/r/KxudVk/1
@JvdV If the OP wants 2 capture groups, then I would use something along the lines of (.*?\d{4})\W*(No .*) regex101.com/r/vA11kc/1
@Thefourthbird, yes, I think that would work too! It would save another split() operation, since the end goal seems to be those two strings as per the sample data.
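The two-group pattern from the comments above could be used roughly as follows. This is a sketch, not a definitive solution: the helper name split_issued is hypothetical, and it assumes the second part always starts with "No " as in the sample data.

```python
import re

# Lazy prefix up to a four-digit year, optional non-word separators
# (spaces, a bullet, ...), then the part starting with "No ".
PATTERN = re.compile(r"(.*?\d{4})\W*(No .*)")


def split_issued(text):
    """Return (issued, expiry), or None when the pattern does not match."""
    m = PATTERN.fullmatch(text)
    return m.groups() if m else None


print(split_issued("Issued May2018No expiration date"))
# ('Issued May2018', 'No expiration date')
print(split_issued("Issued May2018 • No expiration date"))
# ('Issued May2018', 'No expiration date')
print(split_issued("no date here"))
# None
```

Because the helper returns None when nothing matches, the caller can test the result in an if statement and try a different pattern only when this one fails.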

Use re.sub.

In the string passed to the repl parameter of re.sub(), \g<1> refers to the text matched by capture group 1.

import re

s = "Issued May2018No expiration date"
# Raw strings keep \d and \g from being treated as string escape sequences.
print(re.sub(r"(\d{4})(No)", r"\g<1> - \g<2>", s))

# 'Issued May2018 - No expiration date'

import re

string = "Issued May2018No expiration date"

m = re.findall(r"^(.*[0-9]{4})(No.*)$", string)

print(m[0][0] + " - " + m[0][1])

->

Issued May2018 - No expiration date

2 Comments

I think this is what OP would be after since it would save him another split operation too. However, you might want to just use re.findall() then. >> print(re.findall(r'^(.*\d{4})(No.*)$', s))
@JvdV agreed, I have updated the answer
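Another way to get the two strings in one step is to split directly at the boundary with lookarounds, without inserting a separator first. This sketch assumes Python 3.7 or later, where re.split() can split on an empty match.

```python
import re

s = "Issued May2018No expiration date"

# Split at the empty position that is preceded by four digits
# and followed by "No"; nothing is consumed from the string.
parts = re.split(r"(?<=\d{4})(?=No)", s)

print(parts)  # ['Issued May2018', 'No expiration date']
```

If the boundary is not found, parts contains the original string as its only element, so checking len(parts) is a simple way to detect that case.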
