3

I am trying to extract the wanted text from a given piece of text. I have written the function below.

    import re

    def extract_name(title):
        matches = re.findall(r'\b[A-Z0-9\s&.,()-]+(?:\s*\(\d\))?\b', title)
        return ', '.join(matches) if matches else None

However, it leaves unwanted "(, ," fragments in the output for some titles. For example, my titles look like this:

THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)

Expected outcome: THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED

3 Answers

2

Instead of using re.findall(), I recommend using re.sub() to remove the unwanted parts. With the pattern below you explicitly define what you want to keep and what you want to remove, and you can add further alternatives as needed.

In this pattern, you first match (and capture) what you want to keep, and then match what you do not want to keep. The replacement is \1: a part you want to keep is replaced with itself, while for an unwanted part group 1 did not take part in the match, so \1 is empty (Python 3.5+) and the part is effectively deleted. At each position the regex engine tries the alternatives from left to right, so the second alternative is only tried when the first one fails to match there.

REGEX PATTERN (Python flavor):

([ ]?\(\d+\))|[ ]?\([^)]*\)

Regex demo: https://regex101.com/r/Peu1Fw/4

CODE PYTHON (with re module):

title = 'THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)'

import re
pattern = r'([ ]?\(\d+\))|[ ]?\([^)]*\)'
replacement = r'\1'
updated_title = re.sub(pattern, replacement, title)

print(f'OLD: "{title}"')
print(f'NEW: "{updated_title}"')
print('EXP: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"')

RESULT:

OLD: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)"
NEW: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"
EXP: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"

REGEX PATTERN NOTES:

  • Begin first alternative to capture what you want to keep:
  • ( Begin first capture group (...), group 1. Referred to as \1 in the replacement string.
    • [ ]? Match one literal space character 0 or 1 times (?)
    • \( Match literal (
    • \d+ Match digit 1 or more times (+).
    • \) Match literal )
  • ) End group 1 (\1).
  • | OR in alternation, ...|....
  • Begin second alternative to match what you want deleted:
  • [ ]? Match one literal space character 0 or 1 times (?)
  • \( Match literal (
  • [^)]* Negated character class [^...]. Match any character that is not a literal ) 0 or more times (*). NOTE: This means that empty parentheses will be matched and therefore deleted from the updated string.
  • \) Match literal )
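
As a minimal sketch of the behaviour described in these notes, the pattern can be wrapped in a small helper. The function name clean_title and the sample strings are made up for illustration. The sketch relies on Python 3.5 or later, where a group that did not take part in the match is substituted as an empty string.

```python
import re

# Group 1 keeps "(digits)". The second alternative matches any other
# parenthesized text, which is dropped because \1 is empty for it.
PATTERN = re.compile(r'([ ]?\(\d+\))|[ ]?\([^)]*\)')


def clean_title(title):
    """Remove parenthesized notes but keep purely numeric ones (hypothetical helper)."""
    return PATTERN.sub(r'\1', title)


print(clean_title('ALPHA (2005) LTD (In Relation)'))  # numeric part is kept
print(clean_title('BETA () LTD'))                     # empty parentheses are deleted
```

The second call illustrates the note on [^)]*: because the class may match zero characters, empty parentheses are removed as well.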

UPDATED REGEX PATTERN:

This updated pattern also removes one space character, if there is one, either before or after the string we want to remove.

This matters, for example, when the string we want to remove, (In Relation), is at the beginning of the test string and followed by a space, or directly follows a word without a space, e.g. (In Relation) THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD(In Relation), NANO CARE LIMITED(In Relation)

REGEX PATTERN (Python flavor):

([ ]?\(\d+\))|([ ])?(?(2)\([^)]*\)|\([^)]*\)[ ]?)

Regex demo: https://regex101.com/r/Peu1Fw/6
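
For completeness, here is a sketch of how the updated pattern could be used from Python, in the same way as the first snippet. Python's re module supports the conditional syntax (?(2)yes|no); the variable names are only illustrative.

```python
import re

title = ('(In Relation) THETA COMMERCIALS (2005) LIMITED, '
         'TEST CONNECTIONS LTD(In Relation), NANO CARE LIMITED(In Relation)')

# Group 2 records whether a leading space was consumed. The conditional
# (?(2)yes|no) then matches either the parentheses alone (yes branch) or
# the parentheses plus one optional trailing space (no branch).
pattern = r'([ ]?\(\d+\))|([ ])?(?(2)\([^)]*\)|\([^)]*\)[ ]?)'
updated_title = re.sub(pattern, r'\1', title)

print(f'NEW: "{updated_title}"')
```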

Question: what would be a better way to remove a space either before or after (not both) the string we want to remove, in Python or with regex (Python flavor)?
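
One possible way to do this without a conditional, offered as a sketch rather than a definitive answer, is to spell out the two cases as separate alternatives: a space followed by the parentheses, or the parentheses followed by an optional space. The variable names and the sample string are illustrative.

```python
import re

pattern = re.compile(
    r'([ ]?\(\d+\))'      # group 1: numeric parentheses, kept via \1
    r'|[ ]\([^)]*\)'      # a leading space plus the parentheses
    r'|\([^)]*\)[ ]?'     # or the parentheses plus an optional trailing space
)

sample = '(In Relation) THETA (2005) LTD (In Relation), NANO CARE LIMITED(In Relation)'
result = pattern.sub(r'\1', sample)
print(result)
```

Because the alternatives are tried from left to right, a leading space is preferred when there is one, and a trailing space is only consumed when there is no leading space.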


3 Comments

"Question, what would be...": The answer section is not intended for asking questions.
Thanks @rich, for the detailed explanation. works well.
Matches: FIFTY Number (2 and Number (1 LIMITED (In Relation) regex101.com/r/mSnhyR/1
1

This only requires the first character to be a capital letter or a digit.
The middle and end can be any of the class characters, optionally followed by digits in parentheses.

\b[A-Z0-9](?:[A-Z0-9\s&.-](?:\s*\(\d+\))?)*\b

https://regex101.com/r/0xVjur/1

\b 
[A-Z0-9] 
(?:
   [A-Z0-9\s&.-] 
   (?:
      \s* \( \d+ \) 
   )?
)*
\b 
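
As a sketch, this pattern could be dropped into the extract_name function from the question. Since every group is non-capturing, re.findall returns the full matches. The constant name NAME_PATTERN is made up here.

```python
import re

NAME_PATTERN = re.compile(r'\b[A-Z0-9](?:[A-Z0-9\s&.-](?:\s*\(\d+\))?)*\b')


def extract_name(title):
    # All groups are non-capturing, so findall returns the whole matches.
    matches = NAME_PATTERN.findall(title)
    return ', '.join(matches) if matches else None


title = ('THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), '
         'NANO CARE LIMITED (In Relation)')
print(extract_name(title))
```

The mixed-case text (In Relation) never matches: a lone capital such as I is followed by a lowercase letter, so the trailing \b fails there.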

Alternative regex: replace \s with space and tab in the class, and change the
trailing word boundary so that it also accepts a preceding close parenthesis,
which allows a trailing number in parentheses.

\b[A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*(?:\b|(?<=\)))

https://regex101.com/r/n83nVu/1
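
To illustrate the difference between the two patterns, here is a small comparison on a made-up title that ends in a parenthesized number. The sample string and variable names are assumptions for the sake of the example.

```python
import re

first = re.compile(r'\b[A-Z0-9](?:[A-Z0-9\s&.-](?:\s*\(\d+\))?)*\b')
alternative = re.compile(
    r'\b[A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*(?:\b|(?<=\)))'
)

sample = 'ALPHA HOLDINGS (2005)'  # hypothetical title ending in "(digits)"

first_matches = first.findall(sample)
alternative_matches = alternative.findall(sample)

print(first_matches)        # the name match stops before " (2005)"
print(alternative_matches)  # the whole title is one match
```

With the first pattern there is no word boundary after the closing parenthesis at the end of the string, so the engine backtracks and ends the name at ALPHA HOLDINGS. The lookbehind (?<=\)) in the alternative accepts that position.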


0

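This one should work for the example: it matches any parenthesized text that starts with a letter (or a hyphen), together with the single character in front of it (normally the space), and removes it from the text. The final strip() removes any leading or trailing whitespace.

.\(([A-Z-a-z].*?)\)

    import re

    def extract_name(title):
        name = re.sub(r'.\(([A-Z-a-z].*?)\)', '', title)
        return name.strip()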
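
One caveat with the pattern above: the leading . matches any character, not only a space, so when the parenthesis directly follows a word, the last letter of that word is removed too. A hedged variant, with the made-up name strip_notes, takes only whitespace before the parenthesis:

```python
import re


def strip_notes(title):
    # \s* takes only whitespace before the parenthesis, so no letter is lost.
    # [A-Za-z] requires the note to start with a letter, which keeps "(2005)".
    return re.sub(r'\s*\([A-Za-z][^)]*\)', '', title).strip()


# The pattern from the answer above, for comparison.
original = re.sub(r'.\(([A-Z-a-z].*?)\)', '', 'TEST CONNECTIONS LTD(In Relation)')

print(original)
print(strip_notes('TEST CONNECTIONS LTD(In Relation)'))
print(strip_notes('THETA COMMERCIALS (2005) LIMITED, NANO CARE LIMITED (In Relation)'))
```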
