Python regular expression for text search

Question

I am trying to extract the wanted text from a given set of titles. I have created the function below.

    import re

    def extract_name(title):
        matches = re.findall(r'\b[A-Z0-9\s&.,()-]+(?:\s*\(\d\))?\b', title)
        return ', '.join(matches) if matches else None

However, it produces unwanted "(, ," fragments for some titles. For example, my titles look like this:

THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)

Expected outcome: THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED

rich neadle · Accepted Answer · 2025-04-17 20:41:23Z

2

Instead of using re.findall(), I recommend that you use re.sub() to remove the unwanted parts. With this pattern you can explicitly define what you want to keep and what you do not want to keep, and you can add other alternatives to reflect that.

In this pattern, you first match (and capture) what you want to keep, and then you match what you DO NOT want to keep. The replacement is \1, so a match of the first alternative is replaced with itself (it is kept), while a match of the second alternative leaves group 1 empty and is replaced with nothing, i.e. what you DO NOT want to keep is deleted. The regex engine tries the alternatives from left to right at each position, so the second alternative is only tried if the first alternative does not match there.

REGEX PATTERN (Python flavor):

([ ]?\(\d+\))|[ ]?\([^)]*\)

Regex demo: https://regex101.com/r/Peu1Fw/4

CODE PYTHON (with re module):

title = 'THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)'

import re
pattern = r'([ ]?\(\d+\))|[ ]?\([^)]*\)'
replacement = r'\1'
updated_title = re.sub(pattern, replacement, title)

print(f'OLD: "{title}"')
print(f'NEW: "{updated_title}"')
print('EXP: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"')

RESULT:

OLD: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), NANO CARE LIMITED (In Relation)"
NEW: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"
EXP: "THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD, NANO CARE LIMITED"

REGEX PATTERN NOTES:

Begin first alternative to capture what you want to keep:
- ( Begin first capture group (...), group 1. Referred to as \1 in the replacement string.
- [ ]? Match one literal space character 0 or 1 times (?).
- \( Match literal (
- \d+ Match a digit 1 or more times (+).
- \) Match literal )
- ) End group 1 (\1).
| OR in alternation, ...|....
Begin second alternative to match what you want deleted:
- [ ]? Match one literal space character 0 or 1 times (?).
- \( Match literal (
- [^)]* Negated character class [^...]. Match any character that is not a literal ) 0 or more times (*). NOTE: This means that empty parentheses will be matched and therefore deleted from the updated string.
- \) Match literal )
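
The notes above can be checked against a few edge cases. This is a minimal sketch, not part of the original answer: the helper name strip_parentheticals and the sample strings are chosen for illustration (the last one is the string quoted in one of the comments on this answer). Replacing an unmatched group 1 with an empty string relies on Python 3.5 or later.

```python
import re

# Pattern from the answer: group 1 keeps " (digits)"; the second alternative
# matches any other parenthesized text, which is then replaced with nothing.
PATTERN = re.compile(r'([ ]?\(\d+\))|[ ]?\([^)]*\)')

def strip_parentheticals(title):  # hypothetical helper name
    return PATTERN.sub(r'\1', title)

# Digits in parentheses are captured by group 1 and kept.
kept_digits = strip_parentheticals('THETA COMMERCIALS (2005) LIMITED')
# Empty parentheses fall through to the second alternative and are deleted.
empty_parens = strip_parentheticals('NANO CARE () LIMITED')
# Unclosed parentheses: [^)]* runs across later "(" up to the first ")".
unclosed = strip_parentheticals('FIFTY Number (2 and Number (1 LIMITED (In Relation)')

print(repr(kept_digits))
print(repr(empty_parens))
print(repr(unclosed))
```

The last case shows a limit of the negated class: everything from the first unclosed ( up to the first ) is treated as one parenthesized block and removed.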

UPDATED REGEX PATTERN

This updated pattern removes one space character, if there is one, either before or after the string we want to remove.

For example, the string we want to remove, (In Relation), may be at the beginning of the test string followed by a space, or directly after a word with no space in between, e.g. (In Relation) THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD(In Relation), NANO CARE LIMITED(In Relation)

REGEX PATTERN (Python flavor):

([ ]?\(\d+\))|([ ])?(?(2)\([^)]*\)|\([^)]*\)[ ]?)

Regex demo: https://regex101.com/r/Peu1Fw/6

Question: what would be a better way to remove a space either before or after (not both) the string we want to remove, in Python or with regex (Python flavor)?
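
One possible way to express "a space before or after, but not both" without the conditional group is to list the two cases as plain alternatives. The sketch below is an assumption, not part of the original answer; the names CONDITIONAL, ALTERNATION, and clean are illustrative only. It runs both patterns on the two sample strings from this answer.

```python
import re

# Updated pattern from the answer, using a conditional on group 2.
CONDITIONAL = r'([ ]?\(\d+\))|([ ])?(?(2)\([^)]*\)|\([^)]*\)[ ]?)'
# Assumed alternative: try "space + parentheses" first,
# then "parentheses + optional trailing space".
ALTERNATION = r'([ ]?\(\d+\))|[ ]\([^)]*\)|\([^)]*\)[ ]?'

samples = [
    'THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), '
    'NANO CARE LIMITED (In Relation)',
    '(In Relation) THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD(In Relation), '
    'NANO CARE LIMITED(In Relation)',
]

def clean(pattern, text):  # hypothetical helper
    # Group 1 is kept; every other match is replaced with an empty string.
    return re.sub(pattern, r'\1', text)

results = [(clean(CONDITIONAL, s), clean(ALTERNATION, s)) for s in samples]
for with_conditional, with_alternation in results:
    print(with_conditional == with_alternation, repr(with_alternation))
```

On these two inputs both patterns give the same cleaned title; whether the plain alternation is "better" is a matter of readability, since it avoids the less familiar (?(2)...) syntax.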

edited Apr 17 at 20:41

answered Apr 4 at 3:09

rich neadle


3 Comments

trincot Apr 4 at 6:11

"Question, what would be...": The answer section is not intended for asking questions.

Totura Apr 5 at 6:27

Thanks @rich, for the detailed explanation. works well.

sln Apr 6 at 17:38

Matches: FIFTY Number (2 and Number (1 LIMITED (In Relation) regex101.com/r/mSnhyR/1

sln · Accepted Answer · 2025-04-04 20:27:34Z

1

This just requires the first character to be a capital letter or a digit.
The middle and end can be any of the characters in the class, each optionally followed by digits in parentheses.

\b[A-Z0-9](?:[A-Z0-9\s&.-](?:\s*\(\d+\))?)*\b

https://regex101.com/r/0xVjur/1

\b 
[A-Z0-9] 
(?:
   [A-Z0-9\s&.-] 
   (?:
      \s* \( \d+ \) 
   )?
)*
\b
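
As a rough check of the pattern above, the sketch below runs it with re.findall() on the title from the question. It is not part of the original answer; the variable names are illustrative, and compiling the expanded layout with re.VERBOSE is an assumption about how one might use it in Python.

```python
import re

title = ('THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), '
         'NANO CARE LIMITED (In Relation)')

# Compact form of the pattern from the answer.
compact = re.compile(r'\b[A-Z0-9](?:[A-Z0-9\s&.-](?:\s*\(\d+\))?)*\b')

# Same pattern in the expanded layout; re.VERBOSE ignores the whitespace
# outside character classes, so the line breaks and indentation are harmless.
expanded = re.compile(r'''
    \b
    [A-Z0-9]
    (?:
       [A-Z0-9\s&.-]
       (?:
          \s* \( \d+ \)
       )?
    )*
    \b
''', re.VERBOSE)

# The pattern has no capturing groups, so findall() returns whole matches.
names = compact.findall(title)
names_expanded = expanded.findall(title)
print(', '.join(names))
```

Because lowercase letters are not in the class, "In Relation" never forms a match that ends on a word boundary, so only the upper-case names come back.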

As an alternative, replace \s in the class with a space and a tab, and
extend the trailing word boundary so that it also accepts a preceding close
parenthesis, which allows a name to end with a number in parentheses.

\b[A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*(?:\b|(?<=\)))

https://regex101.com/r/n83nVu/1
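
The difference between the two patterns shows up when a name ends with a number in parentheses. The sketch below is an illustration only: the sample string is made up, and the variable names are not from the original answer.

```python
import re

# Hypothetical sample with a parenthesized number at the very end of a name.
text = 'NUMBER ONE (1), OTHER LTD (In Relation)'

# First pattern from the answer: must end on a word boundary.
first = re.compile(r'\b[A-Z0-9](?:[A-Z0-9\s&.-](?:\s*\(\d+\))?)*\b')
# Alternative pattern: may also end right after a close parenthesis.
second = re.compile(r'\b[A-Z0-9-](?:[A-Z0-9 \t&.-](?:\s*\(\d+\))?)*(?:\b|(?<=\)))')

first_names = first.findall(text)
second_names = second.findall(text)

print(first_names)   # does not contain 'NUMBER ONE (1)' as one match
print(second_names)
```

With the first pattern there is no word boundary between ")" and ",", so the engine backtracks and the trailing "(1)" is not part of the name. The lookbehind (?<=\)) in the second pattern accepts that position.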

edited Apr 4 at 20:27

answered Apr 4 at 19:54

sln


user11595058 · Accepted Answer · 2025-04-04 03:28:42Z

0

This one works for the sample title: it matches parenthesized text that begins with a letter (or a hyphen) and removes it together with the one character in front of it; strip() then trims any leading or trailing whitespace from the result.

.\(([A-Z-a-z].*?)\)

    import re

    def extract_name(title):
        # Remove "(...)" groups that start with a letter, plus the one character before them.
        name = re.sub(r'.\(([A-Z-a-z].*?)\)', '', title)
        return name.strip()
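
One caveat, shown as a sketch: the leading . consumes whatever character precedes the parenthesis, which is only harmless when that character is a space. The variant pattern below, which uses \s* instead, is an assumption and not part of the original answer; the pattern parameter and variable names are illustrative.

```python
import re

original = r'.\(([A-Z-a-z].*?)\)'   # pattern from the answer
assumed = r'\s*\([A-Za-z][^)]*\)'   # assumed variant, not from the answer

def extract_name(title, pattern=original):
    # Delete every match of the pattern, then trim whitespace at both ends.
    return re.sub(pattern, '', title).strip()

no_space = 'TEST CONNECTIONS LTD(In Relation)'
with_original = extract_name(no_space)          # the "." eats the final "D"
with_assumed = extract_name(no_space, assumed)  # only optional whitespace is removed

title = ('THETA COMMERCIALS (2005) LIMITED, TEST CONNECTIONS LTD (In Relation), '
         'NANO CARE LIMITED (In Relation)')
title_assumed = extract_name(title, assumed)

print(repr(with_original))
print(repr(with_assumed))
print(repr(title_assumed))
```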

edited Apr 4 at 3:28

answered Apr 4 at 3:22

user11595058
