Issue with regular expressions returning an extra empty string

Question

I am trying to use re.findall to find all capitalized words and abbreviations. I have regular expressions that find each individually, but when I combine the two, I get back tuples containing an empty string alongside the item I wanted to find.

Here is the regular expression that does not work as expected. I imagine it's a quick fix that I am just unaware of:

x = re.findall("([A-Z][A-Za-z]+\.?)|(\\b[A-Z](?:[\\.&]?[A-Z]){2,}\\b)", txt) #just has extra "" in each set

edit:

I am currently using this as my test case:

"USA. U.S.A America."

This is my output:

[('USA.', ''), ('', 'U.S.A'), ('America.', '')]

Yes, I am currently using "USA. U.S.A America." as my test case. This is my output: [('USA.', ''), ('', 'U.S.A'), ('America.', '')]
– vynabhnnqwxleicntw, Commented Sep 2, 2021 at 23:44
@NielGodfreyPonciano They can be lowercase, thank you for pointing that out; that wasn't in one of my test cases.
– vynabhnnqwxleicntw, Commented Sep 2, 2021 at 23:52
It's a good idea to show your strings as edits in the post body, with monospace font so there's no ambiguity as to the contents of the string.
– ggorlen, Commented Sep 2, 2021 at 23:54

Jiří Baum · Accepted Answer · 2021-09-02 23:51:18Z

1

In your regular expression, you have two capturing groups (...), one for each alternative, so re.findall() returns a tuple per match with one entry per group; the group belonging to the alternative that did not match comes back as an empty string. This is useful if you need to match several parts of a string, or if you need to know which alternative was the one that matched.
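As a rough illustration of the "which alternative matched" case, the sketch below keeps both capturing groups and uses re.finditer with Match.lastindex to label each hit. The labels dictionary and the variable names are hypothetical, chosen only for this example; the pattern is the one from the question, rewritten as a raw string.

```python
import re

txt = "USA. U.S.A America."

# Pattern from the question as a raw string; [.&] is equivalent to [\.&].
pattern = r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[.&]?[A-Z]){2,}\b)"

# With two capturing groups, findall returns one tuple per match.
tuples = re.findall(pattern, txt)
print(tuples)  # [('USA.', ''), ('', 'U.S.A'), ('America.', '')]

# Hypothetical labels for the two groups (names are an assumption).
labels = {1: "word", 2: "abbreviation"}

# lastindex is the number of the last capturing group that matched.
tagged = [(m.group(0), labels[m.lastindex]) for m in re.finditer(pattern, txt)]
print(tagged)  # [('USA.', 'word'), ('U.S.A', 'abbreviation'), ('America.', 'word')]
```

Note that 'USA.' is labelled as a word here because the first alternative matches it before the abbreviation alternative is tried.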

In order to get just a flat list, you'll need to either omit those or turn them into non-capturing (?:...):

x = re.findall(r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[\.&]?[A-Z]){2,}\b", txt)

or, if the (...) were significant (or you want them for clarity):

x = re.findall(r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)

Either of these returns the value: ['USA.', 'U.S.A', 'America.']
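A minimal sketch comparing the options on the test string from the question. The third variant, joining each tuple, is not from the answer; it is an extra possibility that relies on exactly one group being non-empty per match. Variable names are illustrative only.

```python
import re

txt = "USA. U.S.A America."

capturing = r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[.&]?[A-Z]){2,}\b)"
bare = r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[.&]?[A-Z]){2,}\b"
non_capturing = r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[.&]?[A-Z]){2,}\b)"

# No capturing groups: findall returns the full text of each match.
flat_bare = re.findall(bare, txt)
flat_non_capturing = re.findall(non_capturing, txt)

# Capturing groups kept: collapse each tuple, since the unused group is ''.
flat_joined = ["".join(groups) for groups in re.findall(capturing, txt)]

print(flat_bare)           # ['USA.', 'U.S.A', 'America.']
print(flat_non_capturing)  # ['USA.', 'U.S.A', 'America.']
print(flat_joined)         # ['USA.', 'U.S.A', 'America.']
```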

answered Sep 2, 2021 at 23:51

Jiří Baum


3 Comments

vynabhnnqwxleicntw Over a year ago

Thank you! Now, in another test case where the abbreviation has dots throughout, including at the end (such as "U.S.A."), the last period is not being captured. I changed my regex to x = re.findall("(?:[A-Z][A-Za-z]+\.?)|(?:\\b[A-Z](?:[\\.&]?[A-Z]){2,}\.?\\b)", txt) to try to catch the last period, as I do in the first part of the pattern for normal capitalized words, but it doesn't seem to work the same way.

Jiří Baum Over a year ago

Probably a good idea to switch to raw strings, for clarity...

Jiří Baum Over a year ago

For debugging regular expressions, try a site like regex101?
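A possible explanation for the missing trailing period, offered as a sketch rather than a reply from the thread: \b only matches between a word character and a non-word character, and both "." and a following space (or the end of the string) are non-word, so \.?\b cannot succeed with the period included and the engine backtracks to drop it. Moving the optional period after the boundary is one way around this; the variable names below are hypothetical.

```python
import re

txt = "USA. U.S.A. America."

# Variant from the comment above: optional period BEFORE the word boundary.
period_before_boundary = (
    r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[.&]?[A-Z]){2,}\.?\b)"
)

# Assumed fix: optional period AFTER the word boundary.
period_after_boundary = (
    r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[.&]?[A-Z]){2,}\b\.?)"
)

before = re.findall(period_before_boundary, txt)
after = re.findall(period_after_boundary, txt)

print(before)  # ['USA.', 'U.S.A', 'America.']
print(after)   # ['USA.', 'U.S.A.', 'America.']
```

Both patterns are written as raw strings, so \b and \. reach the regex engine without doubled backslashes.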

Niel Godfrey P. Ponciano · Accepted Answer · 2021-09-03 00:55:22Z

0

Use (?:...) to make a group non-capturing, as documented.

Here is a simplified regex that combines the following two searches:

Any word that starts with a capital letter
Any word that is an abbreviation/acronym marked by a separator dot (.)

We don't capture each search individually; each one is wrapped in a non-capturing (?:...) group. Instead, a single outer group captures the result of whichever search matched, e.g. ( (?:...) | (?:...) ), where the first (?:...) is the capital letter search and the second (?:...) is the acronym search.

import re

txt = "USA. U.S.A   America. arctic u.s.a Mars v.. A.b earth c.D.e. .pluto nep.tune. uranus. f.g.h.i Sun  "
matches = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w\.]*))", txt)
print(matches)

['USA', 'U.S.A', 'America', 'u.s.a', 'Mars', 'A.b', 'c.D.e.', 'nep.tune.', 'f.g.h.i', 'Sun']

edited Sep 3, 2021 at 0:55

answered Sep 3, 2021 at 0:21

Niel Godfrey P. Ponciano


2 Comments

vynabhnnqwxleicntw Over a year ago

Ahh I see. I clearly have to review how capture groups work. I added a part to check to see if the capitalized word ends with a period, but everything else seems good. Really appreciate the help!

Niel Godfrey P. Ponciano Over a year ago

My answer omits the trailing period for capitalized words. If you don't want that, just change (?:[A-Z]\w+) to (?:[A-Z]\S+). In the example, this changes the captured 'USA' to 'USA.'.
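A short sketch of that change on the same test string, with both patterns written as raw strings. The variable names and the extra "Paris, France" check are assumptions added for illustration; the latter shows that \S+ keeps any trailing non-whitespace character, not only a period.

```python
import re

txt = "USA. U.S.A   America. arctic u.s.a Mars v.. A.b earth c.D.e. .pluto nep.tune. uranus. f.g.h.i Sun  "

# Original pattern: \w+ stops before the trailing period.
with_w = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w.]*))", txt)

# Suggested change: \S+ runs up to the next whitespace character.
with_S = re.findall(r"((?:[A-Z]\S+)|(?:\w+\.+\w+[\w.]*))", txt)

print(with_w[:3])  # ['USA', 'U.S.A', 'America']
print(with_S[:3])  # ['USA.', 'U.S.A', 'America.']

# Side effect of \S+: other trailing punctuation is kept as well.
print(re.findall(r"[A-Z]\S+", "Paris, France"))  # ['Paris,', 'France']
```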
