I am trying to use re.findall to get all of the capitalized words and abbreviations in a string. I have worked out regular expressions that find each individually, but when I combine the two, I get back tuples containing an empty string alongside the item I wanted to find.

Here is the regular expression that does not work. I imagine it's a quick fix that I am just unaware of:

x = re.findall("([A-Z][A-Za-z]+\.?)|(\\b[A-Z](?:[\\.&]?[A-Z]){2,}\\b)", txt) #just has extra "" in each set

edit:

I am currently using this as my test case:

"USA. U.S.A America."

This is my output:

[('USA.', ''), ('', 'U.S.A'), ('America.', '')]
  • Could you share a sample value of txt? Commented Sep 2, 2021 at 23:41
  • Yes, I am currently using "USA. U.S.A America." as my test case. This is my output: [('USA.', ''), ('', 'U.S.A'), ('America.', '')] Commented Sep 2, 2021 at 23:44
  • What output were you expecting? Commented Sep 2, 2021 at 23:47
  • @NielGodfreyPonciano They can be lowercase, thank you for pointing that out, that wasn't in one of my test cases. Commented Sep 2, 2021 at 23:52
  • It's a good idea to show your strings as edits in the post body, with monospace font so there's no ambiguity as to the contents of the string. Commented Sep 2, 2021 at 23:54

2 Answers

In your regular expression you have two capturing groups (...), one for each alternative, so re.findall() returns a tuple with one entry per group. The group whose alternative did not match shows up as an empty string. This is useful if you need to match several parts of a string, or if you need to know which alternative was the one that matched.

To get just a flat list, you'll need to either omit those groups or turn them into non-capturing groups (?:...):

x = re.findall(r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[\.&]?[A-Z]){2,}\b", txt)

or, if the (...) were significant (or you want them for clarity):

x = re.findall(r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)

Either of these returns the value: ['USA.', 'U.S.A', 'America.']
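
As a minimal sketch of the difference, here is the same search run with and without capturing groups on the test string from the question. The names grouped and flat are only illustrative, and writing the patterns as raw strings is a style assumption, not something the question used:

```python
import re

txt = "USA. U.S.A America."

# Two capturing groups: findall returns one tuple per match, with an
# empty string for the group whose alternative did not take part.
grouped = re.findall(r"([A-Z][A-Za-z]+\.?)|(\b[A-Z](?:[\.&]?[A-Z]){2,}\b)", txt)

# Same alternatives with no capturing groups: findall returns the whole match.
flat = re.findall(r"[A-Z][A-Za-z]+\.?|\b[A-Z](?:[\.&]?[A-Z]){2,}\b", txt)

print(grouped)  # [('USA.', ''), ('', 'U.S.A'), ('America.', '')]
print(flat)     # ['USA.', 'U.S.A', 'America.']
```

If you do need to know which alternative matched, keeping the groups and checking which tuple entry is non-empty is one way to do it.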

3 Comments

Thank you! Now in another test case, where the abbreviation has dots throughout including at the end (such as "U.S.A."), the last period is not being captured. I changed my regex to x = re.findall("(?:[A-Z][A-Za-z]+\.?)|(?:\\b[A-Z](?:[\\.&]?[A-Z]){2,}\.?\\b)", txt) to try to catch the last period, as I do in the first alternative for normal capitalized words, but it doesn't seem to work the same way.
Probably a good idea to switch to raw strings, for clarity...
For debugging regular expressions, try a site like regex101?
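
On the trailing-period problem from the first comment: \b only matches between a word character and a non-word character. After a final "." followed by a space or the end of the string there is no word boundary, so the engine backtracks and leaves the period out. One possible workaround, offered as a sketch and not as the answerer's own fix, is to replace the final \b with the negative lookahead (?!\w). The variable names are hypothetical, and the patterns are written as raw strings as the second comment suggests:

```python
import re

txt = "USA. U.S.A. America."

# Pattern from the comment: after the optional "." the \b cannot match
# (both neighbours are non-word characters), so the period is given back.
with_boundary = re.findall(
    r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\.?\b)", txt)

# Sketch of a workaround: only require that no word character follows.
with_lookahead = re.findall(
    r"(?:[A-Z][A-Za-z]+\.?)|(?:\b[A-Z](?:[\.&]?[A-Z]){2,}\.?(?!\w))", txt)

print(with_boundary)   # ['USA.', 'U.S.A', 'America.']
print(with_lookahead)  # ['USA.', 'U.S.A.', 'America.']
```

The first alternative does not have this problem because it has no \b after its optional period.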

Use (?:...) to make a group non-capturing, as documented.

Here is a simplified regex that combines the following two searches:

  • Any word that starts with a capital letter
  • Any word that is an abbreviation/acronym marked by a separator dot (.)

Rather than capturing each search separately, we make each one non-capturing with (?:...) and wrap both in a single capturing group, i.e. ( (?:...) | (?:...) ), where the first (?:...) is the capital-letter search and the second (?:...) is the acronym search. With only one capturing group, re.findall() returns a flat list of strings.

import re

txt = "USA. U.S.A   America. arctic u.s.a Mars v.. A.b earth c.D.e. .pluto nep.tune. uranus. f.g.h.i Sun  "
# Raw string, so that \w and \. reach the regex engine unchanged
matches = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w\.]*))", txt)
print(matches)
['USA', 'U.S.A', 'America', 'u.s.a', 'Mars', 'A.b', 'c.D.e.', 'nep.tune.', 'f.g.h.i', 'Sun']

2 Comments

Ahh I see. I clearly have to review how capture groups work. I added a part to check to see if the capitalized word ends with a period, but everything else seems good. Really appreciate the help!
My answer omits the trailing period for capitalized words. If you don't like that, just change (?:[A-Z]\w+) to (?:[A-Z]\S+). In the example, this will change the captured 'USA' to 'USA.'.
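
To illustrate the \w+ versus \S+ change from the comment above, here is a small sketch on the question's shorter test string. The variable names are hypothetical:

```python
import re

txt = "USA. U.S.A America."

# Pattern from the answer: \w+ stops before the trailing period.
word_chars = re.findall(r"((?:[A-Z]\w+)|(?:\w+\.+\w+[\w\.]*))", txt)

# Variant from the comment: \S+ runs on to the next whitespace.
non_space = re.findall(r"((?:[A-Z]\S+)|(?:\w+\.+\w+[\w\.]*))", txt)

print(word_chars)  # ['USA', 'U.S.A', 'America']
print(non_space)   # ['USA.', 'U.S.A', 'America.']
```

Note that \S+ keeps any non-whitespace character, so it would also pick up other trailing punctuation such as a comma or a closing parenthesis.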
