I am trying to write a pattern that captures each CNPJ group inside this block of text, with the condition that the match must start with executados: and end with a CNPJ group. However, my pattern always captures only the last group, and I don't know how to make it work.

The answer to "getting specific groups of patterns inside a block text" does not work for me!

regex101

pattern: (?:executados\:)[\p{L}\s\D\d]+CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+\/\d+-\d+)

string to test:

Dados dos executados:
1. FOO TEST STRING LTDA., CNPJ: 88.888.888/8888-88,
2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
Como medida de economia e celeridade processuais, atribuo a

I would like to get the values {'cnpj': ['88.888.888/8888-88', '99.999.999/9999-99']}, but this way I get only the last one.

  • Use a regular approach like ideone.com/tVQC61 Commented Nov 22, 2021 at 18:01
  • @WiktorStribiżew I saw it, but I need the condition to be respected: in this case, not simply get every CNPJ group, but get all CNPJ groups after executados: Commented Nov 22, 2021 at 18:05
  • Yes, and you get only those! Did you notice text[text.index("executados:"):]? Commented Nov 22, 2021 at 18:10
  • Hmm, sorry, I see it now! But is it possible to specify it in the pattern instead of in code? Commented Nov 22, 2021 at 18:16
  • Only as TheFourthBird showed, with the PyPI regex module. See this demo. Commented Nov 22, 2021 at 18:16
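
For readers who cannot open the linked demo, the slicing idea from the comments can be sketched roughly like this with the standard re module. This is only a minimal sketch, not the exact code behind the ideone link; the helper name cnpjs_after_marker and the CNPJ_RE constant are made up for illustration.

```python
import re

text = """Dados dos executados:
1. FOO TEST STRING LTDA., CNPJ: 88.888.888/8888-88,
2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
Como medida de economia e celeridade processuais, atribuo a"""

# Same CNPJ sub-pattern as in the question, without the leading condition.
CNPJ_RE = re.compile(r"CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)")


def cnpjs_after_marker(text, marker="executados:"):
    """Hypothetical helper: return every CNPJ found after `marker`."""
    start = text.find(marker)
    if start == -1:
        # The marker is absent, so the condition is not met at all.
        return []
    # Only search the part of the text that follows the marker.
    return CNPJ_RE.findall(text[start + len(marker):])


result = {"cnpj": cnpjs_after_marker(text)}
print(result)
```

The condition "must come after executados:" is enforced by the slice, so the pattern itself only has to describe a single CNPJ group.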

1 Answer

You can use the PyPI regex module with a pattern like

(?s)(?<=executados:.*?)CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)

See the regex demo.

Here is the Python demo:

import regex
text = """Dados dos executados:
1. FOO TEST STRING LTDA., CNPJ: 99.999.999/9999-99,
2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
Como medida de economia e celeridade processuais, atribuo a"""
print( regex.findall(r'(?s)(?<=executados:.*?)CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)', text) )

yielding

['99.999.999/9999-99', '99.999.999/9999-99']

The regex matches

  • (?s) - regex.DOTALL, enables . to match line break chars
  • (?<=executados:.*?) - a positive lookbehind: somewhere before the current location, there must be executados: followed by any zero or more chars, as few as possible
  • CNPJ - a fixed string
  • \W+ - one or more non-word chars
  • (\d+\.\d+\.\d+/\d+-\d+) - Group 1, which is what regex.findall returns: one or more digits followed by a . (twice), then one or more digits, /, one or more digits, - and one or more digits.
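
To illustrate the last point, here is a small sketch of how findall treats a single capturing group. It uses the standard re module rather than regex, since this part of the pattern has no lookbehind and findall handles groups the same way here; the sample string is made up for illustration.

```python
import re

# Hypothetical one-line sample containing two CNPJ entries.
sample = "FOO LTDA., CNPJ: 88.888.888/8888-88, BAR - ME, CNPJ: 99.999.999/9999-99,"

# With exactly one capturing group, findall returns only the group contents.
with_group = re.findall(r"CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)", sample)

# Without a capturing group, findall returns the whole match instead.
without_group = re.findall(r"CNPJ\W+\d+\.\d+\.\d+/\d+-\d+", sample)

print(with_group)     # ['88.888.888/8888-88', '99.999.999/9999-99']
print(without_group)  # ['CNPJ: 88.888.888/8888-88', 'CNPJ: 99.999.999/9999-99']
```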

1 Comment

The regex module is great and definitely works in some situations. However, Python's official re module does not support variable-width lookbehind: its documentation requires a lookbehind pattern to match strings of a fixed length. It might be better to use a fixed-width lookbehind (i.e., ((?<=executados).)*), which works with the official re module. re is also more likely to be stable than its counterparts, since CPython has 40k+ stars, while regex has merely dozens.
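
The limitation mentioned in this comment can be checked with a short sketch like the one below. It only shows that re rejects the answer's lookbehind and accepts a fixed-width one; the variable names and the short sample string are assumptions made for illustration.

```python
import re

# The answer's lookbehind contains .*?, so its width is not fixed.
# The standard re module refuses to compile it.
try:
    re.compile(r"(?s)(?<=executados:.*?)CNPJ")
    variable_width_ok = True
except re.error as exc:
    variable_width_ok = False
    print("re rejected the pattern:", exc)

# A fixed-width lookbehind compiles, but it only looks at the characters
# immediately before the current position.
fixed = re.compile(r"(?<=executados:)\s*(\S+)")
match = fixed.search("Dados dos executados:\n1. FOO TEST")
first_token = match.group(1) if match else None
print(first_token)
```

Because a fixed-width lookbehind cannot skip an arbitrary amount of text, with re the "after executados:" check usually ends up in a separate step, for example by slicing the text first.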
