Folks,
I am not an expert in regular expressions, and I've searched Google for my problem but haven't found a solution. If anybody finds another SO post with the same question, please feel free to point me to it.
Question:
I have text files in which many of the characters are HTML tags. These files may contain PDF filenames, as shown below. I just want to extract all such filenames with the .pdf extension. Note that these PDF filenames may appear anywhere in the document text, not only after a <FILENAME> prefix.
Example Text:
Example 1: <FILENAME>any_valid_characters_filename.pdf
Example 2: hello this is a good file abc-def_xyz-1.pdf
Note that <FILENAME> is a valid (HTML) tag in my text document. I want to extract the filenames any_valid_characters_filename.pdf and abc-def_xyz-1.pdf. The valid characters for a PDF filename are a-z, A-Z, _, -, ., and 0-9, but not special characters such as < or >.
What I have tried so far:
r'\b(\w+\.pdf)\b'
r'^\\(.+\\)*(.+)\.(.+)\.pdf$'
r'[^A-Za-z0-9_\.pdf]'
r'[\\/:"*?<>|]+\.pdf'
and a bunch of other regular expressions, but without success.
Any help would be appreciated. Thank you.
r'>([^\s\\/:"*?<>|]+\.pdf)\b'
... \s from the pattern. See regex demo. There is no error in my pattern, but I do not see what code you have; please add it to the question.
re.findall(r">([^\\/:"*?<>|]+\.pdf)\b", "<FILENAME>abc-1def.pdf") throws a syntax error.
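The syntax error in the re.findall call quoted above most likely comes from Python string quoting rather than from the regex itself: the pattern is written as r"..." but contains a literal " inside the character class, which ends the string early. Below is a minimal sketch of two possible fixes, assuming Python 3 and the standard re module. The names FIXED_PATTERN, PDF_NAME and extract_pdf_names are hypothetical, and the negative lookahead in PDF_NAME is an added assumption (it is not in the question) so that something like name.pdfx is not reported.

```python
import re

# Fix 1: keep the suggested pattern, but delimit the raw string with single
# quotes so the double quote inside the character class is plain content.
FIXED_PATTERN = r'>([^\\/:"*?<>|]+\.pdf)\b'
print(re.findall(FIXED_PATTERN, "<FILENAME>abc-1def.pdf"))
# prints ['abc-1def.pdf']

# Fix 2 (assumption): whitelist the characters the question lists as valid
# (letters, digits, "_", "-", ".") so the name can appear anywhere in the
# text, not only after ">". The lookahead rejects ".pdf" followed by another
# filename character.
PDF_NAME = re.compile(r"[A-Za-z0-9_.-]+\.pdf(?![A-Za-z0-9_-])")


def extract_pdf_names(text):
    """Return every PDF filename found in text, in order of appearance."""
    return PDF_NAME.findall(text)


sample = (
    "Example 1: <FILENAME>any_valid_characters_filename.pdf\n"
    "Example 2: hello this is a good file abc-def_xyz-1.pdf"
)
print(extract_pdf_names(sample))
# prints ['any_valid_characters_filename.pdf', 'abc-def_xyz-1.pdf']
```

The first pattern only finds names that directly follow a closing >, so it would miss Example 2. The whitelist version does not depend on the tag, at the cost of rejecting filenames that contain spaces or other characters outside the listed set.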