1

Folks,

I am not an expert in regular expressions, and I've searched Google for my problem but haven't found a solution. If anybody finds another SO post with the same question, please feel free to point me to it.

Question:

I have text files in which many of the characters are HTML tags. These text files may contain PDF filenames as shown below. I just want to extract all such filenames with the .pdf extension. Note that these PDF filenames may appear anywhere in the text, not only after the <FILENAME> prefix.

Example Text:

Example 1: <FILENAME>any_valid_characters_filename.pdf
Example 2: hello this is a good file abc-def_xyz-1.pdf

Note that <FILENAME> is a valid (HTML) tag in my text document. I want to extract the filenames any_valid_characters_filename.pdf and abc-def_xyz-1.pdf. The valid characters for a PDF filename are a-z, A-Z, _, -, ., and 0-9, but not special characters such as < or >.

What I have tried so far:

r'\b(\w+\.pdf)\b'
r'^\\(.+\\)*(.+)\.(.+)\.pdf$'
r'[^A-Za-z0-9_\.pdf]' 
r'[\\/:"*?<>|]+\.pdf'

and a bunch of other regular expressions, but without success.

Any help would be appreciated. Thank you.

13
  • If the filenames cannot contain whitespace, you may use something like r'>([^\s\\/:"*?<>|]+\.pdf)\b' Commented Nov 18, 2018 at 20:43
  • Let's assume filenames may contain whitespace (although unlikely). Does this still work? Commented Nov 18, 2018 at 20:44
  • @WiktorStribiżew Using your regex, I get a syntax error at the ? character. Commented Nov 18, 2018 at 20:45
  • Remove \s from the pattern. See regex demo. There is no error in my pattern; the error is in your code, which I cannot see. Please add it to the question. Commented Nov 18, 2018 at 20:45
  • Using re.findall(r">([^\\/:"*?<>|]+\.pdf)\b", "<FILENAME>abc-1def.pdf") throws the syntax error. Commented Nov 18, 2018 at 20:48
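
The syntax error reported in these comments looks like a quoting problem rather than a regex problem: the pattern contains a " inside a double-quoted string literal, so Python ends the string at that character. A minimal sketch of the same call with the pattern in a single-quoted raw string (the helper name extract_after_tag is made up for illustration):

```python
import re

# Same pattern as in the comment above, but wrapped in single quotes so the
# double quote inside the character class no longer terminates the literal.
PATTERN = re.compile(r'>([^\\/:"*?<>|]+\.pdf)\b')


def extract_after_tag(text):
    """Return .pdf names that directly follow a '>' character."""
    return PATTERN.findall(text)


print(extract_after_tag("<FILENAME>abc-1def.pdf"))
```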

2 Answers

3

I think the following expression covers everything you mentioned:

r"([\w\d\-.]+\.pdf)"

It matches any run of word characters, digits, hyphens, and dots that is followed by .pdf.
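
As a quick check, here is a small sketch that runs this expression over the two example lines from the question. The function name find_pdf_names is only illustrative, and the sample strings are copied from the question.

```python
import re

# The expression from this answer. Note that \d is redundant here because
# \w already covers digits; it is kept to match the answer as written.
PDF_NAME = re.compile(r"([\w\d\-.]+\.pdf)")


def find_pdf_names(text):
    """Return every .pdf filename found anywhere in the text."""
    return PDF_NAME.findall(text)


samples = [
    "<FILENAME>any_valid_characters_filename.pdf",
    "hello this is a good file abc-def_xyz-1.pdf",
]
for line in samples:
    print(find_pdf_names(line))
```

Because < and > are not in the character class, the tag itself is never part of a match.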


3 Comments

Thanks. If I only want to match the first example case (and ignore the second), what changes do I need? Note that if there is any whitespace between "<FILENAME>" and ".pdf" in Example 1, I want to ignore that match.
Try r"<FILENAME>\s*([\w\d\-.]+\.pdf)"; it will match <FILENAME> followed by a valid PDF name.
Thanks for your quick reply. But this doesn't work when there is a space after > and before the start of a word or filename. Check here: ideone.com/xPQTYI
1

Can this work?

\b[^\s<>]*?\.pdf\b

It works for your examples: https://regexr.com/43b8q

Update for your new requirement that no space exists between <FILENAME> and whatever.pdf:

Use: \b(?<![<>][\s]|\w)[\w-]*?\.pdf\b

Example: https://regex101.com/r/O3kpQ4/2/
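
One caveat if this is run in Python: the built-in re module requires every lookbehind to have a fixed width, and the alternation in the lookbehind above mixes a two-character branch with a one-character branch, so re.compile rejects it. A sketch of an equivalent form splits it into two separate negative lookbehinds (the dot before pdf is escaped, and the function name is illustrative):

```python
import re

# (?<!A|B) is rewritten as (?<!A)(?<!B) so that each lookbehind has a
# fixed width, which Python's re module requires.
PATTERN = re.compile(r"\b(?<![<>]\s)(?<!\w)[\w-]*?\.pdf\b")


def find_unspaced_pdf_names(text):
    """Return .pdf names that are not preceded by '<' or '>' plus whitespace."""
    return PATTERN.findall(text)


print(find_unspaced_pdf_names("<FILENAME>abc-1def.pdf"))
print(find_unspaced_pdf_names("<FILENAME> abc.pdf"))
```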

4 Comments

Thanks. If I only want to match the first example case (and ignore the second), what changes do I need? Note that if there is any whitespace between "<FILENAME>" and ".pdf" in Example 1, I want to ignore that match.
Try this: \b(?<![<>][\s]|\w)[\w-]*?\.pdf\b
Thanks. But I want to ignore Example 2 in your updated link, because it doesn't have the <FILENAME> prefix.
All of this should go into your question description. Now the regex will be completely different, and people can't help you if you keep moving the goalposts. Here: \b(?<=<.*>)[\w-]*?\.pdf\b
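
A note for Python users: (?<=<.*>) is a variable-width lookbehind, which the built-in re module does not accept. One way to get a similar effect there is to match the tag normally and capture only the filename. The sketch below assumes that any tag of the form <...> immediately before the name should count, and it allows dots inside the name as the question describes; the function name is made up.

```python
import re

# Match a tag such as <FILENAME>, then capture the filename that follows it
# with no whitespace in between. Only the captured group is returned.
TAGGED_PDF = re.compile(r"<[^<>]+>([\w.-]+\.pdf)\b")


def find_tagged_pdf_names(text):
    """Return .pdf names that come directly after a <...> tag."""
    return TAGGED_PDF.findall(text)


print(find_tagged_pdf_names("<FILENAME>any_valid_characters_filename.pdf"))
print(find_tagged_pdf_names("hello this is a good file abc-def_xyz-1.pdf"))
```

Using a capture group instead of a lookbehind means the tag is consumed by the match, which is usually fine when only the filenames are needed.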
