Extract filename from text file using regex

Question

Folks,

I am not an expert in regular expressions, and I've searched Google for my problem but haven't found a solution. If anybody finds another SO post with the same question, please feel free to point me to it.

Question:

I have text files that consist largely of HTML tags. These files may contain PDF filenames, as shown below. I want to extract every such filename with a .pdf extension. Note that these filenames may appear anywhere in the text, not only after the <FILENAME> prefix.

Example Text:

Example 1: <FILENAME>any_valid_characters_filename.pdf
Example 2: hello this is a good file abc-def_xyz-1.pdf

Note that <FILENAME> is a valid (HTML) tag in my text document. I want to extract the filenames any_valid_characters_filename.pdf and abc-def_xyz-1.pdf. Valid characters for a PDF filename are a-z, A-Z, _, -, ., and 0-9, but not special characters such as < or >.

What I have tried so far:

r'\b(\w+\.pdf)\b'
r'^\\(.+\\)*(.+)\.(.+)\.pdf$'
r'[^A-Za-z0-9_\.pdf]' 
r'[\\/:"*?<>|]+\.pdf'

and a bunch of other regular expressions, but without success.

Any help would be appreciated. Thank you.
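
For reference, a quick check of the first attempt shows where it falls short: \w covers letters, digits and underscores but not hyphens, so a hyphenated name is cut down to its last segment. A minimal sketch, assuming Python's re module and using the string from Example 2:

```python
import re

sample = "hello this is a good file abc-def_xyz-1.pdf"

# \w matches letters, digits and "_" but not "-",
# so the match cannot extend backwards across a hyphen
first_attempt = re.findall(r'\b(\w+\.pdf)\b', sample)
print(first_attempt)  # ['1.pdf']
```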

If the filenames cannot contain whitespace, you may use something like r'>([^\s\\/:"*?<>|]+\.pdf)\b'
– Wiktor Stribiżew, Commented Nov 18, 2018 at 20:43
Let's assume filenames may contain whitespace (although unlikely). Does this still work?
– Saurabh Gokhale, Commented Nov 18, 2018 at 20:44
@WiktorStribiżew Using your regex, I get a syntax error at the ? character.
– Saurabh Gokhale, Commented Nov 18, 2018 at 20:45
Remove \s from the pattern. See regex demo. There is no error in my pattern; the error is in your code. I cannot see what code you have, so please add it to the question.
– Wiktor Stribiżew, Commented Nov 18, 2018 at 20:45
Using re.findall(r">([^\\/:"*?<>|]+\.pdf)\b", "<FILENAME>abc-1def.pdf") throws the syntax error.
– Saurabh Gokhale, Commented Nov 18, 2018 at 20:48
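
The syntax error in the last comment most likely comes from the string delimiters rather than from the regex: the pattern itself contains a double quote, so wrapping it in double quotes ends the Python string early, and the interpreter then trips over the *? that follows. A minimal sketch of the same call with the raw string delimited by single quotes, as in the original suggestion:

```python
import re

# The pattern contains a literal double quote, so the raw string
# is delimited with single quotes to keep the literal intact
pattern = r'>([^\\/:"*?<>|]+\.pdf)\b'

matches = re.findall(pattern, "<FILENAME>abc-1def.pdf")
print(matches)  # ['abc-1def.pdf']
```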

Aurora Wang · Accepted Answer · 2018-11-18 21:20:31Z

3

I think the following expression covers everything you mentioned:

r"([\w\d\-.]+\.pdf)"

It matches any run of word characters (\w, which already covers digits and _), hyphens, and dots, followed by .pdf.
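
A short sketch of this expression applied to the two example lines from the question, assuming Python's re module (the variable names are illustrative):

```python
import re

text = """Example 1: <FILENAME>any_valid_characters_filename.pdf
Example 2: hello this is a good file abc-def_xyz-1.pdf"""

# Character class: word characters (letters, digits, "_"), "-" and "."
pdf_name = re.compile(r"([\w\d\-.]+\.pdf)")

found = pdf_name.findall(text)
print(found)  # ['any_valid_characters_filename.pdf', 'abc-def_xyz-1.pdf']
```

Because the pattern has no trailing \b, it would also pick up the x.pdf part of a string such as x.pdfx; appending \b is one way to tighten it.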

answered Nov 18, 2018 at 21:20

Aurora Wang


3 Comments

Saurabh Gokhale Over a year ago

Thanks. If I only want to match the first example case (and ignore the second), what changes do I need? Note that there must be no whitespace between "<FILENAME>" and the filename in example 1. If there is whitespace between "<FILENAME>" and the filename, I want to skip that match.

Aurora Wang Over a year ago

Try r"<FILENAME>\s*([\w\d\-.]+\.pdf)", it will match <FILENAME> followed by a valid pdf name.

Saurabh Gokhale Over a year ago

Thanks for your quick reply. But this doesn't work when there is a space after > and before the start of a word or filename. Check here: ideone.com/xPQTYI
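
The \s* in the suggested pattern explicitly permits whitespace after the tag, which is why the spaced case still matches. A possible variant, sketched here under the assumption that the filename must start immediately after <FILENAME>, simply drops the \s* (the sample lines are hypothetical):

```python
import re

# Hypothetical sample lines for illustration
lines = [
    "<FILENAME>any_valid_characters_filename.pdf",  # name directly after the tag
    "<FILENAME> spaced_name.pdf",                    # whitespace after the tag
    "hello this is a good file abc-def_xyz-1.pdf",  # no tag at all
]

# No \s* after the tag: the filename must begin right after ">"
tagged_pdf = re.compile(r"<FILENAME>([\w\-.]+\.pdf)\b")

tagged = [m.group(1) for line in lines for m in tagged_pdf.finditer(line)]
print(tagged)  # ['any_valid_characters_filename.pdf']
```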

Rocky Li · Accepted Answer · 2018-11-19 00:16:58Z

1

Can this work?

\b[^\s<>]*?\.pdf\b

It works for your examples: https://regexr.com/43b8q

Update for your new requirement that no space exists between <FILENAME> and whatever.pdf:

Use: \b(?<![<>][\s]|\w)[\w-]*?\.pdf\b

example: https://regex101.com/r/O3kpQ4/2/
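
One caveat if the pattern is used from Python, which the r'...' literals in the question suggest: the standard re module only accepts lookbehinds of a fixed width, and the two alternatives in (?<![<>][\s]|\w) have different widths. A minimal sketch of an equivalent form that splits the alternation into two separate lookbehinds (the sample text is made up for illustration, and the dot before pdf is escaped here):

```python
import re

# Hypothetical sample text for illustration
text = "<FILENAME>report.pdf\n<FILENAME> spaced.pdf\nhello abc.pdf"

# Two fixed-width negative lookbehinds instead of one alternation:
#   (?<![<>]\s)  not preceded by "<" or ">" plus one whitespace character
#   (?<!\w)      not preceded by a word character
no_space_after_tag = re.compile(r"\b(?<![<>]\s)(?<!\w)[\w-]*?\.pdf\b")

hits = no_space_after_tag.findall(text)
print(hits)  # ['report.pdf', 'abc.pdf']
```

Like the original, this form still lets a match begin right after a hyphen, so "<FILENAME> abc-def.pdf" would yield def.pdf.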

edited Nov 19, 2018 at 0:16

answered Nov 18, 2018 at 21:15

Rocky Li


4 Comments

Saurabh Gokhale Over a year ago

Thanks. If I only want to match the first example case (and ignore the second), what changes do I need? Note that there must be no whitespace between "<FILENAME>" and the filename in example 1. If there is whitespace between "<FILENAME>" and the filename, I want to skip that match.

Rocky Li Over a year ago

Try this? \b(?<![<>][\s]|\w)[\w-]*?\.pdf\b

Saurabh Gokhale Over a year ago

Thanks, but in your updated link I want to ignore Example 2 because it doesn't have the <FILENAME> prefix.

Rocky Li Over a year ago

All of this should go into your question description. Now the regex string will be completely different, and people can't help you if you keep moving the goalposts. Here: \b(?<=<.*>)[\w-]*?\.pdf\b
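
Note that (?<=<.*>) is a variable-width lookbehind, which Python's built-in re module does not support. A sketch of one alternative, assuming Python: match the tag as part of the pattern and pull the filename out with a capture group. The tag pattern <[^<>]+> and the character class [\w.-] are assumptions for illustration, not part of the original comment:

```python
import re

# Hypothetical sample text for illustration
text = (
    "<FILENAME>any_valid_characters_filename.pdf\n"
    "hello this is a good file abc-def_xyz-1.pdf\n"
    "<FILENAME> spaced.pdf"
)

# Match any tag, then capture a filename that starts immediately after ">"
after_tag = re.compile(r"<[^<>]+>([\w.-]+\.pdf)\b")

names = after_tag.findall(text)
print(names)  # ['any_valid_characters_filename.pdf']
```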
