I am trying to use a regex to find all the matching entries in a BibTeX file. The file looks like this:

bib_file = """
@article{Fu_2007_ssr,
doi = {10.1016/j.surfrep.2007.07.001}
}

@article{Shibuya_2007_apl,
 doi = {10.1063/1.2816907}
}
"""

My goal is to find every block that runs from @article to its closing } and put these blocks into a list. So my final list would look like this:

['@article{Fu_2007_ssr,\n  doi = {10.1016/j.surfrep.2007.07.001}\n   }',
 '@article{Shibuya_2007_apl,\n  doi = {10.1063/1.2816907}\n    }']

Currently, I have my code:

    rx_sequence = re.compile(r'(@article(.*)}\n)', re.DOTALL)
    article = rx_sequence.search(bib_file).group(1)

But article is a single string. How can I find each match and append it to a list?

  • articles= list(rx_sequence.finditer(bib_file))? Commented Aug 15, 2016 at 17:03
  • @Rawing Just tried that. Doesn't seem to work though Commented Aug 15, 2016 at 17:06
  • Why not use a Python bibtexparser? Commented Aug 15, 2016 at 17:07
  • @MosesKoledoye: Thanks for letting me know this package. Commented Aug 15, 2016 at 17:17

2 Answers

You can match all these articles with

r"(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)"

Use it with the re.DOTALL flag so that . matches any character, including a newline.

Pattern details:

  • (@article.*?\n[ \t]*}[ \t]*) - Group 1 capturing a sequence of:
    • @article - a literal text @article
    • .*? - any zero or more chars, as few as possible, up to the first...
    • \n[ \t]*}[ \t]* - newline, followed with 0+ spaces/tabs, } and again 0+ spaces/tabs and...
  • (?:\n|$) - either a newline (\n) or end of string ($).

Python demo:

import re
p = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)
s = "@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}"
print(p.findall(s))
# => ['@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}',
#     '@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}']
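If the position of each entry is also needed, the same pattern can be used with re.finditer, which yields match objects rather than strings. This is only a minimal sketch on the same sample string; the variable names are illustrative:

```python
import re

p = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)
s = ("@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n"
     "@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}")

# finditer yields match objects: group(1) is the entry text,
# start(1)/end(1) give its position in the original string.
articles = []
for m in p.finditer(s):
    articles.append(m.group(1))
    print(m.start(1), m.end(1))

print(articles)
```

Wrapping the iterator in list() alone gives a list of match objects, not strings, which is why .group(1) is called on each match.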

Note that unrolling the pattern as

@article.*(?:\n(?![ \t]*}[ \t]*(?:\n|$)).*)*\s*}

will make it more robust. This version does not need the re.DOTALL flag and should be used without it.
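A short sketch of the unrolled pattern applied to the bib_file string from the question. The name p_unrolled is only illustrative. The pattern has no capturing groups, so findall returns the whole matches:

```python
import re

# Unrolled pattern; compiled without re.DOTALL on purpose.
p_unrolled = re.compile(r'@article.*(?:\n(?![ \t]*}[ \t]*(?:\n|$)).*)*\s*}')

bib_file = """
@article{Fu_2007_ssr,
doi = {10.1016/j.surfrep.2007.07.001}
}

@article{Shibuya_2007_apl,
 doi = {10.1063/1.2816907}
}
"""

# Each match runs from "@article" to the first line that holds only "}".
articles = p_unrolled.findall(bib_file)
print(articles)
```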

5 Comments

See the unrolled version - it should not be used with re.DOTALL modifier, and will be more efficient.
Your code works. So re.findall will automatically put the matched patterns into a list?
Yes. re.findall returns a list of the whole matches if the pattern has no capturing groups, a list of the captured strings if it has exactly one capturing group, and a list of tuples if it has several.
It turns out r'(@article.*?\n[ \t]*})' also works in my case. Do you think it is robust enough?
I think you are using it with re.DOTALL. The lazy dot pattern is not as efficient as the unrolled one: the unrolled pattern consumes the string in line-sized chunks, whereas .*? expands one character at a time and re-checks the rest of the pattern at each position. Compared with the first pattern, yours is also less precise, since it does not check that the trailing } is the only character on its line (my patterns check that).
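The three return shapes of re.findall mentioned in this thread can be seen on a toy string. This is a generic illustration, not tied to the BibTeX data, and the variable names are made up for the example:

```python
import re

text = "a1 b2"

# No capturing groups: a list of whole matches.
no_groups = re.findall(r"[a-z]\d", text)

# Exactly one capturing group: a list of that group's strings.
one_group = re.findall(r"([a-z])\d", text)

# Several capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"([a-z])(\d)", text)

print(no_groups, one_group, two_groups)
```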

Alternatively, you could use bibtexparser which saves you all the trouble:

>>> import bibtexparser
>>> bib_file = """
... @article{Fu_2007_ssr,
... doi = {10.1016/j.surfrep.2007.07.001}
... }
...
... @article{Shibuya_2007_apl,
...  doi = {10.1063/1.2816907}
... }
... """
>>> b = bibtexparser.loads(bib_file)
>>> b.entries
[{'ENTRYTYPE': 'article', 'ID': 'Fu_2007_ssr', 'doi': '10.1016/j.surfrep.2007.07.001'}, {'ENTRYTYPE': 'article', 'ID': 'Shibuya_2007_apl', 'doi': '10.1063/1.2816907'}]

There you have a list of the entries from the bib file, each properly split into a dictionary that maps field names to their values.

6 Comments

However, the expected output is different from what OP needs.
@WiktorStribiżew I figure they'll end up needing to split the contents of article again maybe with another regex. This library already deals with that
It is not a list, it's a dictionary, which lacks any information about the order of the fields.
@AlwaysLearning That's a list of dictionaries, with each dictionary appearing in the order from the text.
@MosesKoledoye That's what I meant. Each dictionary lacks information about the order of the fields.