Python regex - find patterns in a file and put them in a list

Question

I am trying to use a regex to find all the matching patterns in a BibTeX file. The file looks like this:

bib_file = """
@article{Fu_2007_ssr,
doi = {10.1016/j.surfrep.2007.07.001}
}

@article{Shibuya_2007_apl,
 doi = {10.1063/1.2816907}
}
"""

My goal is to find every block that runs from @article to the closing } and put these blocks into a list. So my final list will look like this:

['@article{Fu_2007_ssr,\n  doi = {10.1016/j.surfrep.2007.07.001}\n   }',
 '@article{Shibuya_2007_apl,\n  doi = {10.1063/1.2816907}\n    }']

Currently, I have my code:

    import re

    rx_sequence = re.compile(r'(@article(.*)}\n)', re.DOTALL)
    article = rx_sequence.search(bib_file).group(1)

But article is a single string. How can I find each matched pattern and append it to a list?

articles= list(rx_sequence.finditer(bib_file))?

– Aran-Fey, Commented Aug 15, 2016 at 17:03
@Rawing Just tried that. Doesn't seem to work though

– Jianli Cheng, Commented Aug 15, 2016 at 17:06
Why not use a Python bibtexparser?

– Moses Koledoye, Commented Aug 15, 2016 at 17:07
@MosesKoledoye: Thanks for letting me know this package.

– Jianli Cheng, Commented Aug 15, 2016 at 17:17

Wiktor Stribiżew · Accepted Answer · 2016-08-15 17:13:48Z

1

You can match all these articles with

r"(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)"

(use it with the re.DOTALL flag so that . matches any character, including a newline).

Pattern details:

(@article.*?\n[ \t]*}[ \t]*) - Group 1 capturing a sequence of:
- @article - a literal text @article
- .*? - any zero or more chars, as few as possible, up to the first...
- \n[ \t]*}[ \t]* - newline, followed with 0+ spaces/tabs, } and again 0+ spaces/tabs and...
(?:\n|$) - either a newline (\n) or end of string ($).

Python demo:

import re
p = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)
s = "@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}"
print(p.findall(s))
# => ['@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}',
#     '@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}']

Note that unrolling the pattern as

@article.*(?:\n(?![ \t]*}[ \t]*(?:\n|$)).*)*\s*}

will make it more robust. This pattern is meant to be used without the re.DOTALL flag, since each .* has to stop at the end of its line.
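
As a minimal sketch of how the unrolled pattern could be applied, the snippet below runs it with findall on the same sample string as the Python demo above. The variable names are only illustrative. Because the pattern has no capturing groups, findall returns the whole matches.

```python
import re

# Unrolled pattern from the answer. re.DOTALL is deliberately NOT passed,
# so each .* stops at the end of its line.
unrolled = re.compile(r'@article.*(?:\n(?![ \t]*}[ \t]*(?:\n|$)).*)*\s*}')

s = ("@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n"
     "@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}")

# One string per @article entry, ending at the closing brace.
articles = unrolled.findall(s)
for a in articles:
    print(repr(a))
```

The negative lookahead stops the line-by-line loop as soon as the next line consists only of a closing }, which is then consumed by the trailing \s*}.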

edited Aug 15, 2016 at 17:13

answered Aug 15, 2016 at 17:06

Wiktor Stribiżew


5 Comments

Wiktor Stribiżew Over a year ago

See the unrolled version - it should not be used with re.DOTALL modifier, and will be more efficient.

Jianli Cheng Over a year ago

Your code works. So re.findall will automatically put the matched patterns into a list?

Wiktor Stribiżew Over a year ago

Yes, re.findall returns a list: the whole matches if the pattern defines no capturing groups, the captured strings if it defines exactly one group, or tuples of captured values if it defines several groups.
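
To illustrate the point about capturing groups, here is a small sketch with three patterns that are not from the thread; they were made up only to show the three return shapes of re.findall on the sample string.

```python
import re

s = ("@article{Fu_2007_ssr,\ndoi = {10.1016/j.surfrep.2007.07.001}\n}\n\n"
     "@article{Shibuya_2007_apl,\n doi = {10.1063/1.2816907}\n}")

# No capturing group: a list of the full matches.
no_groups = re.findall(r'@article\{\w+', s)

# Exactly one capturing group: a list of that group's text.
one_group = re.findall(r'@article\{(\w+)', s)

# Two capturing groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'@(\w+)\{(\w+)', s)

print(no_groups)
print(one_group)
print(two_groups)
```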

Jianli Cheng Over a year ago

It turns out r'(@article.*?\n[ \t]*})' also works in my case. Do you think it is robust enough?

Wiktor Stribiżew Over a year ago

I think you are using it with re.DOTALL. The lazy dot matching pattern is not as efficient as an unrolled pattern, since the unrolled pattern grabs the string in chunks, while .*? expands at each location in the string. Compared with the first pattern, yours is less precise since it does not check whether the trailing } is the only character on its line (my patterns check that).
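
The difference in precision can be seen with a hypothetical entry, invented here for illustration, in which a line starts with } but then continues. The names loose and strict are also just labels for the two patterns discussed in these comments.

```python
import re

# Hypothetical entry: the third line begins with "}" but is not the end
# of the entry.
entry = ("@article{Demo_2016,\n"
         "  title = {A {nested\n"
         "  } title},\n"
         "  doi = {10.1000/demo}\n"
         "}\n")

# Shorter pattern from the comment: stops at the first line starting with }.
loose = re.compile(r'(@article.*?\n[ \t]*})', re.DOTALL)

# Pattern from the answer: } must be alone on its line (apart from blanks).
strict = re.compile(r'(@article.*?\n[ \t]*}[ \t]*)(?:\n|$)', re.DOTALL)

loose_match = loose.findall(entry)[0]
strict_match = strict.findall(entry)[0]

print(repr(loose_match))   # truncated inside the title field
print(repr(strict_match))  # the complete entry
```

On input like the sample in the question, where every } at the start of a line closes an entry, both patterns return the same blocks.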

Moses Koledoye · Accepted Answer · 2016-08-15 17:18:37Z

1

Alternatively, you could use bibtexparser which saves you all the trouble:

>>> import bibtexparser
>>> bib_file = """
... @article{Fu_2007_ssr,
... doi = {10.1016/j.surfrep.2007.07.001}
... }
...
... @article{Shibuya_2007_apl,
...  doi = {10.1063/1.2816907}
... }
... """
>>> b = bibtexparser.loads(bib_file)
>>> b.entries
[{'ENTRYTYPE': 'article', 'ID': 'Fu_2007_ssr', 'doi': '10.1016/j.surfrep.2007.07.001'}, {'ENTRYTYPE': 'article', 'ID': 'Shibuya_2007_apl', 'doi': '10.1063/1.2816907'}]

There you have a list of the entries from the bib file, properly split, with each value mapped to its field name.

answered Aug 15, 2016 at 17:18

Moses Koledoye


6 Comments

Wiktor Stribiżew Over a year ago

However, the expected output is different from what OP needs.

Moses Koledoye Over a year ago

@WiktorStribiżew I figure they'll end up needing to split the contents of each article again, maybe with another regex. This library already deals with that.

AlwaysLearning Over a year ago

It is not a list, it's a dictionary, which lacks any information about the order of the fields.

Moses Koledoye Over a year ago

@AlwaysLearning That's a list of dictionaries, with each dictionary appearing in the order from the text.

AlwaysLearning Over a year ago

@MosesKoledoye That's what I meant. Each dictionary lacks information about the order of the fields.
