
How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't help because their content can't be converted to some kind of lazy string, and the re module only accepts a string as its content argument.

#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}

To demonstrate my requirements, I wrote a short sample in C++ (with Boost) showing what I want to have in Python.

  • Unless you need multiline regexes, parse the file line by line. Commented Jul 26, 2012 at 17:06
  • Perhaps if you rephrased the question as to what you have and what you want to achieve, it'd give us a better opportunity to make suggestions - unless you're set on a particular approach. Commented Jul 26, 2012 at 17:08

3 Answers


Everything now works OK (Python 3.2.3 has some interface differences from Python 2.7). The search pattern just needs the b prefix to get a working solution (in Python 3.2.3).

import re
import mmap
import pprint

def ParseFile(fileName):
    f = open(fileName, "rb")  # binary mode: mmap works on raw bytes
    print("File opened successfully")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print("File mapped successfully")
    # The pattern must be bytes, since mmap exposes a bytes-like buffer
    items = re.finditer(rb"\w+>Time Elapsed .*?\n", m)
    for item in items:
        pprint.pprint(item.group(0))
    m.close()
    f.close()

if __name__ == "__main__":
    ParseFile("testre")
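For reference, a self-contained variant of the above (the file contents here are made up for illustration, written to a temporary file) demonstrates that the pattern must be bytes when matching against an mmap in Python 3:

```python
import mmap
import os
import re
import tempfile

# Create a throwaway file to search (contents are illustrative only)
with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=".log") as tmp:
    tmp.write(b"foo>Time Elapsed 00:01\nnoise\nbar>Time Elapsed 00:02\n")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # A str pattern here would raise TypeError; mmap is a bytes buffer
        matches = [mo.group(0) for mo in re.finditer(rb"\w+>Time Elapsed .*?\n", m)]

os.unlink(path)
print(matches)
```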

1 Comment

This is neat, since it allows usage of multi-line regular expressions.

It depends on what sort of parsing you're doing.

If the parsing you're doing is linewise, you can iterate over the lines of a file with:

with open("/some/path") as f:
    for line in f:
        parse(line)

Otherwise, you'll need to use something like chunking: read a chunk at a time and parse it. Obviously this requires being much more careful, in case what you're trying to match overlaps a chunk boundary.
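A rough sketch of that chunked approach (the function name, chunk size, and overlap are arbitrary assumptions, not from the question): the key invariant is that the overlap must be at least as long as the longest possible match, or a match straddling a chunk boundary can be missed.

```python
import io
import re

def iter_matches(f, pattern, chunk_size=1 << 16, overlap=256):
    """Yield non-overlapping matches of a bytes regex from a binary
    file object, reading chunk_size bytes at a time.

    overlap must be >= the longest match the pattern can produce,
    otherwise matches straddling a chunk boundary may be lost.
    """
    regex = re.compile(pattern)
    buf = b""
    eof = False
    while not eof:
        chunk = f.read(chunk_size)
        eof = not chunk
        buf += chunk
        # Matches starting in the last `overlap` bytes might continue
        # into the next chunk, so defer them until more data arrives
        limit = len(buf) if eof else max(len(buf) - overlap, 0)
        cut = limit
        for m in regex.finditer(buf):
            if m.start() >= limit:
                break
            yield m.group(0)
            cut = max(cut, m.end())
        buf = buf[cut:]  # keep only the deferred tail

# Demo on an in-memory "file"; the '-' filler is not matched by \w
data = (b"-" * 1000 + b"foo>Time Elapsed 00:01\n"
        + b"-" * 1000 + b"bar>Time Elapsed 00:02\n")
got = list(iter_matches(io.BytesIO(data), rb"\w+>Time Elapsed .*?\n",
                        chunk_size=64, overlap=64))
print(got)
```

Memory use stays bounded by roughly chunk_size + overlap, regardless of file size.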

1 Comment

Thanks, but I'm searching for patterns in the stream without checking line boundaries.

To elaborate on Julian's solution, you could achieve chunking (if you want to do multiline regexes) by storing and concatenating consecutive lines, like so:

list_prev_lines = []
# Pre-fill the window with the first N lines
for i in range(N):
    list_prev_lines.append(f.readline())
for line in f:
    # Slide the window forward by one line
    list_prev_lines.pop(0)
    list_prev_lines.append(line)
    parse("".join(list_prev_lines))

This will keep a running list of the previous N lines, the current line included, and then parse the multi-line group as a single string.
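A runnable version of the same idea (the function name, window size, and sample input are made up for illustration), using collections.deque so popping from the front is O(1):

```python
import io
from collections import deque

def iter_windows(f, n):
    """Yield each sliding window of n consecutive lines as one string."""
    window = deque(maxlen=n)  # oldest line falls off automatically
    for line in f:
        window.append(line)
        if len(window) == n:
            yield "".join(window)

# Demo on an in-memory file with four lines and a window of two
sample = io.StringIO("a\nb\nc\nd\n")
windows = list(iter_windows(sample, 2))
print(windows)
```

Unlike the snippet above, this also yields the very first window of n lines rather than starting at line n+1.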

1 Comment

Yes, but I don't know how many lines would be needed (in general), and this case is really just a sub-case of reading the whole file into memory. Instead I would like a general solution using memory-mapped files (because of their ease of use and good efficiency).
