
How can I parse a large file with regular expressions (using the re module) without loading the whole file into a string (or into memory)? Memory-mapped files don't help because their content can't be converted to some kind of lazy string, and the re module only accepts a string as its content argument.

#include <boost/format.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/regex.hpp>
#include <iostream>

int main(int argc, char* argv[])
{
    boost::iostreams::mapped_file fl("BigFile.log");
    //boost::regex expr("\\w+>Time Elapsed .*?$", boost::regex::perl);
    boost::regex expr("something useful");
    boost::match_flag_type flags = boost::match_default;
    boost::iostreams::mapped_file::iterator start, end;
    start = fl.begin();
    end = fl.end();
    boost::match_results<boost::iostreams::mapped_file::iterator> what;
    while(boost::regex_search(start, end, what, expr))
    {
        std::cout<<what[0].str()<<std::endl;
        start = what[0].second;
    }
    return 0;
}

To demonstrate my requirements, I wrote a short sample in C++ (with Boost) showing what I want to have in Python.

  • Unless you need multiline regexes, parse the file line by line. Commented Jul 26, 2012 at 17:06
  • Perhaps if you rephrased the question as to what you have and what you want to achieve, it'd give us a better opportunity to make suggestions - unless you're set on a particular approach. Commented Jul 26, 2012 at 17:08

3 Answers


Everything now works OK (Python 3.2.3 has some interface differences from Python 2.7). The search pattern just needs the b prefix to get a working solution (in Python 3.2.3).

import re
import mmap
import pprint

def ParseFile(fileName):
    f = open(fileName, "rb")  # binary mode: mmap works on raw bytes
    print("File opened successfully")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    print("File mapped successfully")
    # The pattern must be bytes, since mmap exposes a bytes-like buffer
    items = re.finditer(rb"\w+>Time Elapsed .*?\n", m)
    for item in items:
        pprint.pprint(item.group(0))
    m.close()
    f.close()

if __name__ == "__main__":
    ParseFile("testre")
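For reference, a self-contained variant of the above (the file contents here are made up for illustration, written to a temporary file) demonstrates that the pattern must be bytes when matching against an mmap in Python 3:

```python
import mmap
import os
import re
import tempfile

# Create a throwaway file to search (contents are illustrative only)
with tempfile.NamedTemporaryFile(mode="wb", delete=False, suffix=".log") as tmp:
    tmp.write(b"foo>Time Elapsed 00:01\nnoise\nbar>Time Elapsed 00:02\n")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # A str pattern here would raise TypeError; mmap is a bytes buffer
        matches = [mo.group(0) for mo in re.finditer(rb"\w+>Time Elapsed .*?\n", m)]

os.unlink(path)
print(matches)
```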

1 Comment

This is neat, since it allows usage of multi-line regular expressions.

It depends on what sort of parsing you're doing.

If the parsing you're doing is linewise, you can iterate over the lines of a file with:

with open("/some/path") as f:
    for line in f:
        parse(line)

Otherwise, you'll need to use something like chunking: read a chunk at a time and parse it. Obviously this requires being much more careful, in case what you're trying to match overlaps a chunk boundary.
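A rough sketch of that chunked approach (the function name, chunk size, and overlap are arbitrary assumptions, not from the question): the key invariant is that the overlap must be at least as long as the longest possible match, or a match straddling a chunk boundary can be missed.

```python
import io
import re

def iter_matches(f, pattern, chunk_size=1 << 16, overlap=256):
    """Yield non-overlapping matches of a bytes regex from a binary
    file object, reading chunk_size bytes at a time.

    overlap must be >= the longest match the pattern can produce,
    otherwise matches straddling a chunk boundary may be lost.
    """
    regex = re.compile(pattern)
    buf = b""
    eof = False
    while not eof:
        chunk = f.read(chunk_size)
        eof = not chunk
        buf += chunk
        # Matches starting in the last `overlap` bytes might continue
        # into the next chunk, so defer them until more data arrives
        limit = len(buf) if eof else max(len(buf) - overlap, 0)
        cut = limit
        for m in regex.finditer(buf):
            if m.start() >= limit:
                break
            yield m.group(0)
            cut = max(cut, m.end())
        buf = buf[cut:]  # keep only the deferred tail

# Demo on an in-memory "file"; the '-' filler is not matched by \w
data = (b"-" * 1000 + b"foo>Time Elapsed 00:01\n"
        + b"-" * 1000 + b"bar>Time Elapsed 00:02\n")
got = list(iter_matches(io.BytesIO(data), rb"\w+>Time Elapsed .*?\n",
                        chunk_size=64, overlap=64))
print(got)
```

Memory use stays bounded by roughly chunk_size + overlap, regardless of file size.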

1 Comment

Thanks, but I'm searching for patterns in the stream without checking line boundaries.

To elaborate on Julian's solution, you could achieve chunking (if you want to do multiline regexes) by storing and concatenating consecutive lines, like so:

list_prev_lines = []
# Pre-fill the window with the first N lines
for i in range(N):
    list_prev_lines.append(f.readline())
for line in f:
    # Slide the window forward by one line
    list_prev_lines.pop(0)
    list_prev_lines.append(line)
    parse("".join(list_prev_lines))

This will keep a running list of the previous N lines, the current line included, and then parse the multi-line group as a single string.
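A runnable version of the same idea (the function name, window size, and sample input are made up for illustration), using collections.deque so popping from the front is O(1):

```python
import io
from collections import deque

def iter_windows(f, n):
    """Yield each sliding window of n consecutive lines as one string."""
    window = deque(maxlen=n)  # oldest line falls off automatically
    for line in f:
        window.append(line)
        if len(window) == n:
            yield "".join(window)

# Demo on an in-memory file with four lines and a window of two
sample = io.StringIO("a\nb\nc\nd\n")
windows = list(iter_windows(sample, 2))
print(windows)
```

Unlike the snippet above, this also yields the very first window of n lines rather than starting at line n+1.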

1 Comment

Yes, but I don't know how many lines would be needed (in general), and this case is really just a sub-case of reading the whole file into memory. Instead I would like a general solution using memory-mapped files (because of their ease of use and good efficiency).
