So lately I've been writing a Python script for extracting data from large text files (> 1 GB). The problem basically boils down to selecting lines of text from the file and searching each of them for strings from an array (this array can have as many as 1000 strings in it). The catch is that I have to find a specific occurrence of each string, and a string may appear an unlimited number of times in the file. Some decoding and encoding is also required, which slows the script down further. The code looks something like this:
strings = [a for a in open('file.txt')]   # the ~1000 search strings

with open("er.txt", "r") as f:
    for chunk in f:          # iterate over the big file line by line
        for s in strings:
            # do search, trimming, stripping ...
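To give a clearer picture of that inner part, it does roughly the following (a simplified sketch, not my exact code; the real trimming logic and the encoding handling are messier):

import io

strings = [line.strip() for line in open('file.txt', encoding='utf-8')]

with open('er.txt', 'rb') as f:
    for chunk in f:
        line = chunk.decode('utf-8', errors='ignore')   # the decoding step mentioned above
        for s in strings:
            pos = line.find(s)
            if pos == -1:
                continue
            # trim/strip around the match and record it
            hit = line[pos:].strip()
            break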
My question here is:
Is there a way to optimize this? I tried multiprocessing, but it helped little (or at least the way I implemented it). The problem is that these chunk operations aren't independent, and the strings list may be altered during one of them.
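For reference, my multiprocessing attempt was roughly along these lines (a stripped-down sketch, assuming the file is split into line batches up front; the shared strings list is exactly what breaks this):

import multiprocessing as mp

def search_batch(args):
    lines, strings = args
    hits = []
    for line in lines:
        for s in strings:
            if s in line:
                hits.append((s, line.strip()))
                break
    return hits

if __name__ == '__main__':
    strings = [line.strip() for line in open('file.txt', encoding='utf-8')]
    with open('er.txt', 'r', encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()

    batch_size = 100000
    batches = [(lines[i:i + batch_size], strings)
               for i in range(0, len(lines), batch_size)]

    with mp.Pool() as pool:
        results = pool.map(search_batch, batches)
    # each worker gets its own copy of strings, so a change made while
    # processing one batch is never seen by the others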
Any optimization would help (string search algorithms, file reading, etc.). I did as much as I could with loop breaking, but it still runs pretty slow.
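One idea I've been wondering about on the string-search side is collapsing all the search strings into a single compiled regex instead of looping over ~1000 strings per line. I haven't benchmarked it, so this is just an untested sketch:

import re

strings = [line.strip() for line in open('file.txt', encoding='utf-8')]

# one alternation pattern; re.escape keeps any special characters literal
pattern = re.compile('|'.join(re.escape(s) for s in strings))

with open('er.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        m = pattern.search(line)
        if m:
            # m.group() tells me which string matched, m.start() where
            hit = line[m.start():].strip()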