How do I perform binary search on a text file to search a keyword in python?

Question

The text file contains two columns- index number(5 spaces) and characters(30 spaces). It is arranged in lexicographic order. I want to perform binary search to search for the keyword.

Is the line length constant for each line? You mention "spaces". Do you mean spaces between the values or "5 characters for the index number" and "30 characters for the data"? — Joachim Sauer
– Joachim Sauer, Commented Mar 7, 2011 at 9:03
yes..the line length is constant for each line...I mean the latter.. "5 characters for the index number" and "30 characters for the data"? — Apps
– Apps, Commented Mar 7, 2011 at 10:19
By what are the said "columns" separated ? This question because kriegar speaks of separating space. What is there between the two columns ? Are they touching without space ? What do you want to do after the keyword is found ? — eyquem
– eyquem, Commented Mar 7, 2011 at 11:04
Can you load the entire file into memory, or is it too large for that? — Björn Pollex
– Björn Pollex, Commented Mar 7, 2011 at 11:57

Mahmoud Abdelkader · Accepted Answer · 2011-03-07 11:56:02Z

14

Here's an interesting way to do it with Python's built-in bisect module.

import bisect
import os


class Query(object):

    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]


class FileSearcher(object):

    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)


if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)

        searchable_file = FileSearcher(file_to_search)
        print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)

edited Mar 7, 2011 at 11:56

answered Mar 7, 2011 at 11:43

Mahmoud Abdelkader

25.3k5 gold badges44 silver badges57 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

jfs Over a year ago

+1: it doesn't require to load the whole file in memory and binary search is justified in this case.

tzot Over a year ago

+1; if each “record” contains a newline too, the proposed record_size should be 36.

thebigjc Over a year ago

This works great - note that you get the last record rather than the first. If you want the first use bisect_left. You will also have to change the wrapper class to wrap the data in the file as it will switch the comparison

samplebias · Accepted Answer · 2011-03-07 11:39:28Z

6

Do you need do do a binary search? If not, try converting your flatfile into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:

import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()

In a separate script, run queries against it:

import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345

answered Mar 7, 2011 at 11:39

samplebias

38k6 gold badges110 silver badges104 bronze badges

1 Comment

thebigjc Over a year ago

Note that CDB doesn't support files greater than 4GB, which killed this option for me. Given modern architectures, if the file is < 4GB, you might as well just have it in RAM.

Community · Accepted Answer · 2017-05-23 12:08:08Z

If you need to find a single keyword in a file:

line_with_keyword = next((line for line in open('file') if keyword in line),None)
if line_with_keyword is not None: 
   print line_with_keyword # found

To find multiple keywords you could use set() as @kriegar suggested:

def extract_keyword(line):
    return line[5:35] # assuming keyword starts on 6 position and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation
    if keyword in keywords: # O(1) search
       print(keyword)

You could use dict() above instead of set() to preserve index information.

Here's how you could do a binary search on a text file:

import bisect

lines = open('file').readlines() # O(n) list creation
keywords = map(extract_keyword, lines) 
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
   print(lines[i]) # found

There is no advantage compared to the set() variant.

Note: all variants except the first one load the whole file in memory. FileSearcher() suggested by @Mahmoud Abdelkader don't require to load the whole file in memory.

MrTomRod · Accepted Answer · 2020-03-10 07:35:35Z

1

I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)

Installation: pip install binary_file_search

Example file:

1,one
2,two_a
2,two_b
3,three

Usage:

from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted()  # test if the file is sorted.
    print(bfs.search(2))

Result: [[2, 'two_a'], [2, 'two_b']]

edited Mar 10, 2020 at 7:35

answered Mar 10, 2020 at 7:22

MrTomRod

3642 silver badges16 bronze badges

Comments

user3101161 · Accepted Answer · 2013-12-20 21:52:05Z

It is quite possible, with a slight loss of efficiency to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range, and reading forward past the line terminator. Here's what I do to look for look thru a csv file with 2 header lines for a numeric in the first field. Give it an open file, and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line at offset zero will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers, and are skipped.

Please excuse my lack of polished python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.

Hope this helps

========================================

# See if the input loc is in file 
def look1(f,loc):
# Compute filesize of open file sent to us
hi = os.fstat(f.fileno()).st_size
lo=0
lookfor=int(loc)
# print "looking for: ",lookfor
while hi-lo > 1:
    # Find midpoint and seek to it
    loc = int((hi+lo)/2)
    # print " hi = ",hi," lo = ",lo
    # print "seek to: ",loc
    f.seek(loc)
    # Skip to beginning of line
    while f.read(1) != '\n':
        pass
    # Now skip past lines that are headers
    while 1:
        # read line
        line = f.readline()
        # print "read_line: ", line
        # Crude csv parsing, remove quotes, and split on ,
        row=line.replace('"',"")
        row=row.split(',')
        # Make sure 1st fields is numeric
        if row[0].isdigit():
            break
    s=int(row[0])
    if lookfor < s:
        # Split into lower half
        hi=loc
        continue
    if lookfor > s:
        # Split into higher half
        lo=loc
        continue
    return row  # Found
# If not found
return False

dting · Accepted Answer · 2011-03-08 09:47:49Z

-1

Consider using a set instead of a binary search for finding a keyword in your file.

Set:

O(n) to create, O(1) to find, O(1) to insert/delete

If your input file is separated by a space then:

f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()    

my_word in keywords
<returns True or False>

Dictionary:

f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in  [line.strip().split(" ") for line in f.readlines()] ] ) 
f.close()

keywords[my_word]
<returns index of my_word>

Binary Search is:

O(n log n) create, O(log n) lookup

edit: for your case of 5 characters and 30 characters you can just use string slicing

f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()

myword_ in keywords

or 

f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()

keywords[my_word]

edited Mar 8, 2011 at 9:47

answered Mar 7, 2011 at 9:53

dting

39.4k10 gold badges98 silver badges117 bronze badges

4 Comments

Apps Over a year ago

Thank you for the help! I tried to run this code but the list "keywords" has no elements . Its NULL.

dting Over a year ago

@Apps check my edit. Also check if the file is being correctly opened etc. for line in f.readlines(): print line

Apps Over a year ago

hey but isn't O(n) costlier than O(log n)?? I need to run this on a huge corpus.

dting Over a year ago

To find key in the dictionary or check if object is in the set it is O(1), O(n) is the creation of the set or dict, A binary tree is O(n log n) for creation.

Collectives™ on Stack Overflow

How do I perform binary search on a text file to search a keyword in python?

6 Answers 6

3 Comments

1 Comment

Comments

Comments

Comments

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

3 Comments

1 Comment

Comments

Comments

Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related