12

The text file contains two columns- index number(5 spaces) and characters(30 spaces). It is arranged in lexicographic order. I want to perform binary search to search for the keyword.

4
  • Is the line length constant for each line? You mention "spaces". Do you mean spaces between the values or "5 characters for the index number" and "30 characters for the data"? Commented Mar 7, 2011 at 9:03
  • yes..the line length is constant for each line...I mean the latter.. "5 characters for the index number" and "30 characters for the data"? Commented Mar 7, 2011 at 10:19
  • By what are the said "columns" separated ? This question because kriegar speaks of separating space. What is there between the two columns ? Are they touching without space ? What do you want to do after the keyword is found ? Commented Mar 7, 2011 at 11:04
  • Can you load the entire file into memory, or is it too large for that? Commented Mar 7, 2011 at 11:57

6 Answers 6

14

Here's an interesting way to do it with Python's built-in bisect module.

import bisect
import os


class Query(object):

    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]


class FileSearcher(object):

    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)


if __name__ == '__main__':
    with open('data.dat') as file_to_search:
        query = raw_input('Query: ')
        wrapped_query = Query(query)

        searchable_file = FileSearcher(file_to_search)
        print "Located @ line: ", bisect.bisect(searchable_file, wrapped_query)
Sign up to request clarification or add additional context in comments.

3 Comments

+1: it doesn't require to load the whole file in memory and binary search is justified in this case.
+1; if each “record” contains a newline too, the proposed record_size should be 36.
This works great - note that you get the last record rather than the first. If you want the first use bisect_left. You will also have to change the wrapper class to wrap the data in the file as it will switch the comparison
6

Do you need do do a binary search? If not, try converting your flatfile into a cdb (constant database). This will give you very speedy hash lookups to find the index for a given word:

import cdb

# convert the corpus file to a constant database one time
db = cdb.cdbmake('corpus.db', 'corpus.db_temp')
for line in open('largecorpus.txt', 'r'):
    index, word = line.split()
    db.add(word, index)
db.finish()

In a separate script, run queries against it:

import cdb
db = cdb.init('corpus.db')
db.get('chaos')
12345

1 Comment

Note that CDB doesn't support files greater than 4GB, which killed this option for me. Given modern architectures, if the file is < 4GB, you might as well just have it in RAM.
2

If you need to find a single keyword in a file:

line_with_keyword = next((line for line in open('file') if keyword in line),None)
if line_with_keyword is not None: 
   print line_with_keyword # found

To find multiple keywords you could use set() as @kriegar suggested:

def extract_keyword(line):
    return line[5:35] # assuming keyword starts on 6 position and has length 30

with open('file') as f:
    keywords = set(extract_keyword(line) for line in f) # O(n) creation
    if keyword in keywords: # O(1) search
       print(keyword)

You could use dict() above instead of set() to preserve index information.

Here's how you could do a binary search on a text file:

import bisect

lines = open('file').readlines() # O(n) list creation
keywords = map(extract_keyword, lines) 
i = bisect.bisect_left(keywords, keyword) # O(log(n)) search
if keyword == keywords[i]:
   print(lines[i]) # found

There is no advantage compared to the set() variant.

Note: all variants except the first one load the whole file in memory. FileSearcher() suggested by @Mahmoud Abdelkader don't require to load the whole file in memory.

Comments

1

I wrote a simple Python 3.6+ package that can do this. (See its github page for more information!)

Installation: pip install binary_file_search

Example file:

1,one
2,two_a
2,two_b
3,three

Usage:

from binary_file_search.BinaryFileSearch import BinaryFileSearch
with BinaryFileSearch('example.file', sep=',', string_mode=False) as bfs:
    # assert bfs.is_file_sorted()  # test if the file is sorted.
    print(bfs.search(2))

Result: [[2, 'two_a'], [2, 'two_b']]

Comments

0

It is quite possible, with a slight loss of efficiency to perform a binary search on a sorted text file with records of unknown length, by repeatedly bisecting the range, and reading forward past the line terminator. Here's what I do to look for look thru a csv file with 2 header lines for a numeric in the first field. Give it an open file, and the first field to look for. It should be fairly easy to modify this for your problem. A match on the very first line at offset zero will fail, so this may need to be special-cased. In my circumstance, the first 2 lines are headers, and are skipped.

Please excuse my lack of polished python below. I use this function, and a similar one, to perform GeoCity Lite latitude and longitude calculations directly from the CSV files distributed by Maxmind.

Hope this helps

========================================

# See if the input loc is in file 
def look1(f,loc):
# Compute filesize of open file sent to us
hi = os.fstat(f.fileno()).st_size
lo=0
lookfor=int(loc)
# print "looking for: ",lookfor
while hi-lo > 1:
    # Find midpoint and seek to it
    loc = int((hi+lo)/2)
    # print " hi = ",hi," lo = ",lo
    # print "seek to: ",loc
    f.seek(loc)
    # Skip to beginning of line
    while f.read(1) != '\n':
        pass
    # Now skip past lines that are headers
    while 1:
        # read line
        line = f.readline()
        # print "read_line: ", line
        # Crude csv parsing, remove quotes, and split on ,
        row=line.replace('"',"")
        row=row.split(',')
        # Make sure 1st fields is numeric
        if row[0].isdigit():
            break
    s=int(row[0])
    if lookfor < s:
        # Split into lower half
        hi=loc
        continue
    if lookfor > s:
        # Split into higher half
        lo=loc
        continue
    return row  # Found
# If not found
return False

Comments

-1

Consider using a set instead of a binary search for finding a keyword in your file.

Set:

O(n) to create, O(1) to find, O(1) to insert/delete

If your input file is separated by a space then:

f = open('file')
keywords = set( (line.strip().split(" ")[1] for line in f.readlines()) )
f.close()    

my_word in keywords
<returns True or False>

Dictionary:

f = open('file')
keywords = dict( [ (pair[1],pair[0]) for pair in  [line.strip().split(" ") for line in f.readlines()] ] ) 
f.close()

keywords[my_word]
<returns index of my_word>

Binary Search is:

O(n log n) create, O(log n) lookup

edit: for your case of 5 characters and 30 characters you can just use string slicing

f = open('file')
keywords = set( (line[5:-1] for line in f.readlines()) )
f.close()

myword_ in keywords

or 

f = open('file')
keywords = dict( [(line[5:-1],line[:5]) for line in f.readlines()] )
f.close()

keywords[my_word]

4 Comments

Thank you for the help! I tried to run this code but the list "keywords" has no elements . Its NULL.
@Apps check my edit. Also check if the file is being correctly opened etc. for line in f.readlines(): print line
hey but isn't O(n) costlier than O(log n)?? I need to run this on a huge corpus.
To find key in the dictionary or check if object is in the set it is O(1), O(n) is the creation of the set or dict, A binary tree is O(n log n) for creation.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.