1

I have a human dictionary file, eng.dic, that looks like this (imagine that there are close to a billion words in the list), and I have to run different word queries against it quite often.

apple
pear
foo
bar
foo bar
dictionary
sentence

Say I have a string, "foo-bar". Is there a better (more efficient) way of searching through that file to see whether it exists? If it exists, report that; if it doesn't, append it to the dictionary file.

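query = "foo-bar"
# 'a+' allows reading and guarantees that writes go to the end of the file
with open('eng.dic', 'a+', encoding='utf8') as dic_file:
    dic_file.seek(0)  # 'a+' starts positioned at the end, so rewind before reading
    # Strip newlines and turn spaces into hyphens, so "foo bar" matches "foo-bar"
    en_dic = [line.strip().replace(" ", "-") for line in dic_file]
    if query in en_dic:
        print("exists")
    else:
        dic_file.write(query + "\n")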

Are there any built-in search functions in Python, or any libraries that I can import, to run such searches without much overhead?

  • I doubt you'd be able to do better than an implementation like the one you have if you are just doing this with one word. But if you were going to loop through and perform this function many times, you could potentially store the strings in a way that allowed more efficient lookup. A very simple example would be keeping the list sorted. Commented Sep 17, 2012 at 6:03
  • A billion words? Really? You will run out of English words at about a million. Commented Sep 17, 2012 at 6:08
  • @wim, not true. Consider "foo" as one word, "bar" as one word, and "foo bar" as a different word. So the word list is pretty much limitless in some sense, but restricted to whatever data input I have. Currently it's a billion-word corpus, so I've listed the worst-case scenario. Commented Sep 17, 2012 at 6:12
  • Can you change the representation? A shelve, perhaps, or a sqlite3 database? Commented Sep 17, 2012 at 6:12
  • @2er0: The point is that the problem, although it looks different, is very similar (I would even say his problem was more complex, but to solve yours you need to use the same solution as a base). Going through the file every time you want to check whether something exists is not a good idea, unless you know what you are doing. If you store the words in a database, you will get a much more flexible and efficient solution (the data will still be stored in a file, but you will be able to use SQLite's efficient lookup mechanism). Just index the file and use the database for checks. Commented Sep 17, 2012 at 6:26

3 Answers

2

As I already mentioned, going through the whole file when its size is significant is not a good idea. Instead, you should use established solutions and:

  1. index the words in the document,
  2. store the results of indexing in an appropriate form (I suggest a database),
  3. check whether the word exists in the file (by checking the database),
  4. if it does not exist, add it to both the file and the database.

Storing the data in a database is a lot more efficient than trying to reinvent the wheel. If you use SQLite, the database is also just a file, so the setup procedure is minimal.

So again, I am proposing storing the words in an SQLite database, querying it when you want to check whether a word exists in the file, and updating it whenever you add a word.
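The steps above could be sketched roughly as follows with Python's standard `sqlite3` module. This is only a minimal illustration, not a tuned implementation: the table name `words`, the helper names `open_dictionary`, `load_words` and `check_or_add`, and the in-memory database are assumptions made for the example, and it updates only the database (appending to the original text file, step 4, is left out).

```python
import sqlite3

def open_dictionary(db_path=":memory:"):
    """Open (or create) the word database. PRIMARY KEY gives an index on word."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
    return conn

def load_words(conn, lines):
    """Bulk-load entries, turning spaces into hyphens as in the question."""
    rows = ((line.strip().replace(" ", "-"),) for line in lines if line.strip())
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR IGNORE INTO words (word) VALUES (?)", rows)

def check_or_add(conn, query):
    """Return True if query is already stored; otherwise insert it and return False."""
    found = conn.execute("SELECT 1 FROM words WHERE word = ?", (query,)).fetchone()
    if found:
        return True
    with conn:
        conn.execute("INSERT INTO words (word) VALUES (?)", (query,))
    return False

conn = open_dictionary()
load_words(conn, ["apple\n", "pear\n", "foo\n", "bar\n", "foo bar\n"])
print(check_or_add(conn, "foo-bar"))  # True: "foo bar" was stored as "foo-bar"
print(check_or_add(conn, "quux"))     # False: it was missing and is now added
print(check_or_add(conn, "quux"))     # True: found on the second lookup
```

For a real dictionary you would pass a file path instead of `":memory:"` and feed `load_words` the open file object once, so that later lookups go through the index rather than scanning the text file.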

To read more on the solution see answers to this question:

The most efficient way to index words in a document

0

The most efficient way depends on the operation you will perform most frequently on this dictionary.

If you need to read the file each time, you can loop over it line by line until you either find your word or reach the end of the file. This is necessary if you have several concurrent workers that can update the file at the same time.

If you don't need to read the file each time (e.g., you have only one process that works with the dictionary), you can definitely write a more efficient implementation:

  1. read all lines into a set (instead of a list),
  2. for each "new" word, perform both actions: add it to the set and write it to the file.
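A minimal sketch of the single-process variant might look like this. The class name `WordFile` and the method `check_or_add` are made up for the example, and it assumes that the whole word list fits in memory and that the file ends with a newline, which may not hold for a billion-entry corpus.

```python
import os
import tempfile

class WordFile:
    """Hold the dictionary in a set for fast lookups; append new words to the file."""

    def __init__(self, path):
        self.path = path
        self.words = set()
        if os.path.exists(path):
            with open(path, "r", encoding="utf8") as f:
                for line in f:
                    # Same normalization as in the question: spaces become hyphens
                    word = line.strip().replace(" ", "-")
                    if word:
                        self.words.add(word)

    def check_or_add(self, query):
        """Return True if query is known; otherwise record it and return False."""
        if query in self.words:
            return True
        self.words.add(query)
        with open(self.path, "a", encoding="utf8") as f:
            f.write(query + "\n")
        return False

# Demo on a throwaway file rather than a real dictionary
path = os.path.join(tempfile.mkdtemp(), "eng.dic")
with open(path, "w", encoding="utf8") as f:
    f.write("apple\npear\nfoo bar\n")

dic = WordFile(path)
print(dic.check_or_add("foo-bar"))  # True: "foo bar" was loaded as "foo-bar"
print(dic.check_or_add("quux"))     # False: added to the set and appended to the file
```

The file is read once at startup; every later query is a set membership test, and only genuinely new words touch the disk.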

0

If it is a "pretty large" file, then access the lines sequentially and don't read the whole file into memory:

with open('largeFile', 'r') as inF:
    for line in inF:
        if 'myString' in line:
            pass  # do something with the matching line

1 Comment

But I have to access the dictionary quite often, so sequential search is certainly out of the question.
