Usually when we search, we have a list of stories, we provide a search string, and we expect back the list of stories that the search string matches.

What I am looking to do is the opposite: given a list of search strings and one story, find out which search strings match that story.

This could be done with re, but I want to use complex search queries as supported by Solr. Full details of the query syntax here. Note: I won't use boost.

Basically, I would like some pointers for the doesitmatch function in the sample code below.

def doesitmatch(contents, searchstring):
    """
    returns result of searching contents for searchstring (True or False)
    """
    ???????
    ???????


story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']

matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]

Edit: I would also be interested to know whether any module exists to convert a Lucene query like the one below into a regex:

sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")

6 Answers

2

After extensive googling, I realized that what I am looking to do is a Boolean search.

I found code that makes regex Boolean-aware: http://code.activestate.com/recipes/252526/

Issue looks solved for now.
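To illustrate the general idea of Boolean logic in a regex, here is a minimal sketch that uses lookahead assertions. It is not the code from the linked recipe, and the helper names (term, all_of, any_of, matches) are made up for this example. It assumes case-insensitive, whole-word matching, and it builds the pattern by hand rather than parsing a query string.

```python
import re


def term(text):
    """Zero-width check that the word or phrase occurs somewhere ahead."""
    words = r"\s+".join(re.escape(w) for w in text.split())
    return r"(?=.*\b" + words + r"\b)"


def all_of(*parts):
    """AND: every lookahead must succeed from the same position."""
    return "".join(parts)


def any_of(*parts):
    """OR: plain alternation of the sub-patterns."""
    return "(?:" + "|".join(parts) + ")"


def matches(pattern, contents):
    """True if the Boolean pattern holds for contents (hypothetical helper)."""
    # DOTALL lets ".*" cross line breaks in a multi-line story.
    return re.match(pattern, contents, re.IGNORECASE | re.DOTALL) is not None


# Hand-built equivalent of:
# sajal AND (kayan OR bangkok OR Thailand OR (webmaster AND python) OR "is a jerk")
query = all_of(
    term("sajal"),
    any_of(
        term("kayan"),
        term("bangkok"),
        term("Thailand"),
        all_of(term("webmaster"), term("python")),
        term("is a jerk"),
    ),
)
```

Because every piece is zero-width, the pieces can be nested freely, and a pattern built once can be reused against many stories.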

0

Probably slow, but easy solution:

Query the search engine with the story plus each search string. If a query returns anything, that search string matches.

Otherwise you need to implement the search syntax yourself. If that includes things like "title:" field prefixes, this can get rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
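A recursive function of that kind could look roughly like the sketch below. It is only an illustration under some assumptions: it handles AND, OR, quoted phrases and brackets only, it matches case-insensitively on word boundaries, and it treats adjacent terms with no operator as OR (Lucene's default operator). The helper names tokenize, term_pattern and parse are invented here; only doesitmatch comes from the question.

```python
import re

# A quoted phrase, a bracket, or a bare word.
TOKEN_RE = re.compile(r'"([^"]*)"|([()])|([^\s()"]+)')


def tokenize(query):
    """Turn a query string into a list of (kind, value) tokens."""
    tokens = []
    for phrase, paren, word in TOKEN_RE.findall(query):
        if paren:
            tokens.append((paren, paren))
        elif word in ("AND", "OR"):
            tokens.append((word, word))
        elif word or phrase:
            tokens.append(("TERM", word or phrase))
    return tokens


def term_pattern(term):
    """Regex for the term's words in order, on word boundaries."""
    words = r"\s+".join(re.escape(w) for w in term.split())
    return re.compile(r"\b" + words + r"\b", re.IGNORECASE)


def parse(tokens):
    """Recursive-descent parse; returns a predicate over the story text."""
    pos = 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        if peek() is None:
            raise ValueError("unexpected end of query")
        kind, value = take()
        if kind == "(":
            inner = or_expr()
            if peek() != ")":
                raise ValueError("missing closing bracket")
            take()
            return inner
        if kind == "TERM":
            pattern = term_pattern(value)
            return lambda text: pattern.search(text) is not None
        raise ValueError("unexpected token: %r" % value)

    def and_expr():
        parts = [atom()]
        while peek() == "AND":
            take()
            parts.append(atom())
        return lambda text: all(p(text) for p in parts)

    def or_expr():
        parts = [and_expr()]
        # A term or bracket with no operator before it counts as OR.
        while peek() in ("OR", "TERM", "("):
            if peek() == "OR":
                take()
            parts.append(and_expr())
        return lambda text: any(p(text) for p in parts)

    result = or_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing token")
    return result


def doesitmatch(contents, searchstring):
    """Return True if contents satisfies the Boolean searchstring."""
    return parse(tokenize(searchstring))(contents)
```

Since parse returns a plain function, each search string can be parsed once and the resulting predicate reused for every incoming story, instead of re-parsing inside doesitmatch each time.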

1 Comment

The problem with using my search engine (Solr) for this is that the searchstrings list in the code above would hold tens of thousands of phrases. Hitting the search server tens of thousands of times per story would be very expensive. I'm not using anything complex, only AND, OR, quotes and brackets. I'm thinking of writing a function to convert the query to a regex, but given my limited regex skills I wanted to check whether such a function already exists in Python...
0

Some time ago I looked for a Python implementation of Lucene and came across Whoosh, a pure-Python text search engine. Maybe it will satisfy your needs.

You can also try PyLucene, but I didn't investigate that one.

0

Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.

def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s) # query_index returns an id iterable
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result 

3 Comments

The problem is that my index is Solr running on another server, and search_strings would contain more than 10,000 terms. Running that many queries would be expensive in terms of time and resources.
How often do the search strings change?
Several times a day (not fully decided yet, it's an upcoming project) ... but > once/hour
0

This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.

Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.

This has nothing to do with Python, though. You'd probably be better off writing something like this in Java.

2 Comments

Looks interesting. I am already using Solr (Lucene-based) for some things, so I'll see whether it can be made to use this. The reason I'd prefer Python is that I'm using it within a Django project. Moreover, I can't even write hello world in Java :)
I know this is an old question/comment, but people reading this may be interested to know that Elasticsearch provides this with its Percolation feature.
0

If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html
