1

I am using this python library that implements the Aho-Corasick string searching algorithm that finds a set of patterns in a given string in one pass. The output is not what I am expecting:

In [4]: import ahocorasick
In [5]: import collections

In [6]: tree = ahocorasick.KeywordTree()

In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"

In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]

In [9]: for w in words:
   ...:     tree.add(w)
   ...:

In [10]: tree.make()

In [13]: final = collections.defaultdict(int)

In [15]: for match in tree.findall(ss, allow_overlaps=True):
   ....:     final[ss[match[0]:match[1]]] += 1
   ....:

In [16]: final
{   'the first sentence': 3, 'the first sentence is': 2}

The output I was expecting was this:

{ 
  'the first sentence': 3,
  'the first sentence is': 2,
  'first sentence': 3,
  'first sentence is': 2
}

Am I missing something? I am doing this on large strings so post processing is not my first option. Is there a way to get the desired output?

2 Answers 2

1

I don't know about the ahocorasick module, but those results seem suspect. The acora module shows this:

import acora
import collections

ss = "this is the first sentence in this book "
     "the first sentence is really the most interesting "
     "the first sentence is always first"

words = ["first sentence is", 
         "first sentence",
         "the first sentence",
         "the first sentence is"]

tree = acora.AcoraBuilder(*words).build()

for match in tree.findall(ss):
    result[match] += 1

Results:

>>> result
defaultdict(<type 'int'>, 
            {'the first sentence'   : 3,
             'first sentence'       : 3,
             'first sentence is'    : 2,
             'the first sentence is': 2})
Sign up to request clarification or add additional context in comments.

4 Comments

+1 Thank you. This agrees with my desired output. Do you have any experience with using this for large text by any chance? I mean, performance wise.
No direct experience with a large corpus, sorry. The PyPi page says it frees the GIL, and includes a 'fast' CPython' implementation, but beyond that I don't know.
Oh, you might also try esmre, culled from here.
Thank you. I can test them now :) Appreciate your help.
1

The way I understand the Aho-Corasick algorithm and the way I've implemented it would have me agree with your expected output. It looks like the Python library you're using is in error, or perhaps there's a flag that you can tell it to give you all matches starting at a position rather than just the longest match starting at a particular position.

The examples in the original paper, http://www.win.tue.nl/~watson/2R080/opdracht/p333-aho-corasick.pdf, support your understanding.

1 Comment

+1 Thank you for the informative answer. I was afraid this might be the case. That library seems to be pretty heavily used so I was just wondering why no one caught it before.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.