String searching library's outcome - bug or feature or my coding error?

Question

I am using a Python library that implements the Aho-Corasick string searching algorithm, which finds a set of patterns in a given string in one pass. The output is not what I expect:

In [4]: import ahocorasick
In [5]: import collections

In [6]: tree = ahocorasick.KeywordTree()

In [7]: ss = "this is the first sentence in this book the first sentence is really the most interesting the first sentence is always first"

In [8]: words = ["first sentence is", "first sentence", "the first sentence", "the first sentence is"]

In [9]: for w in words:
   ...:     tree.add(w)
   ...:

In [10]: tree.make()

In [13]: final = collections.defaultdict(int)

In [15]: for match in tree.findall(ss, allow_overlaps=True):
   ....:     final[ss[match[0]:match[1]]] += 1
   ....:

In [16]: final
{   'the first sentence': 3, 'the first sentence is': 2}

The output I was expecting was this:

{ 
  'the first sentence': 3,
  'the first sentence is': 2,
  'first sentence': 3,
  'first sentence is': 2
}

Am I missing something? I am running this on large strings, so post-processing is not my first option. Is there a way to get the desired output?

Lemur · Accepted Answer · 2011-11-11 22:35:25Z

1

I don't know about the ahocorasick module, but those results seem suspect. The acora module shows this:

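import acora
import collections

# Parentheses let the adjacent string literals span several lines.
ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")

words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]

tree = acora.AcoraBuilder(*words).build()

# findall() reports (keyword, position) pairs; count each keyword.
result = collections.defaultdict(int)
for keyword, position in tree.findall(ss):
    result[keyword] += 1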

Results:

>>> result
defaultdict(<type 'int'>, 
            {'the first sentence'   : 3,
             'first sentence'       : 3,
             'first sentence is'    : 2,
             'the first sentence is': 2})
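
If you want to double-check those counts without any third-party module, a brute-force pass with str.find is enough for a string this small. This is only a sanity check, not a replacement for Aho-Corasick on large inputs; count_overlapping is a helper name made up for this sketch.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one character later so overlapping hits are not skipped.
        start = text.find(pattern, start + 1)
    return count


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")

words = ["first sentence is",
         "first sentence",
         "the first sentence",
         "the first sentence is"]

# One independent scan per keyword: slow for many keywords, but easy to trust.
expected = {w: count_overlapping(ss, w) for w in words}
print(expected)
```

It scans the text once per keyword, so it is only useful for confirming what a proper multi-pattern search should return.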


4 Comments

Legend Over a year ago

+1 Thank you. This agrees with my desired output. Do you have any experience with using this for large text by any chance? I mean, performance wise.

Lemur Over a year ago

No direct experience with a large corpus, sorry. The PyPI page says it frees the GIL and includes a 'fast' CPython implementation, but beyond that I don't know.

Lemur Over a year ago

Oh, you might also try esmre, culled from here.

Legend Over a year ago

Thank you. I can test them now :) Appreciate your help.

Jim Mischel · Accepted Answer · 2011-11-11 21:59:11Z

1

My understanding of the Aho-Corasick algorithm, and the way I've implemented it myself, agrees with your expected output. It looks like the Python library you're using is in error, or perhaps there's a flag that tells it to return all matches starting at a position rather than just the longest match starting at that position.

The examples in the original paper, http://www.win.tue.nl/~watson/2R080/opdracht/p333-aho-corasick.pdf, support your understanding.
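
To illustrate what "all matches" means here, below is a minimal pure-Python sketch of the algorithm as I understand it: a trie, failure links, and output sets merged along those links so that every keyword ending at a position is reported. It is not the code of the ahocorasick module or of any other library, and the names build_automaton and find_all are made up for this example.

```python
from collections import defaultdict, deque


def build_automaton(patterns):
    """Build the goto, failure and output tables for a set of patterns."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # failure link for each state
    output = [[]]    # patterns recognised on reaching each state

    # Phase 1: insert every pattern into a trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)

    # Phase 2: breadth-first pass that sets failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the patterns that are proper suffixes of this state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, patterns):
    """Yield (start, end, pattern) for every match, overlaps included."""
    goto, fail, output = build_automaton(patterns)
    state = 0
    for end, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            yield end - len(pattern), end, pattern


ss = ("this is the first sentence in this book "
      "the first sentence is really the most interesting "
      "the first sentence is always first")
words = ["first sentence is", "first sentence",
         "the first sentence", "the first sentence is"]

counts = defaultdict(int)
for start, end, pattern in find_all(ss, words):
    counts[pattern] += 1
print(dict(counts))
```

The step that matters for your question is the output merge in phase 2: the state reached after "the first sentence" also reports "first sentence", because that shorter keyword is a suffix of it. An implementation that skips that step, or that keeps only the longest match, would produce the two-entry result you are seeing.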


1 Comment

Legend Over a year ago

+1 Thank you for the informative answer. I was afraid this might be the case. That library seems to be pretty heavily used so I was just wondering why no one caught it before.
