
Suppose I have a list of events, for example A, D, T, H, U, A, B, F, H, ...

I need to find frequent patterns that occur in the complete sequence. Traditional algorithms such as Apriori or FP-Growth do not apply here because they require separate itemsets, and I cannot break this stream into smaller sets.

Any idea which algorithm would work for me?


EDIT

For example, for the sequence A, D, T, H, U, A, D, T, H, T, H, U, A, H, T, H with min_support = 2, the frequent patterns will be:

Of length 1 --> [A, D, T, H, U]
Of length 2 --> [AD, DT, TH, HU, UA, HT]
Of length 3 --> [ADT, DTH, THU, HUA, HTH]
Of length 4 --> [ADTH, THUA]
No frequent patterns of length 5 or more
8
  • I think the question is far too broad, but as a first guess, you might want to have a look at iSAX Commented Oct 18, 2015 at 11:25
  • I just want to find frequent patterns of all lengths in that one large stream. I could not find anything on the Internet after searching a lot. Commented Oct 18, 2015 at 11:28
  • "String" compression algorithms try to capitalise on (at least locally) predictable non-uniformity in sequence probability. Commented Nov 9, 2015 at 9:20
  • @greybeard, I didn't follow you completely. Can you explain a little more, please? Commented Nov 9, 2015 at 16:03
  • As far as I remember, J.A. Storer was the one who introduced the no(ta)tion of "text contraction" using Original Pointer Macros (OPM), External Pointer Macros, the combination thereof, and Compressed Pointer Macros (EPM, OEPM, CPM) - the optimal use of all of which has been proven to be intractable. (Macro: (start, length)). Of the variations and restrictions, using original pointers in one direction only allowed a linear solution starting at the other end, with information about possible targets coming from a suffix tree. (It's been a couple of decades; a suffix array might be today's choice.) Commented Nov 10, 2015 at 9:08

4 Answers

2
+50

You can try the Aho-Corasick algorithm with a wildcard and/or just with all substrings. Aho-Corasick is basically a finite state machine: it needs a dictionary of patterns, but then it finds multiple patterns in the search string very fast. You can build the finite state machine with a trie and a breadth-first search. Here is a nice example with animation: http://blog.ivank.net/aho-corasick-algorithm-in-as3.html. So you basically need two steps: build the finite state machine, then search the string.
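The two steps can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: `countPatterns` and its return shape are made-up names, the candidate patterns are assumed to be non-empty and known in advance (Aho-Corasick counts given candidates, it does not discover them), and events are assumed to be single characters.

```javascript
// Minimal Aho-Corasick sketch: counts occurrences of each dictionary pattern in a text.
// countPatterns is an illustrative name, not part of any library.
function countPatterns(patterns, text) {
    // Each trie node holds outgoing edges, a failure link, and indexes of patterns ending here.
    const nodes = [{ next: new Map(), fail: 0, out: [] }];

    // Step 1a: insert every pattern into the trie.
    patterns.forEach((pattern, index) => {
        let state = 0;
        for (const symbol of pattern) {
            if (!nodes[state].next.has(symbol)) {
                nodes[state].next.set(symbol, nodes.length);
                nodes.push({ next: new Map(), fail: 0, out: [] });
            }
            state = nodes[state].next.get(symbol);
        }
        nodes[state].out.push(index);
    });

    // Step 1b: breadth-first search to compute failure links.
    const queue = [...nodes[0].next.values()]; // depth-1 nodes fail to the root
    for (let head = 0; head < queue.length; head++) {
        const current = queue[head];
        for (const [symbol, child] of nodes[current].next) {
            let fallback = nodes[current].fail;
            while (fallback !== 0 && !nodes[fallback].next.has(symbol)) {
                fallback = nodes[fallback].fail;
            }
            const target = nodes[fallback].next.get(symbol);
            nodes[child].fail = target !== undefined ? target : 0;
            // Inherit patterns that end at the failure target (they are suffixes of this node).
            nodes[child].out.push(...nodes[nodes[child].fail].out);
            queue.push(child);
        }
    }

    // Step 2: scan the text once, following failure links on mismatches.
    const counts = new Map(patterns.map((pattern) => [pattern, 0]));
    let state = 0;
    for (const symbol of text) {
        while (state !== 0 && !nodes[state].next.has(symbol)) {
            state = nodes[state].fail;
        }
        state = nodes[state].next.has(symbol) ? nodes[state].next.get(symbol) : 0;
        for (const index of nodes[state].out) {
            counts.set(patterns[index], counts.get(patterns[index]) + 1);
        }
    }
    return counts;
}

// Usage on the sequence from the question, with a few hand-picked candidates.
const patternCounts = countPatterns(['ADTH', 'THUA', 'HT', 'AH'], 'ADTHUADTHTHUAHTH');
console.log(patternCounts);
```

The scan itself is a single pass over the stream, so the cost that matters in practice is how many candidate patterns you put in the dictionary.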


1 Comment

It's very close to building a suffix tree for all the possible substrings, and then using that to check for patterns later. Actually, that is what I am considering.
0

You can generate all possible substrings, e.g.:

A
AD
ADT
ADTH
...
D
DT
DTH
...

Now the question is: does the order of elements within the smaller substrings matter?

If not, you can run standard association rule mining algorithms on them.

If yes, then the order matters in the whole sequence and its subsequences, which makes this a signal processing or time series problem. Even then, we can continue the analysis in the same way with all substrings, for example by matching them against each other with exact or fuzzy matching.

3 Comments

Won't that take a lot of time for a very big sequence? Just generating all possible substrings will take exponential time.
There are O(n^2) substrings, not exponentially many. I think it's feasible.
That seems feasible, but I need to store each sequence with its frequency of occurrence to select the optimal one.
0

That is a particular variation of frequent itemset mining, known as sequential pattern mining.

If you look for this topic, you will find literally dozens of algorithms.

There is GSP, SPADE, PrefixSpan, and many more.
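To be clear, the following is not GSP, SPADE, or PrefixSpan. It is only a rough sketch that borrows the level-wise pruning idea (a pattern of length k can only be frequent if its prefix of length k-1 is frequent) and applies it to one continuous sequence. It assumes that patterns must be contiguous, that overlapping occurrences are counted, and that events are single-character labels; `frequentPatterns` is a made-up name.

```javascript
// Level-wise mining of contiguous patterns in a single sequence (illustrative sketch).
// Returns an array where entry k - 1 is a Map(pattern -> count) for patterns of length k.
function frequentPatterns(sequence, minSupport) {
    const levels = [];
    // Start positions whose window of the previous length was a frequent pattern.
    let starts = sequence.map((_, i) => i);
    for (let length = 1; starts.length > 0; length++) {
        // Group the surviving start positions by the pattern of the current length.
        const positionsByPattern = new Map();
        for (const start of starts) {
            if (start + length > sequence.length) continue; // window runs off the end
            // Joining without a separator assumes single-character event labels.
            const pattern = sequence.slice(start, start + length).join('');
            if (!positionsByPattern.has(pattern)) positionsByPattern.set(pattern, []);
            positionsByPattern.get(pattern).push(start);
        }
        // Keep only the frequent patterns; only their positions are extended next round.
        const frequent = new Map();
        starts = [];
        for (const [pattern, positions] of positionsByPattern) {
            if (positions.length >= minSupport) {
                frequent.set(pattern, positions.length);
                starts.push(...positions);
            }
        }
        if (frequent.size > 0) levels.push(frequent);
    }
    return levels;
}

// Usage on the sequence from the question with min_support = 2.
const minedLevels = frequentPatterns('ADTHUADTHTHUAHTH'.split(''), 2);
minedLevels.forEach((level, i) => {
    console.log(`length ${i + 1}:`, [...level.keys()].join(', '));
});
```

Because infrequent patterns are dropped at each level, the number of positions still under consideration usually shrinks quickly, so far fewer than all n(n+1)/2 substrings are ever built when the support threshold is selective.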

3 Comments

One cannot use GSP or SPADE because they work on a database of sequences that are separate from one another, not on one big continuous sequence.
You could run it on n-grams of that sequence then, for example.
I didn't follow you. Can you elaborate a little by editing your answer?
0

Here's a simple algorithm (in JavaScript) that will generate a count of all substrings.

Keep a count of substring occurrences in a dictionary. Iterate over every possible substring in the stream, and if it is already in the dictionary, increment it, otherwise add it with a value of 1.

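// Count every contiguous substring of at least minimumSubstringLength characters.
const stream = 'FOOBARFOO';
const substrings = {};
const minimumSubstringLength = 2;

// `end` is the exclusive end index; `start` ranges over every valid start index.
for (let end = 1; end <= stream.length; end++) {
    for (let start = 0; start <= end - minimumSubstringLength; start++) {
        const substring = stream.substring(start, end);
        // Add the substring with a count of 1, or increment its existing count.
        substrings[substring] = (substrings[substring] || 0) + 1;
    }
}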

Then use a sorting algorithm to order the dictionary by its values.
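For the sorting step, something along these lines should do. `rankByCount` and its `minSupport` parameter are hypothetical names added for illustration, and the tie-breaking rule (longer substrings first, then alphabetical) is just one possible choice.

```javascript
// Hypothetical helper: turn a { substring: count } dictionary into a ranked list.
function rankByCount(counts, minSupport = 1) {
    return Object.entries(counts)
        // Drop substrings that occur fewer than minSupport times.
        .filter(([, count]) => count >= minSupport)
        // Highest count first; ties go to the longer substring, then alphabetical order.
        .sort(([a, countA], [b, countB]) =>
            countB - countA || b.length - a.length || a.localeCompare(b));
}

// A few of the counts that the loop above produces for 'FOOBARFOO'.
const exampleCounts = { FO: 2, OO: 2, FOO: 2, OB: 1, BAR: 1 };
const ranked = rankByCount(exampleCounts, 2);
console.log(ranked);
```

Filtering by a minimum count before sorting keeps the sorted list small, since most substrings of a long stream occur only once.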

1 Comment

Yes, that's already been suggested. But I want something more efficient than brute force.
