Pattern in continuous sequence data

Question

Suppose I have a list of events, for example A, D, T, H, U, A, B, F, H, ...

What I need is to find frequent patterns that occur in the complete sequence. Traditional algorithms such as Apriori or FP-Growth do not apply here because they require separate itemsets, and I cannot break this stream into smaller sets.

Any idea which algorithm would work for me?

EDIT

For example, for the sequence A, D, T, H, U, A, D, T, H, T, H, U, A, H, T, H with min_support = 2, the frequent patterns will be:

Of length 1 --> [A, D, T, H, U]
Of length 2 --> [AD, DT, TH, HU, UA, HT]
Of length 3 --> [ADT, DTH, THU, HUA, HTH]
Of length 4 --> [ADTH, THUA]
No frequent patterns of length 5 or more
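The listing above can be checked with a small level-wise sketch. It assumes that a "pattern" is a contiguous run of events, that overlapping occurrences each count, and that events are single-character labels so that concatenating them is unambiguous. The name `frequentPatterns` and its parameters are illustrative, not taken from the question.

```javascript
// Sketch: level-wise mining of frequent contiguous patterns in ONE sequence.
// Assumes single-character event labels; `frequentPatterns` is a hypothetical name.
function frequentPatterns(events, minSupport) {
    const levels = [];

    // Level 1 candidates: every distinct event with its start positions.
    let candidates = new Map();
    events.forEach((event, i) => {
        if (!candidates.has(event)) candidates.set(event, []);
        candidates.get(event).push(i);
    });

    for (let length = 1; candidates.size > 0; length++) {
        // Keep only patterns that start at minSupport or more positions.
        const frequent = [...candidates].filter(([, starts]) => starts.length >= minSupport);
        if (frequent.length === 0) break;
        levels.push(frequent.map(([pattern]) => pattern));

        // Extend each frequent pattern by one event to the right.
        // Infrequent patterns are never extended (the Apriori pruning idea).
        candidates = new Map();
        for (const [pattern, starts] of frequent) {
            for (const start of starts) {
                const end = start + length;
                if (end >= events.length) continue;
                const longer = pattern + events[end];
                if (!candidates.has(longer)) candidates.set(longer, []);
                candidates.get(longer).push(start);
            }
        }
    }
    return levels; // levels[k - 1] holds the frequent patterns of length k
}

const exampleEvents = 'ADTHUADTHTHUAHTH'.split('');
console.log(frequentPatterns(exampleEvents, 2));
```

On the example sequence with min_support = 2 this sketch reports four levels and nothing of length 5. Note that it counts HTH as frequent at length 3, since H, T, H appears twice in the sequence.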

I think the question is far too broad, but as a first guess, you might want to have a look at iSAX.
– Marco13, Commented Oct 18, 2015 at 11:25
I just want to find frequent patterns of all lengths in that one large stream. I could not find anything on the Internet after searching a lot.
– Haris, Commented Oct 18, 2015 at 11:28
"String" compression algorithms try to capitalise on (at least locally) predictable non-uniformity in sequence probability.
– greybeard, Commented Nov 9, 2015 at 9:20
@greybeard, I didn't quite follow. Can you explain a little more, please?
– Haris, Commented Nov 9, 2015 at 16:03
As far as I remember, J.A. Storer was the one introducing the no(ta)tion of "text contraction" using Original Pointer Macros (OPM), External Pointer Macros (EPM), the combination thereof (OEPM), and Compressed Pointer Macros (CPM), the optimal use of all of which has been proven to be intractable. (A macro is a (start, length) pair.) Of the variations and restrictions, using original pointers in one direction only allowed a linear solution starting at the other end, with information about possible targets coming from a suffix tree. (It's been a couple of decades; a suffix array might be today's choice.)
– greybeard, Commented Nov 10, 2015 at 9:08

Cybercartel · Accepted Answer · 2015-11-16 10:27:06Z

2

+50

You can try the Aho-Corasick algorithm with a wildcard and/or just with all substrings. Aho-Corasick is basically a finite state machine: it needs a dictionary, but it then finds multiple patterns in the search string very fast. You can build the finite state machine with a trie and a breadth-first search. Here is a nice example with animation: http://blog.ivank.net/aho-corasick-algorithm-in-as3.html. So you basically need two steps: build the finite state machine, then search the string.
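The two steps can be sketched roughly as follows. This is a minimal version that assumes the candidate patterns are already known and that events are single characters; it leaves out the wildcard variant. The names `buildAutomaton` and `countMatches` are illustrative.

```javascript
// Sketch of Aho-Corasick: a trie plus failure links computed breadth-first.
// `buildAutomaton` and `countMatches` are hypothetical names.
function buildAutomaton(patterns) {
    // Each node: outgoing edges, failure link, and the patterns that end here.
    const nodes = [{ next: new Map(), fail: 0, out: [] }];
    for (const pattern of patterns) {
        let state = 0;
        for (const symbol of pattern) {
            if (!nodes[state].next.has(symbol)) {
                nodes[state].next.set(symbol, nodes.length);
                nodes.push({ next: new Map(), fail: 0, out: [] });
            }
            state = nodes[state].next.get(symbol);
        }
        nodes[state].out.push(pattern);
    }

    // Breadth-first search from the root's children, whose failure link is the root.
    const queue = [...nodes[0].next.values()];
    for (let head = 0; head < queue.length; head++) {
        const state = queue[head];
        for (const [symbol, child] of nodes[state].next) {
            // Follow failure links until some state has an edge for this symbol.
            let f = nodes[state].fail;
            while (f !== 0 && !nodes[f].next.has(symbol)) f = nodes[f].fail;
            const target = nodes[f].next.get(symbol);
            nodes[child].fail = target !== undefined ? target : 0;
            // Inherit patterns that end at the failure state (proper suffixes).
            nodes[child].out = nodes[child].out.concat(nodes[nodes[child].fail].out);
            queue.push(child);
        }
    }
    return nodes;
}

function countMatches(nodes, text) {
    const counts = new Map();
    let state = 0;
    for (const symbol of text) {
        while (state !== 0 && !nodes[state].next.has(symbol)) state = nodes[state].fail;
        state = nodes[state].next.has(symbol) ? nodes[state].next.get(symbol) : 0;
        for (const pattern of nodes[state].out) {
            counts.set(pattern, (counts.get(pattern) || 0) + 1);
        }
    }
    return counts;
}

const automaton = buildAutomaton(['AD', 'TH', 'HTH', 'THUA']);
console.log(countMatches(automaton, 'ADTHUADTHTHUAHTH'));
```

For the question's example sequence this counts AD twice, TH four times, HTH twice and THUA twice in a single pass. The automaton only counts patterns it is given, so the dictionary of candidates still has to come from somewhere else.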

edited Nov 16, 2015 at 10:27

answered Nov 12, 2015 at 19:48

Cybercartel


1 Comment

Haris Over a year ago

It's very close to building a suffix tree for all the possible substrings and then using it to check for patterns later. Actually, that is what I am considering.

dimm · Accepted Answer · 2015-10-18 11:48:59Z

0

You can generate all possible substrings, e.g.:

A
AD
ADT
ADTH
...
D
DT
DTH
...

Now the question is whether the order of elements within the smaller substrings matters.

If not, you can try running standard association mining algorithms.

If yes, then the order matters in the whole sequence and its subsequences, which makes this a signal processing or time series problem. But even if the order matters, we can continue analyzing this way with all substrings, for example by matching them exactly or fuzzily.

answered Oct 18, 2015 at 11:48

dimm


3 Comments

Haris Over a year ago

Won't that take a lot of time for a very big sequence? Just generating all possible substrings will take exponential time.

dimm Over a year ago

There are only O(n^2) substrings, so I think it's feasible.

Haris Over a year ago

That seems feasible, but I need to store each sequence with its frequency of occurrence to select the optimal one.

Has QUIT--Anony-Mousse · Accepted Answer · 2015-10-18 14:58:31Z

0

That is a particular variation of frequent itemset mining, known as sequential pattern mining.

If you look for this topic, you will find literally dozens of algorithms.

There is GSP, SPADE, PrefixSpan, and many more.

answered Oct 18, 2015 at 14:58

Has QUIT--Anony-Mousse


3 Comments

Haris Over a year ago

One cannot use GSP or SPADE because they work on a set of existing sequences that are separate from one another, not on one big continuous sequence.

Has QUIT--Anony-Mousse Over a year ago

You could run it on n-grams of that sequence then, for example.

Haris Over a year ago

I didn't follow. Can you elaborate a little by editing your answer?
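One possible reading of the n-gram suggestion, offered as an assumption since the answer does not elaborate: slide a window of n events along the stream so that each window becomes its own short sequence, and hand that collection to a sequential pattern miner. The helper `ngrams` below is an illustrative name, not part of any library.

```javascript
// Sketch: turn one long sequence into overlapping windows of n events.
// `ngrams` is a hypothetical helper.
function ngrams(events, n) {
    const windows = [];
    // The last window starts at events.length - n.
    for (let i = 0; i + n <= events.length; i++) {
        windows.push(events.slice(i, i + n));
    }
    return windows;
}

// Each window can then be treated as a separate input sequence.
console.log(ngrams('ADTHUADT'.split(''), 4).map(w => w.join('')));
```

Because the windows overlap, one occurrence of a pattern of length k shows up in as many as n - k + 1 windows, so supports computed over the windows are not the same as raw occurrence counts in the stream.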

James Brierley · Accepted Answer · 2015-11-09 16:21:16Z

0

Here's a simple algorithm (in JavaScript) that will generate a count of all substrings.

Keep a count of substring occurrences in a dictionary. Iterate over every possible substring in the stream, and if it is already in the dictionary, increment it, otherwise add it with a value of 1.

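const stream = 'FOOBARFOO';
// No prototype, so keys such as 'constructor' cannot collide with inherited properties.
const substrings = Object.create(null);
const minimumSubstringLength = 2;

// i is the exclusive end index and j the start index of each substring.
for (let i = minimumSubstringLength; i <= stream.length; i++) {
    for (let j = 0; j <= i - minimumSubstringLength; j++) {
        const substring = stream.substring(j, i);
        // Start at 0 the first time a substring is seen, then increment.
        substrings[substring] = (substrings[substring] || 0) + 1;
    }
}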

Then use a sorting algorithm to order the dictionary by its values.
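The sorting step could look roughly like this, assuming the counts live in a plain object such as `substrings` above. The function name `rankSubstrings` and the `minSupport` threshold are illustrative additions.

```javascript
// Sketch: order substring counts by frequency and drop the rare ones.
// `rankSubstrings` and `minSupport` are hypothetical names.
function rankSubstrings(counts, minSupport) {
    return Object.entries(counts)
        // Keep substrings seen at least minSupport times.
        .filter(([, count]) => count >= minSupport)
        // Most frequent first; among equal counts, longer substrings first.
        .sort((a, b) => b[1] - a[1] || b[0].length - a[0].length);
}

// A few of the counts that 'FOOBARFOO' produces.
const exampleCounts = { FO: 2, OO: 2, FOO: 2, OB: 1, BAR: 1 };
console.log(rankSubstrings(exampleCounts, 2));
```

Filtering before sorting keeps the sort small when most substrings occur only once, which is the usual case for long streams.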

answered Nov 9, 2015 at 16:21

James Brierley


1 Comment

Haris Over a year ago

Yes, that's already been suggested, but I want something more efficient than brute force.
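One direction hinted at in the comments on the question is a suffix array. The sketch below shows only the idea: once the suffixes are sorted, any pattern with at least `minSupport` occurrences is a shared prefix of `minSupport` neighbouring suffixes. It assumes single-character events and `minSupport` of at least 2, and it sorts the suffixes naively, so it is not itself faster than brute force; a dedicated suffix array construction would be needed for that. The name `longestFrequentPattern` is illustrative.

```javascript
// Sketch: longest pattern occurring at least minSupport times, via a suffix array.
// Naive construction for clarity; `longestFrequentPattern` is a hypothetical name.
function longestFrequentPattern(text, minSupport) {
    if (minSupport < 2) return text; // the whole text occurs once

    // Suffix array: start positions ordered by the suffix that begins there.
    const suffixes = Array.from({ length: text.length }, (_, i) => i);
    suffixes.sort((a, b) => {
        const x = text.slice(a), y = text.slice(b);
        return x < y ? -1 : x > y ? 1 : 0;
    });

    // lcp[k]: length of the common prefix of sorted suffixes k - 1 and k.
    const lcp = suffixes.map((start, k) => {
        if (k === 0) return 0;
        const prev = suffixes[k - 1];
        let len = 0;
        while (start + len < text.length && prev + len < text.length &&
               text[start + len] === text[prev + len]) len++;
        return len;
    });

    // minSupport neighbouring suffixes share a prefix whose length is the
    // minimum of the (minSupport - 1) lcp values between them.
    let best = '';
    for (let k = minSupport - 1; k < suffixes.length; k++) {
        let shared = Infinity;
        for (let m = k - minSupport + 2; m <= k; m++) shared = Math.min(shared, lcp[m]);
        if (shared > best.length) best = text.slice(suffixes[k], suffixes[k] + shared);
    }
    return best;
}

console.log(longestFrequentPattern('ADTHUADTHTHUAHTH', 2));
```

For the question's example with min_support = 2 the result has length 4, in line with the observation that nothing of length 5 repeats.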
