Finding frequent sequence of numbers in an array

Question

Array (3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32)

the frequent sequence of numbers will be (3, 5) f=2 + (4, 7, 13) f=2

any Algorithm or Pseudo code to find that ?

Update(1):

if (7, 13) also occurrence it will be included in the longest one by update its frequency so

(4, 7, 13) f=3 and so on...

Update(2):

in case of (1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2) the output should be (1,2,3,4) & (3,4,1,2)

& (7,8) , to make it clear consider each number as a word and you want to find most frequent phrases

so it is common to see same word(s) in a lot of phrases but if any phrase was sub-string for any other

phrase(s) should not be consider as a phrase but will update frequency of each phrase includes it

related maybe help (c#) : stackoverflow.com/questions/279359/… — Haim Evgi
– Haim Evgi, Commented Aug 18, 2010 at 6:38

Yanick Rochon · Accepted Answer · 2010-08-19 12:04:22Z

8

** EDIT ** : slightly better implementation, now also returns frequences and has a better sequence filter.

function getFrequences($input, $minimalSequenceSize = 2) {
   $sequences = array();
   $frequences = array();

   $len = count($input);
   for ($i=0; $i<$len; $i++) {
      $offset = $i;

      for ($j=$i+$minimalSequenceSize; $j<$len; $j++) {
         if ($input[$offset] == $input[$j]) {
            $sequenceSize = 1;
            $sequence = array($input[$offset]);
            while (($offset + $sequenceSize < $j) 
               && ($input[$offset+$sequenceSize] == $input[$j+$sequenceSize])) {

               if (false !== ($seqIndex = array_search($sequence, $frequences))) {
                  // we already have this sequence, since we found a bigger one, remove the old one
                  array_splice($sequences, $seqIndex, 1);
                  array_splice($frequences, $seqIndex, 1);
               }              

               $sequence[] = $input[$offset+$sequenceSize];
               $sequenceSize++;
            }

            if ($sequenceSize >= $minimalSequenceSize) {
               if (false !== ($seqIndex = array_search($sequence, $sequences))) {
                  $frequences[$seqIndex]++;
               } else {
                  $sequences[] = $sequence;
                  $frequences[] = 2;  // we have two occurances already
               }
               // $i += $sequenceSize;  // move $i so we don't reuse the same sub-sequence
               break;
            }
         }
      }
   }

   // remove sequences that are sub-sequence of another frequence
   // ** comment this to keep all sequences regardless ** 
   $len = count($sequences);
   for ($i=0; $i<$len; $i++) {
      $freq_i = $sequences[$i];
      for ($j=$i+1; $j<$len; $j++) {
         $freq_j = $sequences[$j];
         $freq_inter = array_intersect($freq_i, $freq_j);
         if (count($freq_inter) != 0) {
            $len--;
            if (count($freq_i) > count($freq_j)) {
               array_splice($sequences, $j, 1);
               array_splice($frequences, $j, 1);
               $j--;
            } else {
               array_splice($sequences, $i, 1);
               array_splice($frequences, $i, 1); 
               $i--;
               break;
            }
         }
      }
   }

   return array($sequences, $frequences);
};

Test case

header('Content-type: text/plain');

$input = array(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 3, 5, 65, 4, 7, 13, 32, 5, 48, 4, 7, 13);

list($sequences, $frequences) = getFrequences($input);
foreach ($sequences as $i => $s) {
   echo "(" . implode(',', $s) . ') f=' . $frequences[$i] . "\n";
}

** EDIT ** : here's an update to the function. It was almost completely rewritten... tell me if this is what you were looking for. I also added a redundancy check to prevent counting the same sequence, or subsequence, twice.

function getFrequences2($input, $minSequenceSize = 2) {
  $sequences = array();

  $last_offset = 0;
  $last_offset_len = 0;

  $len = count($input);
  for ($i=0; $i<$len; $i++) {
     for ($j=$i+$minSequenceSize; $j<$len; $j++) {
        if ($input[$i] == $input[$j]) {
           $offset = 1;
           $sub = array($input[$i]);
           while ($i + $offset < $j && $j + $offset < $len) {
              if ($input[$i + $offset] == $input[$j + $offset]) {
                 array_push($sub, $input[$i + $offset]);
              } else {
                 break;
              }
              $offset++;
           }

           $sub_len = count($sub);
           if ($sub_len >= $minSequenceSize) {
              // $sub must contain more elements than the last sequence found
              // otherwise we will count the same sequence twice
              if ($last_offset + $last_offset_len >= $i + $sub_len) {
                 // we already saw this sequence... ignore
                 continue;
              } else {
                 // save offset and sub_len for future check
                 $last_offset = $i;
                 $last_offset_len = $sub_len;
              }

              foreach ($sequences as & $sequence) {
                 $sequence_len = count($sequence['values']);
                 if ($sequence_len == $sub_len && $sequence['values'] == $sub) {
                    //echo "Found add-full ".var_export($sub, true)." at $i and $j...\n";
                    $sequence['frequence']++;
                    break 2;
                 } else {
                    if ($sequence_len > $sub_len) {
                       $end = $sequence_len - $sub_len;
                       $values = $sequence['values'];
                       $slice_len = $sub_len;
                       $test = $sub;
                    } else {
                       $end = $sub_len - $sequence_len;
                       $values = $sub;
                       $slice_len = $sequence_len;
                       $test = $sequence['values'];
                    }
                    for ($k=0; $k<=$end; $k++) {
                       if (array_slice($values, $k, $slice_len) == $test) {
                          //echo "Found add-part ".implode(',',$sub)." which is part of ".implode(',',$values)." at $i and $j...\n";
                          $sequence['values'] = $values;
                          $sequence['frequence']++;
                          break 3;
                       }
                    }
                 }
              }

              //echo "Found new ".implode(',',$sub)." at $i and $j...\n";
              array_push($sequences, array('values' => $sub, 'frequence' => 2));
              break;
           }
        }
     }
  }

  return $sequences;
};

edited Aug 19, 2010 at 12:04

answered Aug 18, 2010 at 6:59

Yanick Rochon

53.8k30 gold badges146 silver badges231 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Chuck Burgess Over a year ago

Worked for me. Works well. It even eliminates the duplicated 4,7 and only shows the 4,7,13. Nice work!

Yanick Rochon Over a year ago

I found some potential problems and fixed them in this one. The algorithm now also return the frequence for each sequence. Cheers! If you find any error, please tell me so I can update/fix this answer.

Zamblek Over a year ago

Nice work, But in case of '(1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2)' the output should be '(1,2,3,4)' & '(3,4,1,2)' & '(7,8)' , it gives only (3,4,1,2) & (7,8) to make it clear for you consider each number as a word and you want to find most frequent phrases so it is common to see same word(s) in a lot of phrases but if any phrase was sub-string for any other phrase(s) should not be consider as a phrase but will update frequency of each phrase includes it.

Yanick Rochon Over a year ago

I think I understand. I'll update my solution later, however. I'm in a little rush at the moment.

Zamblek Over a year ago

Thanks, I will test it and back to you with the results

|

John La Rooy · Accepted Answer · 2010-08-18 07:59:44Z

1

In Python3

>>> from collections import Counter
>>> count_hash=Counter()
>>> T=(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32)
>>> for i in range(2,len(T)+1):
...     for j in range(len(T)+1-i):
...         count_hash[T[j:j+i]]+=1
... 
>>> for k,v in count_hash.items():
...     if v >= 2:
...         print(k,v)
... 
(3, 5) 2
(4, 7, 13) 2
(7, 13) 2
(4, 7) 2

Do you need to filter the (7,13) and the (4,7) out? What happens if there was also (99, 7, 14) in the sequence?

a Counter is just like a hash used to keep track of the number of times we see each substring
The two nested for loops produce all the substrings of T, using count_hash to accumulate the count of each substring.
The final for loop filters all those substrings that only occurred once

Here is a version with a filter

from collections import Counter
def substrings(t, minlen=2):
    tlen = len(t)
    return (t[j:j+i] for i in range(minlen, tlen+1) for j in range(tlen+1-i))

def get_freq(*t):
    counter = Counter(substrings(t))
    for k in sorted(counter, key=len):
        v=counter[k]
        if v < 2:
            del counter[k]
            continue
        for t in substrings(k):
            if t in counter:
                if t==k:
                    continue
                counter[k]+=counter[t]-v
                del counter[t]
    return counter

print(get_freq(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32, 4, 7))
print(get_freq(1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2))

the output is

Counter({(4, 7, 13): 3, (3, 5): 2})
Counter({(1, 2, 3, 4, 1, 2): 8, (7, 8): 2}) # Is this the right answer?

Which is why I asked how the filtering should work for the sequence I gave in the comments

edited Aug 18, 2010 at 7:59

answered Aug 18, 2010 at 6:36

John La Rooy

306k54 gold badges378 silver badges514 bronze badges

7 Comments

Zamblek Over a year ago

Yes I need to filter them, I search only for the longest sequence of numbers and ignore any sub-sequence included already in it, Can you write the code in general or by any other language like Java or C++ or PHP

John La Rooy Over a year ago

@cdburgess, The questions asks for an algorithm or pseudo code. This is an algorithm

John La Rooy Over a year ago

@D3VELOPER, but what happens if the substring (7,14) also occurs in other places (not following a 4)?

Zamblek Over a year ago

it will be included in the frequency of the largest one

John La Rooy Over a year ago

@D3VELOPER, what would be the output for (1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2)?

|

questzen · Accepted Answer · 2010-08-18 06:41:01Z

0

Ok, just to start off the discussion.

Create another array/map, call this weightage array.
Start iterating on the values array.
For each value in values array,increment the corresponding position in weightage array. Eg: for 3 increase weightage[3]++, for 48 weightage[48]++.
After the iteration the weightage array contains repetitions

answered Aug 18, 2010 at 6:41

questzen

3,29720 silver badges21 bronze badges

Collectives™ on Stack Overflow

Finding frequent sequence of numbers in an array

3 Answers 3

7 Comments

7 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

7 Comments

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related