8

Array (3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32)

the frequent sequence of numbers will be (3, 5) f=2 + (4, 7, 13) f=2

any Algorithm or Pseudo code to find that ?

Update(1):

if (7, 13) also occurrence it will be included in the longest one by update its frequency so

(4, 7, 13) f=3 and so on...

Update(2):

in case of (1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2) the output should be (1,2,3,4) & (3,4,1,2)

& (7,8) , to make it clear consider each number as a word and you want to find most frequent phrases

so it is common to see same word(s) in a lot of phrases but if any phrase was sub-string for any other

phrase(s) should not be consider as a phrase but will update frequency of each phrase includes it

1

3 Answers 3

8

** EDIT ** : slightly better implementation, now also returns frequences and has a better sequence filter.

function getFrequences($input, $minimalSequenceSize = 2) {
   $sequences = array();
   $frequences = array();

   $len = count($input);
   for ($i=0; $i<$len; $i++) {
      $offset = $i;

      for ($j=$i+$minimalSequenceSize; $j<$len; $j++) {
         if ($input[$offset] == $input[$j]) {
            $sequenceSize = 1;
            $sequence = array($input[$offset]);
            while (($offset + $sequenceSize < $j) 
               && ($input[$offset+$sequenceSize] == $input[$j+$sequenceSize])) {

               if (false !== ($seqIndex = array_search($sequence, $frequences))) {
                  // we already have this sequence, since we found a bigger one, remove the old one
                  array_splice($sequences, $seqIndex, 1);
                  array_splice($frequences, $seqIndex, 1);
               }              

               $sequence[] = $input[$offset+$sequenceSize];
               $sequenceSize++;
            }

            if ($sequenceSize >= $minimalSequenceSize) {
               if (false !== ($seqIndex = array_search($sequence, $sequences))) {
                  $frequences[$seqIndex]++;
               } else {
                  $sequences[] = $sequence;
                  $frequences[] = 2;  // we have two occurances already
               }
               // $i += $sequenceSize;  // move $i so we don't reuse the same sub-sequence
               break;
            }
         }
      }
   }

   // remove sequences that are sub-sequence of another frequence
   // ** comment this to keep all sequences regardless ** 
   $len = count($sequences);
   for ($i=0; $i<$len; $i++) {
      $freq_i = $sequences[$i];
      for ($j=$i+1; $j<$len; $j++) {
         $freq_j = $sequences[$j];
         $freq_inter = array_intersect($freq_i, $freq_j);
         if (count($freq_inter) != 0) {
            $len--;
            if (count($freq_i) > count($freq_j)) {
               array_splice($sequences, $j, 1);
               array_splice($frequences, $j, 1);
               $j--;
            } else {
               array_splice($sequences, $i, 1);
               array_splice($frequences, $i, 1); 
               $i--;
               break;
            }
         }
      }
   }

   return array($sequences, $frequences);
};

Test case

header('Content-type: text/plain');

$input = array(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 3, 5, 65, 4, 7, 13, 32, 5, 48, 4, 7, 13);

list($sequences, $frequences) = getFrequences($input);
foreach ($sequences as $i => $s) {
   echo "(" . implode(',', $s) . ') f=' . $frequences[$i] . "\n";
}

** EDIT ** : here's an update to the function. It was almost completely rewritten... tell me if this is what you were looking for. I also added a redundancy check to prevent counting the same sequence, or subsequence, twice.

function getFrequences2($input, $minSequenceSize = 2) {
  $sequences = array();

  $last_offset = 0;
  $last_offset_len = 0;

  $len = count($input);
  for ($i=0; $i<$len; $i++) {
     for ($j=$i+$minSequenceSize; $j<$len; $j++) {
        if ($input[$i] == $input[$j]) {
           $offset = 1;
           $sub = array($input[$i]);
           while ($i + $offset < $j && $j + $offset < $len) {
              if ($input[$i + $offset] == $input[$j + $offset]) {
                 array_push($sub, $input[$i + $offset]);
              } else {
                 break;
              }
              $offset++;
           }

           $sub_len = count($sub);
           if ($sub_len >= $minSequenceSize) {
              // $sub must contain more elements than the last sequence found
              // otherwise we will count the same sequence twice
              if ($last_offset + $last_offset_len >= $i + $sub_len) {
                 // we already saw this sequence... ignore
                 continue;
              } else {
                 // save offset and sub_len for future check
                 $last_offset = $i;
                 $last_offset_len = $sub_len;
              }

              foreach ($sequences as & $sequence) {
                 $sequence_len = count($sequence['values']);
                 if ($sequence_len == $sub_len && $sequence['values'] == $sub) {
                    //echo "Found add-full ".var_export($sub, true)." at $i and $j...\n";
                    $sequence['frequence']++;
                    break 2;
                 } else {
                    if ($sequence_len > $sub_len) {
                       $end = $sequence_len - $sub_len;
                       $values = $sequence['values'];
                       $slice_len = $sub_len;
                       $test = $sub;
                    } else {
                       $end = $sub_len - $sequence_len;
                       $values = $sub;
                       $slice_len = $sequence_len;
                       $test = $sequence['values'];
                    }
                    for ($k=0; $k<=$end; $k++) {
                       if (array_slice($values, $k, $slice_len) == $test) {
                          //echo "Found add-part ".implode(',',$sub)." which is part of ".implode(',',$values)." at $i and $j...\n";
                          $sequence['values'] = $values;
                          $sequence['frequence']++;
                          break 3;
                       }
                    }
                 }
              }

              //echo "Found new ".implode(',',$sub)." at $i and $j...\n";
              array_push($sequences, array('values' => $sub, 'frequence' => 2));
              break;
           }
        }
     }
  }

  return $sequences;
};
Sign up to request clarification or add additional context in comments.

7 Comments

Worked for me. Works well. It even eliminates the duplicated 4,7 and only shows the 4,7,13. Nice work!
I found some potential problems and fixed them in this one. The algorithm now also return the frequence for each sequence. Cheers! If you find any error, please tell me so I can update/fix this answer.
Nice work, But in case of '(1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2)' the output should be '(1,2,3,4)' & '(3,4,1,2)' & '(7,8)' , it gives only (3,4,1,2) & (7,8) to make it clear for you consider each number as a word and you want to find most frequent phrases so it is common to see same word(s) in a lot of phrases but if any phrase was sub-string for any other phrase(s) should not be consider as a phrase but will update frequency of each phrase includes it.
I think I understand. I'll update my solution later, however. I'm in a little rush at the moment.
Thanks, I will test it and back to you with the results
|
1

In Python3

>>> from collections import Counter
>>> count_hash=Counter()
>>> T=(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32)
>>> for i in range(2,len(T)+1):
...     for j in range(len(T)+1-i):
...         count_hash[T[j:j+i]]+=1
... 
>>> for k,v in count_hash.items():
...     if v >= 2:
...         print(k,v)
... 
(3, 5) 2
(4, 7, 13) 2
(7, 13) 2
(4, 7) 2

Do you need to filter the (7,13) and the (4,7) out? What happens if there was also (99, 7, 14) in the sequence?

a Counter is just like a hash used to keep track of the number of times we see each substring
The two nested for loops produce all the substrings of T, using count_hash to accumulate the count of each substring.
The final for loop filters all those substrings that only occurred once

Here is a version with a filter

from collections import Counter
def substrings(t, minlen=2):
    tlen = len(t)
    return (t[j:j+i] for i in range(minlen, tlen+1) for j in range(tlen+1-i))

def get_freq(*t):
    counter = Counter(substrings(t))
    for k in sorted(counter, key=len):
        v=counter[k]
        if v < 2:
            del counter[k]
            continue
        for t in substrings(k):
            if t in counter:
                if t==k:
                    continue
                counter[k]+=counter[t]-v
                del counter[t]
    return counter

print(get_freq(3, 5, 1, 3, 5, 48, 4, 7, 13, 55, 65, 4, 7, 13, 32, 4, 7))
print(get_freq(1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2))

the output is

Counter({(4, 7, 13): 3, (3, 5): 2})
Counter({(1, 2, 3, 4, 1, 2): 8, (7, 8): 2}) # Is this the right answer?

Which is why I asked how the filtering should work for the sequence I gave in the comments

7 Comments

Yes I need to filter them, I search only for the longest sequence of numbers and ignore any sub-sequence included already in it, Can you write the code in general or by any other language like Java or C++ or PHP
@cdburgess, The questions asks for an algorithm or pseudo code. This is an algorithm
@D3VELOPER, but what happens if the substring (7,14) also occurs in other places (not following a 4)?
it will be included in the frequency of the largest one
@D3VELOPER, what would be the output for (1,2,3,4,1,2,3,4,1,2,7,8,7,8,3,4,3,4,1,2)?
|
0

Ok, just to start off the discussion.

  1. Create another array/map, call this weightage array.
  2. Start iterating on the values array.
  3. For each value in values array,increment the corresponding position in weightage array. Eg: for 3 increase weightage[3]++, for 48 weightage[48]++.
  4. After the iteration the weightage array contains repetitions

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.