There is an array indexed from 0-n (i.e. size=n) containing elements from 0-m where m < n (assume m is 100 or 1000 times less than n i.e. m is much less than n), so many of the elements or sub array must be repeated and we have to find the number of such repeated sub array of size 1 or more than 1. For example, consider an array
1 2 4 1 2 4 1 5 6 3 5, here
1 is repeated 2 times
2 is repeated 1 time
4 is repeated 1 time
5 is repeated 1 time
1 2 is repeated 1 time
2 4 is repeated 1 time
4 1 is repeated 1 time
1 2 4 is repeated 1 time
2 4 1 is repeated 1 time
1 2 4 1 is repeated 1 time
So the final answer is sum of these i.e. 11
Can anybody please suggest some data structure or an efficient algorithm to solve this. I was thinking of hashing m and noting the index value where it occurs but didn't formalize it.
Assume n,m < 10^5
3 Answers
Create a hash that has integers as keys, and extendible lists (linked lists, e.g.) of integers as values.
Read through your input list. Treat each input element being scanned as a key; append the index of the following element to the associated list of values.
Now, your repeated 1-element count is n - |keys|
Next, go through your keys. For each key, treat the indexed elements of the original list as a new input list. Create a new hash and continue as in step 1. Note that when we store indices we continue to use those of the original list.
Example: 1 2 4 1 2 4 1 5 6 3 5 (assume we're zero-indexed). Here, n=11.
- 1 -> [1, 4, 7]
- 2 -> [2, 5]
- 3 -> [10]
- 4 -> [3, 6]
- 5 -> [8, nil]
- 6 -> [9]
The number of repeated 1-elts is 11 - 6 = 5.
Now, we go through the keys. We start with 1. The index list is [1, 4, 7] which corresponds with elements [2, 2, 5].
- 2 -> [2, 5]
- 5 -> [8]
This picks up 3 - 2 = 1 duplicate 2-element list.
etc...
Running time: Linear in the input + output
1 Comment
I used an approach based on suffix trees, from Skienna's book on the design of algorithms. Suffix trees are a special kind of trie tree, where you insert every suffix of the given string into the trie tree. Trie trees are a way of storing multiple words in a single tree: the root is the null string, and each edge adds a character. I have used a very simple implementation as a proof of concept, below.
#include <iostream>
#include <string>
using std::string;
using std::cout;
class Node
{
Node* m_descendants[10];
int m_count;
public:
Node()
{
for(int i=0; i<10; ++i) { m_descendants[i] = nullptr; }
m_count = 0;
}
void Add(const char* str) {
// Increment number of times we've reached this node
m_count++;
const int firstDigit = str[0] - '0';
if (!(0 <= firstDigit && firstDigit <= 9)) return; // recursion base case
// Access descendant node correponding to current digit
Node*& descendant = m_descendants[firstDigit];
if (descendant == 0) { // create descendant if necessary
descendant = new Node;
}
// Recurse on rest of string
descendant->Add(str +1);
}
void Print(const string& str, int countAdj=0) const // debug function
{
for(int nextDigit=0; nextDigit<10; ++nextDigit) {
const Node* node = m_descendants[nextDigit];
if (node) {
const int adjustedCount = node->Count() - countAdj;
if (adjustedCount > 0) {
char c = '0' + nextDigit;
string strWithC = str;
strWithC += c;
cout << strWithC << ": " << adjustedCount << "\n";
node->Print(strWithC, countAdj);
}
}
}
}
int SumRepeated() const
{
int sumRepeated = 0;
for(int nextDigit=0; nextDigit<10; ++nextDigit) {
const Node* node = m_descendants[nextDigit];
if (node) {
const int repeatCount = node->Count() -1;
if (repeatCount > 0) {
sumRepeated += repeatCount;
sumRepeated += node->SumRepeated();
}
}
}
return sumRepeated;
}
int Count() const { return m_count; }
};
int main(int argc, const char* argv[])
{
// Construct
Node* const root = new Node;
for(const char* str = "12412415635"; *str; ++str) {
root->Add(str);
}
// Print
root->Print(string(), 1);
cout << "Sum repeated is " << root->SumRepeated() << "\n";
return 0; // (nodes are leaked)
}
Output
1: 2
12: 1
124: 1
1241: 1
2: 1
24: 1
241: 1
4: 1
41: 1
5: 1
Sum repeated is 11
Note there is one additional repeated substring here, i.e. 1241.
As I said, this is proof of concept, so, for example, you'd want to use a dictionary instead of an array to store descendants, with larger m. Also, this implementation is not optimal even in terms of the overall algorithm: it's O(n^2) in time and space. You can optimize by collapsing paths with no branching to get O(n) space, and use clever construction algorithms to get O(n) time. Also, as pointed out in one of the other answers, you don't need to process sub-strings that are up to half the length of the original.
Comments
Here's something in JavaScript: (output is more or less immediate and JavaScript is one of the slowest, I think):
<script>
function f (s, m) {
var i = 0, n = s.length, c = 0, H = {}, Hi = {}
while (i < n-1) {
for (var j=i+1; j<=Math.min (i + 1 + m,n - 1); j++) {
if (s[i] == s[j]) {
var i_= i, j_= j, tmp = ''
while (s[i_] == s[j_] && i_ < j) {
tmp += String (s[i_])
i_++
j_++
}
var k = tmp.length - 1
while (k >= 0) {
if (!H[tmp]) {
H[tmp] = 1
Hi[tmp] = i
c++
}
else if (i == Hi[tmp]) {
H[tmp]++
c++
}
tmp = tmp.substr(0,k--)
}
}
}
i++
}
var t1 = ''
for (var i in H) t1 += i + ", "
return [t1,c]
}
var a1 = [1,2,4,1,2,4,1,5,6,3,5],
a2 = []
for (var i=0;i<100000;i++) a2.push(Math.floor(Math.random()*10))
console.log(a1 +"\n"+ f(a1,10))
console.log(f(a2,10))
</script>
Output:
1,2,4,1,2,4,1,5,6,3,5
1, 2, 4, 5, 12, 24, 41, 124, 241, ,10
["0...6358, 3582, 038, 4304, 2068, 095,
9252, 37866, 3786, 7866, 7893, 8701, 0669, ", 771]
Floor[12412415635 / (10 ^ (10 - n))] mod 10. It works, but would that be an acceptable pattern for you? Probably not. You need to specify what exactly you consider a "pattern". Unless you can give a specification, you can't implement this.