Need to Improve a Sequence Detection Algorithm

Question

This question is complex, so please ask if any part of it needs more explanation. (P.S. I'm not a native English speaker, which is why...)

The input is a sample sequence of length 34, and the output is its "result" part.

I have a sample sequence of length 34. It is constructed in one of the two ways below ("result" is the part I need):

Sample sequence = result part + known sequence (I don't know the length of the result part)

result (length 34)
result (length N, N < 34) + known sequence (length 34 - N)

All the numbers in the sequences are random.

I need to find the result part, without including the known sequence part.

Some background info:

I have 10 million of these sample sequences, each of length 34 (10 million known 34-digit random number sequences from a generator).
After I find the result, I need to compare it against a sequence B of length 5 million and check whether the result matches uniquely somewhere in that long sequence.
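
The uniqueness check described above could be sketched roughly as follows. This is only an illustration, assuming the result and sequence B are both held as Python strings of digits; the function names `count_occurrences` and `is_unique_match` are hypothetical and not from the original post.

```python
def count_occurrences(haystack, needle, limit=2):
    """Count occurrences of needle in haystack (overlaps included).

    Stops as soon as `limit` occurrences are seen, since for a
    uniqueness check we only need to know "0, 1, or more than 1".
    """
    if not needle:
        return 0  # assumption: an empty result never counts as a match
    count = 0
    start = haystack.find(needle)
    while start != -1 and count < limit:
        count += 1
        # Resume one position later so overlapping matches are counted.
        start = haystack.find(needle, start + 1)
    return count


def is_unique_match(sequence_b, result):
    """True if result occurs exactly once in sequence_b."""
    return count_occurrences(sequence_b, result) == 1
```

For example, `is_unique_match("1234512", "345")` is true, while `is_unique_match("1234512", "12")` is false because "12" occurs twice.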

My current algorithm uses a detector, which is the first 10 digits of the known sequence: if I find the detector somewhere in the sample sequence, I remove it and everything after it. However, there is still a chance that the result itself contains part of the known sequence. Does anyone have a better algorithm?
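
For reference, the detector approach described above might look something like this. It is a sketch of my reading of the description, assuming sequences are digit strings; `strip_with_detector` and `detector_len` are hypothetical names.

```python
def strip_with_detector(sample, known, detector_len=10):
    """Cut the sample at the first occurrence of the detector.

    The detector is the first `detector_len` digits of the known
    sequence. If it is not found, the sample is returned unchanged.
    """
    detector = known[:detector_len]
    pos = sample.find(detector)
    if pos == -1:
        return sample
    return sample[:pos]


sample = "199010104728392817111123995561547659"
known = "28392817111123995561547659"
print(strip_with_detector(sample, known))  # prints 1990101047
```

One weakness of this sketch: when fewer than `detector_len` digits of the known sequence sit at the end of the sample, the detector is never found and those trailing known digits stay in the output. For instance, `strip_with_detector("1234567890283", known)` returns the input unchanged even though it ends with "283".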

Thanks so much! In addition, I'm programming this in Python.

ex.

1st condition:

199010104761700150004736290473629657 == sample seq

The whole sequence is the result, and the known part is still the same.

input:

199010104761700150004736290473629657

output:

199010104761700150004736290473629657

2nd condition:

199010104728392817111123995561547659 == sample seq

1990101047 == result part

28392817111123995561547659... == known part

input will be: 199010104728392817111123995561547659, 28392817111123995561547659...

output I want is: 1990101047

This really isn't clear. Are you saying you have 10 million sample sequences with different "results", or 10 million sequences with the same "result" but different "known sequences", or 10 million sequences with different values of "N"? Is the value of "N" known? Part 2 sounds like a completely unrelated problem...
– Oliver Charlesworth, Commented Jun 12, 2012 at 16:54
In summary, I agree that you should add some simple examples to your question, in order to illustrate what's going on here.
– Oliver Charlesworth, Commented Jun 12, 2012 at 16:55
Please tell me exactly what is not clear, so I can improve the question. I'm sorry about all this. Thanks guys! I'm going to lunch and will be back in an hour. Thanks all!
– windsound, Commented Jun 12, 2012 at 17:02
I've added an example, along with my current approach to this problem. I'd like a better algorithm if possible.
– windsound, Commented Jun 12, 2012 at 17:11

Markus Jarderot · Accepted Answer · 2012-06-19 09:47:46Z

1

You could use the Knuth–Morris–Pratt algorithm. You won't actually find the substring, but you can take note of the value of i when you reach the end of the subject string.
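
One way to read this suggestion, sketched below under some assumptions: run KMP with the known sequence as the pattern and the sample as the subject string, and treat the number of pattern characters matched when the subject runs out as the length of the overlap (I take this to be the "i" mentioned above, as in common KMP pseudocode). Sequences are assumed to be digit strings, and the names `build_failure_table`, `overlap_length`, and `extract_result` are hypothetical.

```python
def build_failure_table(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of it (the standard KMP prefix function)."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail


def overlap_length(sample, known):
    """Length of the longest suffix of sample that is a prefix of known."""
    if not known:
        return 0
    fail = build_failure_table(known)
    k = 0  # number of pattern characters currently matched
    for ch in sample:
        if k == len(known):
            # Full match seen before the end: fall back and keep scanning.
            k = fail[k - 1]
        while k > 0 and ch != known[k]:
            k = fail[k - 1]
        if ch == known[k]:
            k += 1
    return k


def extract_result(sample, known):
    """Drop the trailing part of sample that overlaps the start of known."""
    return sample[:len(sample) - overlap_length(sample, known)]


sample = "199010104728392817111123995561547659"
known = "28392817111123995561547659"
print(extract_result(sample, known))  # prints 1990101047
```

Unlike a fixed 10-digit detector, this also catches overlaps shorter than 10 digits, and it runs in time linear in the length of the sample plus the known sequence. It cannot remove the underlying ambiguity, though: with random digits, a result that fills the whole sample can still end, by chance, with the first digit or two of the known sequence, and the sample alone does not distinguish that from a real overlap.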


windsound Over a year ago

Thanks for answering anyway! That is actually very helpful, thanks a lot!
