This question is complex, so please ask if you need more detail on any part of it. (P.S. I'm not a native English speaker, that's why...)

The input is a sample sequence of length 34, and the output is the result part.

I have a sample sequence of length 34. It may be constructed in one of the two ways listed below ("result" is what I need):

Sample sequence = result part + known sequence (I don't know the length of the result part)

  1. result (length 34)
  2. result (length N, N < 34) + known sequence (length 34 - N)

All the numbers in the sequences are random.

I need to extract the result part without including any of the known sequence.

Some background info:

  1. I have 10 million of these sample sequences, each of length 34 (10 million known 34-digit random number sequences from a generator).

  2. After I find the result, I need to compare it against a sequence B of length 5 million and check whether the result matches at exactly one place in that long sequence.
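The uniqueness check in point 2 could be sketched as follows. This is only an illustration under assumptions: the function name `matches_uniquely` is made up here, both sequences are assumed to be Python strings of digits, and overlapping occurrences are counted as separate matches.

```python
def matches_uniquely(result, long_seq):
    """Return True if `result` occurs exactly once in `long_seq`.

    Hypothetical helper: both arguments are assumed to be digit strings.
    """
    if not result:
        # An empty result matches everywhere, so it is never unique.
        return False
    first = long_seq.find(result)
    if first == -1:
        # No occurrence at all.
        return False
    # Search again starting one position later, so that an overlapping
    # second occurrence is also detected.
    return long_seq.find(result, first + 1) == -1
```

Searching a second time from `first + 1` instead of using `str.count` is deliberate: `count` skips overlapping occurrences, which matters for repetitive digit runs.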

My current algorithm uses a detector, which is the first 10 digits of the known sequence: if I find the detector somewhere in the sample sequence, I remove everything from that point on. However, there is still a chance that the result itself contains part of the known sequence. Does anyone have a better algorithm?
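For reference, the detector approach described above might look roughly like this. It is a sketch of my reading of that description, not the actual code: the name `strip_by_detector` and the parameter `detector_len` are invented for illustration, and the sequences are assumed to be strings.

```python
def strip_by_detector(sample, known, detector_len=10):
    """Cut `sample` at the first occurrence of the detector.

    The detector is the first `detector_len` digits of `known`
    (hypothetical helper, assuming string inputs).
    """
    detector = known[:detector_len]
    pos = sample.find(detector)
    if pos == -1:
        # Detector not present: treat the whole sample as the result.
        return sample
    return sample[:pos]
```

Besides the false-positive risk mentioned above, this sketch also returns the sample unchanged when fewer than `detector_len` digits of the known sequence are present at the end of the sample.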

Thanks so much! In addition, I'm programming this in Python.

Example:

1st condition:

199010104761700150004736290473629657 == sample seq

The whole sequence is the result, and the known part is still the same.

input:

199010104761700150004736290473629657

output:

199010104761700150004736290473629657

2nd condition:

199010104728392817111123995561547659 == sample seq

1990101047 == result part

28392817111123995561547659... == known part

input will be: 199010104728392817111123995561547659, 28392817111123995561547659...

output I want is: 1990101047

  • sample input & sample output? Commented Jun 12, 2012 at 16:53
  • This really isn't clear. Are you saying you have 10 million sample sequences with different "results", or 10 million sequences with the same "result" but different "known sequences", or 10 million sequences with different values of "N"? Is the value of "N" known? Part 2 sounds like a completely unrelated problem... Commented Jun 12, 2012 at 16:54
  • In summary, I agree that you should add some simple examples to your question, in order to illustrate what's going on here. Commented Jun 12, 2012 at 16:55
  • Please tell me exactly what is not clear, so I can improve the question. I'm sorry about all this. Thanks, guys! I'm going to lunch and will be back in an hour. Commented Jun 12, 2012 at 17:02
  • I gave an example, and it reflects my current view of the algorithm for this question. I want a better algorithm if possible. Commented Jun 12, 2012 at 17:11

1 Answer

You could use the Knuth–Morris–Pratt algorithm. You won't actually find the substring, but you can take note of the value of i when you reach the end of the subject string.
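One way to read this suggestion, sketched in Python: run a KMP scan with the known sequence as the pattern and the sample as the text, and look at how many pattern characters are matched when the text ends. That count is the length of the longest suffix of the sample that is also a prefix of the known sequence, so the result is everything before it. The names `prefix_function` and `split_result` are made up for this sketch, and `known` below holds only the digits shown in the question (the rest is elided there).

```python
def prefix_function(pattern):
    """failure[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it (the usual KMP failure table)."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def split_result(sample, known):
    """Return `sample` without its longest suffix that is a prefix of `known`."""
    if not known:
        return sample
    failure = prefix_function(known)
    k = 0  # number of characters of `known` currently matched
    for ch in sample:
        if k == len(known):
            # Full match before the end of the sample: fall back and keep scanning.
            k = failure[k - 1]
        while k > 0 and ch != known[k]:
            k = failure[k - 1]
        if ch == known[k]:
            k += 1
    # k is now the overlap between the end of `sample` and the start of `known`.
    return sample[:len(sample) - k]


known = "28392817111123995561547659"  # only the digits visible in the question
print(split_result("199010104728392817111123995561547659", known))  # 1990101047
```

Note that this always removes the longest overlap, so if the true result happens to end with digits that continue into a prefix of the known sequence, those digits are removed as well. With random digits that ambiguity cannot be resolved from a single sample.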

1 Comment

Thanks for answering anyway! That is actually very helpful, thanks a lot!