
I am writing a search application that tokenizes a big textual corpus.

The text parser needs to remove any gibberish from the text (i.e. any character matching [^a-zA-Z0-9]).

I had two ideas for how to do this:

1) Put the text in a string, convert it to a char array using String.ToCharArray, and then walk it char by char in a loop (while (position < string.Length)). This way I can tokenize the entire string in a single pass over the text.

2) Strip all non-digit/alpha characters using String.Replace, and then String.Split with some delimiters. This means I have to run over the entire string twice: once to remove the bad chars and again to split it.
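Roughly, this is what I mean (simplified sketches, not my actual code; the method names and the specific characters passed to Replace are just placeholders):

    using System;
    using System.Collections.Generic;
    using System.Text;

    class TokenizerSketch
    {
        // Approach #1: one pass over the char array, building tokens as we go.
        static List<string> TokenizeSinglePass(string text)
        {
            var tokens = new List<string>();
            char[] chars = text.ToCharArray();
            var current = new StringBuilder();
            int position = 0;
            while (position < chars.Length)
            {
                char c = chars[position];
                bool isAlphaNumeric = (c >= 'a' && c <= 'z') ||
                                      (c >= 'A' && c <= 'Z') ||
                                      (c >= '0' && c <= '9');
                if (isAlphaNumeric)
                {
                    current.Append(c);                // still inside a token
                }
                else if (current.Length > 0)
                {
                    tokens.Add(current.ToString());   // token boundary
                    current.Length = 0;
                }
                position++;
            }
            if (current.Length > 0)
                tokens.Add(current.ToString());
            return tokens;
        }

        // Approach #2: strip the bad characters first, then split.
        static string[] TokenizeReplaceThenSplit(string text)
        {
            // The real code replaces every "bad" character; only a few are shown here.
            string cleaned = text.Replace(",", " ").Replace(".", " ").Replace(";", " ");
            return cleaned.Split(new[] { ' ', '\t', '\r', '\n' },
                                 StringSplitOptions.RemoveEmptyEntries);
        }
    }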

I assumed that since #1 does the same as #2 but in O(n), it would be quicker, but after testing both, #2 is way (way!) faster.

I went even further and viewed the code behind the stripping call (String.Replace) using Red Gate .NET Reflector. It runs unmanaged, char by char, just like #1, but it is still much, much faster.

I have no clue why #2 is way faster than #1.

Any ideas?

  • As described, I think both method 1 and method 2 are O(n). Commented Nov 8, 2010 at 22:26
  • It is pointless to guess at perf until you post code that everybody can try. For all we know, you really goofed on #1. String.Replace() is heavily optimized, but it is an O(nm) algorithm and you still need to run it k times. Your finding doesn't make a lot of sense. Commented Nov 8, 2010 at 22:35

2 Answers


How about this idea:

  1. Create a string
  2. Load the entire data set into the string
  3. Create a StringBuilder with enough pre-allocated space to hold the entire string
  4. Go character by character through the string and if the character is alphanumeric, add it to the StringBuilder.
  5. At the end, get the string out of the StringBuilder.

I don't know if this will be any faster than what you've already tried, but timing the above should at least answer that question.
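Something like this, as an untested sketch (it assumes the whole corpus fits in one string; the file name "corpus.txt" and the exact alphanumeric test are placeholders):

    using System.IO;
    using System.Text;

    string text = File.ReadAllText("corpus.txt");   // steps 1-2: load everything into one string
    var sb = new StringBuilder(text.Length);        // step 3: pre-allocate to the full length

    foreach (char c in text)                        // step 4: keep only alphanumeric chars
    {
        bool isAlphaNumeric = (c >= 'a' && c <= 'z') ||
                              (c >= 'A' && c <= 'Z') ||
                              (c >= '0' && c <= '9');
        if (isAlphaNumeric)
            sb.Append(c);
        // (If you still need to split into tokens afterwards, append a space here
        // instead of dropping the character, so word boundaries survive.)
    }

    string cleaned = sb.ToString();                 // step 5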


1 Comment

After extra debugging, I noticed my performance is very poor when handling numbers.

djTeller,
The fact that #2 is faster is merely relative to your #1 method.
You might want to share your #1 method with us; maybe it's just very slow, and it may even be possible to make it faster than #2.
Yes, both are essentially O(n), but is the ACTUAL implementation O(n)? How did you actually do #1?

Also, when you said you tested both, I hope you did so with a large amount of input, to overcome the margin of error and see a significant difference between the two.
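For example (purely hypothetical, since you haven't posted your code), a single-pass tokenizer that builds each token with string concatenation does only one pass over the input but copies the growing token on every append, which can easily make it slower than Replace/Split despite being "O(n)" in iterations:

    using System.Collections.Generic;

    // Hypothetical #1 that iterates once but is dominated by string copying.
    static List<string> TokenizeNaive(string text)
    {
        var tokens = new List<string>();
        string token = "";
        foreach (char c in text)
        {
            // (Char.IsLetterOrDigit is broader than [a-zA-Z0-9], but that's not the point here.)
            if (char.IsLetterOrDigit(c))
            {
                token += c;              // allocates and copies a new string every time
            }
            else if (token.Length > 0)
            {
                tokens.Add(token);
                token = "";
            }
        }
        if (token.Length > 0)
            tokens.Add(token);
        return tokens;
    }

Using a StringBuilder for the current token (or tracking start/end indices into the original string and calling Substring once per token) avoids that repeated copying.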

1 Comment

I tested it on a huge 1 GB corpus. I will paste the code soon.
