I am writing a search application that tokenizes a big textual corpus.
The text parser needs to remove any gibberish from the text (i.e. anything matching [^a-zA-Z0-9]).
I had two ideas for how to do this:
1) Put the text in a string, convert it to a char array with String.ToCharArray, and then walk it char by char in a loop (while (position < text.Length)). This way I can tokenize the entire string in a single pass over the text (first sketch below).
2) Strip all non-digit/alpha chars using String.Replace, and then String.Split with some delimiters. This means I have to run over the entire string twice: once to remove the bad chars and again to split it (second sketch below).
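Here's a minimal sketch of what I mean by #1 (simplified for this post; the class and method names are just illustrative and my real code differs in the details):

    using System;
    using System.Collections.Generic;

    static class CharLoopTokenizer
    {
        // Single pass: accumulate runs of [a-zA-Z0-9] into tokens,
        // treating everything else as a token boundary.
        public static List<string> Tokenize(string text)
        {
            var tokens = new List<string>();
            char[] chars = text.ToCharArray();
            int position = 0;
            int tokenStart = -1; // -1 means we are not currently inside a token

            while (position < chars.Length)
            {
                char c = chars[position];
                bool isAlnum = (c >= 'a' && c <= 'z') ||
                               (c >= 'A' && c <= 'Z') ||
                               (c >= '0' && c <= '9');
                if (isAlnum)
                {
                    if (tokenStart < 0) tokenStart = position; // a token begins here
                }
                else if (tokenStart >= 0)
                {
                    // A gibberish char ends the current token.
                    tokens.Add(new string(chars, tokenStart, position - tokenStart));
                    tokenStart = -1;
                }
                position++;
            }
            if (tokenStart >= 0) // flush a trailing token
                tokens.Add(new string(chars, tokenStart, chars.Length - tokenStart));
            return tokens;
        }
    }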
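And #2 in the same spirit. One caveat on the sketch: String.Replace only replaces a literal char/string, so to express the whole [^a-zA-Z0-9] class in one line I'm using Regex.Replace here as a compact stand-in; my real code does the stripping with String.Replace as described above:

    using System;
    using System.Text.RegularExpressions;

    static class ReplaceSplitTokenizer
    {
        public static string[] Tokenize(string text)
        {
            // Pass 1: map every non-alphanumeric char to a space.
            string cleaned = Regex.Replace(text, "[^a-zA-Z0-9]", " ");

            // Pass 2: split on spaces, dropping the empty entries
            // left behind by runs of replaced chars.
            return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        }
    }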
I assumed that since #1 does the same work as #2 but in a single pass over the text (both are O(n), but #1 touches each char once instead of twice), it would be quicker. But after testing both, #2 is way (way!) faster.
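For reference, the timing was done with a rough Stopwatch harness along these lines ("corpus.txt" and the sketch classes above are placeholders for my actual setup; there's a warm-up call so JIT compilation isn't counted in either measurement):

    using System;
    using System.Diagnostics;
    using System.IO;

    class TokenizerBenchmark
    {
        static void Main()
        {
            string corpus = File.ReadAllText("corpus.txt"); // placeholder input

            // Warm-up so the JIT cost doesn't skew the first timing.
            CharLoopTokenizer.Tokenize(corpus);
            ReplaceSplitTokenizer.Tokenize(corpus);

            var sw = Stopwatch.StartNew();
            CharLoopTokenizer.Tokenize(corpus);
            sw.Stop();
            Console.WriteLine("#1 char loop:     {0} ms", sw.ElapsedMilliseconds);

            sw = Stopwatch.StartNew();
            ReplaceSplitTokenizer.Tokenize(corpus);
            sw.Stop();
            Console.WriteLine("#2 replace+split: {0} ms", sw.ElapsedMilliseconds);
        }
    }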
I went even further and looked at the code behind String.Split using Red Gate's .NET Reflector. It runs unmanaged, char by char, just like #1, but it's still much, much faster.
I have no clue why #2 is way faster than #1.
Any ideas?