
I am writing a search application that tokenizes a big textual corpus.

The text parser needs to remove any gibberish from the text (i.e. any character matching [^a-zA-Z0-9]).

I had two ideas for how to do this:

1) Put the text in a string, convert it to a char array using String.ToCharArray, and then walk it char by char in a loop (while (position < string.Length)). This way I can tokenize the entire string in a single pass over the text.

2) Strip all non-digit/alpha characters using String.Replace, and then String.Split with some delimiters. This means I have to run over the entire string twice: once to remove the bad chars and again to split it.
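Roughly, this is what I mean (simplified sketches, not my actual code; the method names and the specific characters passed to Replace are just placeholders):

    using System;
    using System.Collections.Generic;
    using System.Text;

    class TokenizerSketch
    {
        // Approach #1: one pass over the char array, building tokens as we go.
        static List<string> TokenizeSinglePass(string text)
        {
            var tokens = new List<string>();
            char[] chars = text.ToCharArray();
            var current = new StringBuilder();
            int position = 0;
            while (position < chars.Length)
            {
                char c = chars[position];
                bool isAlphaNumeric = (c >= 'a' && c <= 'z') ||
                                      (c >= 'A' && c <= 'Z') ||
                                      (c >= '0' && c <= '9');
                if (isAlphaNumeric)
                {
                    current.Append(c);                // still inside a token
                }
                else if (current.Length > 0)
                {
                    tokens.Add(current.ToString());   // token boundary
                    current.Length = 0;
                }
                position++;
            }
            if (current.Length > 0)
                tokens.Add(current.ToString());
            return tokens;
        }

        // Approach #2: strip the bad characters first, then split.
        static string[] TokenizeReplaceThenSplit(string text)
        {
            // The real code replaces every "bad" character; only a few are shown here.
            string cleaned = text.Replace(",", " ").Replace(".", " ").Replace(";", " ");
            return cleaned.Split(new[] { ' ', '\t', '\r', '\n' },
                                 StringSplitOptions.RemoveEmptyEntries);
        }
    }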

I assumed that since #1 does the same as #2 but in O(n), it would be quicker, but after testing both, #2 is way (way!) faster.

I went even further and viewed the code behind the stripping call (String.Replace) using Red Gate .NET Reflector. It runs unmanaged, char by char, just like #1, but it is still much, much faster.

I have no clue why #2 is way faster than #1.

Any ideas?

  • As described, I think both method 1 and method 2 are O(n). Commented Nov 8, 2010 at 22:26
  • It is pointless to guess at perf until you post code that everybody can try. For all we know, you really goofed on #1. String.Replace() is heavily optimized, but it is an O(nm) algorithm and you still need to run it k times. Your finding doesn't make a lot of sense. Commented Nov 8, 2010 at 22:35

2 Answers


How about this idea:

  1. Create a string
  2. Load the entire data set into the string
  3. Create a StringBuilder with enough pre-allocated space to hold the entire string
  4. Go character by character through the string and if the character is alphanumeric, add it to the StringBuilder.
  5. At the end, get the string out of the StringBuilder.

I don't know if this will be any faster than what you've already tried, but timing the above should at least answer that question.
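Something like this, as an untested sketch (it assumes the whole corpus fits in one string; the file name "corpus.txt" and the exact alphanumeric test are placeholders):

    using System.IO;
    using System.Text;

    string text = File.ReadAllText("corpus.txt");   // steps 1-2: load everything into one string
    var sb = new StringBuilder(text.Length);        // step 3: pre-allocate to the full length

    foreach (char c in text)                        // step 4: keep only alphanumeric chars
    {
        bool isAlphaNumeric = (c >= 'a' && c <= 'z') ||
                              (c >= 'A' && c <= 'Z') ||
                              (c >= '0' && c <= '9');
        if (isAlphaNumeric)
            sb.Append(c);
        // (If you still need to split into tokens afterwards, append a space here
        // instead of dropping the character, so word boundaries survive.)
    }

    string cleaned = sb.ToString();                 // step 5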


1 Comment

After extra debugging, I noticed my performance is very poor when handling numbers.

djTeller,
The fact that #2 is faster is merely relative to your #1 method.
You might want to share your #1 method with us; maybe it's just very slow, and it may even be possible to make it faster than #2.
Yes, both are essentially O(n), but is the ACTUAL implementation O(n)? How did you actually do #1?

Also, when you said you tested both, I hope you did so with a large amount of input, to overcome the margin of error and see a significant difference between the two.
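For example (purely hypothetical, since you haven't posted your code), a single-pass tokenizer that builds each token with string concatenation does only one pass over the input but copies the growing token on every append, which can easily make it slower than Replace/Split despite being "O(n)" in iterations:

    using System.Collections.Generic;

    // Hypothetical #1 that iterates once but is dominated by string copying.
    static List<string> TokenizeNaive(string text)
    {
        var tokens = new List<string>();
        string token = "";
        foreach (char c in text)
        {
            // (Char.IsLetterOrDigit is broader than [a-zA-Z0-9], but that's not the point here.)
            if (char.IsLetterOrDigit(c))
            {
                token += c;              // allocates and copies a new string every time
            }
            else if (token.Length > 0)
            {
                tokens.Add(token);
                token = "";
            }
        }
        if (token.Length > 0)
            tokens.Add(token);
        return tokens;
    }

Using a StringBuilder for the current token (or tracking start/end indices into the original string and calling Substring once per token) avoids that repeated copying.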

1 Comment

I tested it on a huge 1 GB corpus. I will paste the code soon.
