
I have a string like

string Text = "012345678901234567890123456789";

and a List<int> with indexes

List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

with following restrictions

  • there are duplicates within the list
  • the list is not sorted
  • there may be indexes > Text.Length

what's the best way to remove characters from the text which are in the index list?

expected output:

035681234679012456789

Is there a more efficient way than

foreach (int index in Indexes
                        .OrderByDescending(x => x)
                        .Distinct()
                        .Where(x => x < Text.Length))
{
    Text = Text.Remove(index, 1);
}

Update: Here are the benchmarks of the current answers (string with 100.000 characters and a List<int> of length 10.000):

Gallant: 3.322 ticks
Tim Schmelter: 8.602.576 ticks
Sergei Zinovyev: 9.002 ticks
rbaghbanli: 7.137 ticks
Jirí Tesil Tesarík: 72.580 ticks
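A measurement like the ones above can be taken with a minimal Stopwatch harness along these lines. This is a sketch, not the exact benchmark used for the numbers above: `RemoveIndexes` is a stand-in for whichever answer is being timed, and the random input data is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

// Stand-in for whichever answer is being timed (here: HashSet + StringBuilder).
static string RemoveIndexes(string text, List<int> indexes)
{
    var set = new HashSet<int>(indexes);     // deduplicates, gives O(1) lookups
    var sb = new StringBuilder(text.Length);
    for (int i = 0; i < text.Length; i++)
        if (!set.Contains(i))
            sb.Append(text[i]);
    return sb.ToString();
}

var rng = new Random(42);
string text = new string('x', 100_000);
var indexes = Enumerable.Range(0, 10_000)
                        .Select(_ => rng.Next(text.Length + 100)) // some indexes > text.Length
                        .ToList();

var sw = Stopwatch.StartNew();
string result = RemoveIndexes(text, indexes);
sw.Stop();
Console.WriteLine($"{sw.ElapsedTicks} ticks, result length: {result.Length}");
```

Tick counts vary by machine and run, so only relative differences between the answers are meaningful.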
The question is: why do you need the list at all? Can't you use a HashSet<int> in the first place instead? Then you don't have duplicates, and a lookup with Contains is much more efficient (O(1)). Commented Apr 11, 2016 at 15:25

5 Answers


Here's a more or less elegant LINQ way:

Text = new string(Text.Where((c, index) => !Indexes.Contains(index)).ToArray());

It uses the overload of Enumerable.Where that projects the index of the item in the sequence.

If you want the most efficient (if not the most readable) way and the text is really large, you can use a HashSet<int> instead of the list, which doesn't allow duplicates, and a StringBuilder to build the new string:

var indexSet = new HashSet<int>(Indexes); // create it from the list (as shown here), or use a set instead of the list from the start
var textBuilder = new StringBuilder(Text.Length);

for(int i = 0; i < Text.Length; i++)
    if (!indexSet.Contains(i))
        textBuilder.Append(Text[i]);
Text = textBuilder.ToString();

Of course you could also use the HashSet<int> in the LINQ approach to make it more efficient.
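That combination is not shown above, so here is a sketch of it using the question's data: build the set once up front, then let the indexed Where overload do an O(1) lookup per character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string Text = "012345678901234567890123456789";
var Indexes = new List<int> { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

// Building the set deduplicates the list and handles out-of-range
// indexes implicitly (they simply never match a character position).
var indexSet = new HashSet<int>(Indexes);
Text = new string(Text.Where((c, index) => !indexSet.Contains(index)).ToArray());

Console.WriteLine(Text); // 035681234679012456789
```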


8 Comments

That will be O(n^2). There is a solution with O(n).
If you use the HashSet in the LINQ approach, that would still be O(n).
Building a binary tree is not fast. Whatever you win in the loop, you lose in the construction of the binary tree.
@Tim Schmelter, if you get a HashSet as an input, that is an assumption, don't you think? OP explicitly indicated that Indexes is a List<int>, not a HashSet<int>.
@rbaghbanli: you are making more assumptions. It could be that OP isn't aware of the HashSet class. Also, if the list doesn't change (often), you can hold the set as a lookup instance alongside the list; then you gain performance at a small memory cost. Either way, it's worth mentioning even if OP knows it and can't use it. In many other cases it will help to improve performance.

This will work faster:

string Text = "012345678901234567890123456789";
List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

HashSet<int> hashSet = new HashSet<int>(Indexes);

StringBuilder sb = new StringBuilder(Text.Length);
for (int i = 0; i < Text.Length; ++i)
{
    if (!hashSet.Contains(i))
    {
        sb.Append(Text[i]);
    }
}

string str = sb.ToString();

6 Comments

HashSet<int> hashSet = new HashSet<int>(Indexes) - is not fast at all.
@rbaghbanli Compared to .OrderByDescending(x=>x).Distinct() it's fast.
True, but there are faster solutions.
Did you include the creation of the HashSet within the test, and what sizes did you benchmark over?
Building HashSet is O(n^2). Have a look, for example, here: stackoverflow.com/questions/15322386/…

Yes, see code below (it will iterate only once over each sequence):

var map = new short[Text.Length];
foreach (var i in Indexes)
{
    if (i < Text.Length)
        map[i] = 1;
}
Text = new string(Text.Where((c, i) => map[i] == 0).ToArray());

11 Comments

By calling RemoveAt you alter the indexes of the array. So this only works if you start with indexes at the end of the array and remove the duplicates. That's why the OP was using OrderByDescending and Distinct.
Judging by the code from OP that is desired behaviour.
No. Imagine if the Indexes as just the number 0 repeated 29 times. The OP's code would only remove the first character while your code would remove all of them. It's sad how many people don't realize that this code is completely wrong.
I am not sure that building a binary tree is a fast process. The longer Indexes is, the longer it will take to build the HashSet.
Why is it more obvious that short defaults to 0 than bool defaults to false? IMO bool is better because you can name the array something like isInList and then do Where((c,i) => !isInList[i]). And you avoid using magic numbers.

The following makes the assumption that your string contains a known set of characters. If you know for certain that, for example, a particular Unicode character never appears in the string, you can use it as a placeholder to mark which characters to remove. This should be very fast in exchange for this limitation:

char temp = '\uFFF0';
StringBuilder sb = new StringBuilder(Text);
for (int i = 0; i < Indexes.Count; i++)
{
    if (Indexes[i] < sb.Length)
    {
        sb[Indexes[i]] = temp;
    }
}

Text = sb.Replace(temp.ToString(), null).ToString();

This appears to be 3 to 4 times faster than building a HashSet like some other answers have suggested. http://ideone.com/mUILHg


If you cannot make the above assumption, you can build an array to hold this extra bit of data instead of using a sentinel character. This does two rounds of iteration (so it's a bit slower), but it is still O(n) (so it should typically be faster than putting Indexes into a hash set before iterating).

bool[] exclude = new bool[Text.Length];
for (int i = 0; i < Indexes.Count; i++)
{
    if (Indexes[i] < exclude.Length)
    {
        exclude[Indexes[i]] = true;
    }
}
StringBuilder sb = new StringBuilder(Text.Length);
for (int i = 0; i < Text.Length; i++)
{
    if (!exclude[i])
    {
        sb.Append(Text[i]);
    }
}
Text = sb.ToString();

Quick benchmarks: http://ideone.com/3d2uPH

4 Comments

Why didn't you use the null char? stackoverflow.com/questions/3670505/…
surprisingly char array gave me the same speed: var a = Text.ToCharArray(); foreach (int i in Indexes) if (i < Text.Length) a[i] = ' '; var result = new string(a).Replace(" ", "");
never mind .. char array is faster for me when the Indexes are more than the characters, but gets slower when the characters are 10 times or more than the Indexes
@Byyo That works as well, but valid strings can contain the null character (which is just Unicode character \u0000), so the issue remains.

A modified solution using a byte array (which may be replaced by a bool array) instead of a hash table. Pros: linear complexity. Cons: needs extra memory for the flag array.

string Text = "012345678901234567890123456789";
List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };
byte[] contains = new byte[Text.Length];
Indexes.ForEach(p => { if (p < Text.Length) contains[p] = 1; });
var output = string.Concat(Enumerable.Range(0, Text.Length).Where(p => contains[p] != 1).Select(p => Text[p]));

1 Comment

Providing extra information to clarify your answer is a good practice, instead of just posting code.
