
I have a string like

string Text = "012345678901234567890123456789";

and a List<int> with indexes

List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

with following restrictions

  • there are duplicates within the list
  • the list is not sorted
  • there may be indexes > Text.Length

what's the best way to remove characters from the text which are in the index list?

expected output:

035681234679012456789

Is there a more efficient way than

foreach (int index in Indexes
                        .OrderByDescending(x => x)
                        .Distinct()
                        .Where(x => x < Text.Length))
{
    Text = Text.Remove(index, 1);
}

Update: Here are the benchmarks of the current answers (string with 100.000 characters and a List<int> of length 10.000):

Gallant: 3.322 ticks
Tim Schmelter: 8.602.576 ticks
Sergei Zinovyev: 9.002 ticks
rbaghbanli: 7.137 ticks
Jirí Tesil Tesarík: 72.580 ticks
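A measurement like the ones above can be taken with a minimal Stopwatch harness along these lines. This is a sketch, not the exact benchmark used for the numbers above: `RemoveIndexes` is a stand-in for whichever answer is being timed, and the random input data is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text;

// Stand-in for whichever answer is being timed (here: HashSet + StringBuilder).
static string RemoveIndexes(string text, List<int> indexes)
{
    var set = new HashSet<int>(indexes);     // deduplicates, gives O(1) lookups
    var sb = new StringBuilder(text.Length);
    for (int i = 0; i < text.Length; i++)
        if (!set.Contains(i))
            sb.Append(text[i]);
    return sb.ToString();
}

var rng = new Random(42);
string text = new string('x', 100_000);
var indexes = Enumerable.Range(0, 10_000)
                        .Select(_ => rng.Next(text.Length + 100)) // some indexes > text.Length
                        .ToList();

var sw = Stopwatch.StartNew();
string result = RemoveIndexes(text, indexes);
sw.Stop();
Console.WriteLine($"{sw.ElapsedTicks} ticks, result length: {result.Length}");
```

Tick counts vary by machine and run, so only relative differences between the answers are meaningful.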
The question is: why do you need the list at all? Can't you use a HashSet<int> in the first place instead? Then you don't have duplicates, and a lookup with Contains is much more efficient (O(1)). Commented Apr 11, 2016 at 15:25

5 Answers


Here's a more or less elegant LINQ way:

Text = new string(Text.Where((c, index) => !Indexes.Contains(index)).ToArray());

It uses the overload of Enumerable.Where that projects the index of the item in the sequence.

If you want the most efficient (if not the most readable) way and the text is really large, you can use a HashSet<int> instead of the list, which doesn't allow duplicates, and a StringBuilder to build the new string:

var indexSet = new HashSet<int>(Indexes); // create it from the list (as shown here), or use a set instead of the list from the start
var textBuilder = new StringBuilder(Text.Length);

for(int i = 0; i < Text.Length; i++)
    if (!indexSet.Contains(i))
        textBuilder.Append(Text[i]);
Text = textBuilder.ToString();

Of course you could also use the HashSet<int> in the LINQ approach to make it more efficient.
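That combination is not shown above, so here is a sketch of it using the question's data: build the set once up front, then let the indexed Where overload do an O(1) lookup per character.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string Text = "012345678901234567890123456789";
var Indexes = new List<int> { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

// Building the set deduplicates the list and handles out-of-range
// indexes implicitly (they simply never match a character position).
var indexSet = new HashSet<int>(Indexes);
Text = new string(Text.Where((c, index) => !indexSet.Contains(index)).ToArray());

Console.WriteLine(Text); // 035681234679012456789
```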


8 Comments

That will be O(n^2). There is a solution with O(n).
If you use the HashSet in the LINQ approach, that would still be O(n).
Building a binary tree is not fast. Whatever you win in the loop, you lose in the construction of the binary tree.
@Tim Schmelter, if you get a HashSet as an input, that is an assumption, don't you think? OP explicitly indicated that Indexes is a List<int>, not a HashSet<int>.
@rbaghbanli: you are making more assumptions. It could be that OP isn't aware of the HashSet class. Also, if the list doesn't change (often), you can hold the set as a lookup instance alongside the list; then you gain performance at a small memory cost. Either way, it's worth mentioning even if OP knows it and can't use it. In many other cases it will help to improve performance.

This will work faster:

string Text = "012345678901234567890123456789";
List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };

HashSet<int> hashSet = new HashSet<int>(Indexes);

StringBuilder sb = new StringBuilder(Text.Length);
for (int i = 0; i < Text.Length; ++i)
{
    if (!hashSet.Contains(i))
    {
        sb.Append(Text[i]);
    }
}

string str = sb.ToString();

6 Comments

HashSet<int> hashSet = new HashSet<int>(Indexes) - is not fast at all.
@rbaghbanli Compared to .OrderByDescending(x=>x).Distinct() it's fast.
True, but there are faster solutions.
Did you include the creation of the HashSet within the test, and what sizes did you benchmark over?
Building HashSet is O(n^2). Have a look, for example, here: stackoverflow.com/questions/15322386/…

Yes, see code below (it will iterate only once over each sequence):

var map = new short[Text.Length];
foreach (var i in Indexes)
{
    if (i < Text.Length)
        map[i] = 1;
}
Text = new string(Text.Where((c, i) => map[i] == 0).ToArray());

11 Comments

By calling RemoveAt you alter the indexes of the array. So this only works if you start with indexes at the end of the array and remove the duplicates. That's why the OP was using OrderByDescending and Distinct.
Judging by the code from OP that is desired behaviour.
No. Imagine if the Indexes as just the number 0 repeated 29 times. The OP's code would only remove the first character while your code would remove all of them. It's sad how many people don't realize that this code is completely wrong.
I am not sure that building a binary tree is a fast process. The longer Indexes is, the longer it will take to build the HashSet.
Why is it more obvious that short defaults to 0 than bool defaults to false? IMO bool is better because you can name the array something like isInList and then do Where((c,i) => !isInList[i]). And you avoid using magic numbers.

The following makes the assumption that your string contains a known set of characters. If you know for certain that, for example, a particular Unicode character never appears in the string, you can use it as a placeholder to mark which characters to remove. This should be very fast in exchange for this limitation:

char temp = '\uFFF0';
StringBuilder sb = new StringBuilder(Text);
for (int i = 0; i < Indexes.Count; i++)
{
    if (Indexes[i] < sb.Length)
    {
        sb[Indexes[i]] = temp;
    }
}

Text = sb.Replace(temp.ToString(), null).ToString();

This appears to be 3 to 4 times faster than building a HashSet like some other answers have suggested. http://ideone.com/mUILHg


If you cannot make the above assumption, you can build an array to hold this extra bit of data instead of using a sentinel character. This does two rounds of iteration (so it's a bit slower), but it is still O(n) (so it should typically be faster than putting Indexes into a hash set before iterating).

bool[] exclude = new bool[Text.Length];
for (int i = 0; i < Indexes.Count; i++)
{
    if (Indexes[i] < exclude.Length)
    {
        exclude[Indexes[i]] = true;
    }
}
StringBuilder sb = new StringBuilder(Text.Length);
for (int i = 0; i < Text.Length; i++)
{
    if (!exclude[i])
    {
        sb.Append(Text[i]);
    }
}
Text = sb.ToString();

Quick benchmarks: http://ideone.com/3d2uPH

4 Comments

Why didn't you use the null char? stackoverflow.com/questions/3670505/…
surprisingly char array gave me the same speed: var a = Text.ToCharArray(); foreach (int i in Indexes) if (i < Text.Length) a[i] = ' '; var result = new string(a).Replace(" ", "");
never mind .. char array is faster for me when the Indexes are more than the characters, but gets slower when the characters are 10 times or more than the Indexes
@Byyo That works as well, but valid strings can contain the null character (which is just Unicode character \u0000), so the issue remains.

A modified solution using a byte array (which may be replaced by a bool array) instead of a hash table. Pros: linear complexity. Cons: needs extra memory for the flag array.

string Text = "012345678901234567890123456789";
List<int> Indexes = new List<int>() { 2, 4, 7, 9, 15, 18, 23, 10, 1, 2, 15, 40 };
byte[] contains = new byte[Text.Length];
Indexes.ForEach(p => { if (p < Text.Length) contains[p] = 1; });
var output = string.Concat(Enumerable.Range(0, Text.Length).Where(p => contains[p] != 1).Select(p => Text[p]));

1 Comment

Providing extra information to clarify your answer is a good practice, instead of just posting code.
