11

I have a .txt file (more than a million rows, around 1 GB) and a list of strings. I am trying to remove every row from the file that appears in the list of strings and write the remaining rows to a new file, but it is taking a very long time.

using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}

How can I enhance the performance of my code?

  • One simple way would be to convert _lstLineToRemove from List<string> to HashSet<string>, assuming it isn't a hash set already (a one-line sketch follows these comments). Commented Apr 20, 2016 at 12:48
  • The speed is roughly 1 MB/s, which seems pretty slow. Where do you read from and where do you write to? HDD? SSD? Flash? Reading and writing on the same physical drive reduces speed. What if you remove the check and let it write all lines, how fast is it then? If it is still the same 15 minutes, the bottleneck is the file system; if it is much less, there is a way to optimize the algorithm. Commented Apr 20, 2016 at 13:08
  • Another guess would be to replace your List<String> with a HashSet<String>. Commented Apr 20, 2016 at 13:09
  • You won't really be able to speed up the IO beyond what you have. But as others have pointed out, replacing List<string> with HashSet<string> is likely to help a lot, especially if _lstLineToRemove contains hundreds of lines. Commented Apr 20, 2016 at 13:12
  • Question: how long does it take to simply COPY this file? You cannot get any faster than that benchmark, so please note that first. Then tell me how long your application takes. Commented Apr 20, 2016 at 14:07
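A minimal sketch of the HashSet suggestion from the comments, assuming the list already exists as _lstLineToRemove; the field name _hshLineToRemove and the ordinal comparer are illustrative choices, not something from the question:

// Build the lookup set once; HashSet<string>.Contains is O(1) on average,
// while List<string>.Contains scans the whole list for every input line.
HashSet<string> _hshLineToRemove = new HashSet<string>(_lstLineToRemove, StringComparer.Ordinal);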

4 Answers

4

You may get some speedup by using PLINQ to do the work in parallel; switching from a list to a hash set will also greatly speed up the Contains() check. HashSet is thread safe for read-only operations.

private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines
        .AsParallel()
        .AsOrdered()
        .Where(line => !_hshLineToRemove.Contains(line));
    File.WriteAllLines(_outputFileName, filteredInputLines);
}

If it does not matter that the output file be in the same order as the input file, you can remove the .AsOrdered() and gain some additional speed.

Beyond this you are really just I/O bound; the only way to make it any faster is to run it on faster drives.

7 Comments

This is no different from the other answer, so the same comment applies: loading the data is an IO operation with enough lag to allow filtering while reading each line. Loading everything in memory will only slow filtering dramatically and increase CPU usage due to looping. Log processing code does not load everything in memory.
@PanagiotisKanavos This does not load everything into memory; I am using ReadLines, not ReadAllLines like the other answer, so the processing is streamed instead of loaded into memory all at once. Also, I posted an update as you were typing that does address the I/O.
It should be => !_hshLineToRemove.Contains(...
I suspect ReadLines has a greater effect than PLINQ. In fact, I'd use TPL Dataflow to split reading from writing. I'd also move the target file to a different drive. I suspect this would improve performance more than multithreading. Only if that wasn't enough would I add a filtering block.
@PanagiotisKanavos I originally was going to write a Dataflow answer but thought it was overkill for this problem. However, if you wrote up a Dataflow implementation of this I would upvote it.
0

The code is particularly slow because the reader and writer never execute in parallel. Each has to wait for the other.

You can almost double the speed of file operations like this by having a reader thread and a writer thread. Put a BlockingCollection between them so you can communicate between the threads and limit how many rows you buffer in memory.

If the computation is really expensive (it isn't in your case), a third thread with another BlockingCollection doing the processing can help too.
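A rough sketch of that two-thread split, assuming the field names from the question plus the hash set discussed in the comments above (_hshLineToRemove); the bounded capacity of 10000 is just an illustrative value:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

void ProcessFiles()
{
    // Bounded so the reader cannot run arbitrarily far ahead of the writer;
    // Add blocks once the buffer holds 10000 lines.
    using (var buffer = new BlockingCollection<string>(boundedCapacity: 10000))
    {
        var readerTask = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(_inputFileName))
            {
                if (!_hshLineToRemove.Contains(line))
                    buffer.Add(line);
            }
            buffer.CompleteAdding();   // tell the writer no more lines are coming
        });

        // The writer runs on the current thread and drains the buffer as lines arrive.
        using (var writer = new StreamWriter(_outputFileName))
        {
            foreach (var line in buffer.GetConsumingEnumerable())
                writer.WriteLine(line);
        }

        readerTask.Wait();   // surface any exception thrown on the reader thread
    }
}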

3 Comments

That's already available out of the box with Dataflow's ActionBlock.
For new development I honestly don't use BlockingCollection pipelines anymore; I have switched over to TPL Dataflow. It gives you the same process as a manual BlockingCollection-based pipeline, with a separate Task for each stage of the work, but it wraps it all up in nice container classes so you don't need to deal with the collections or start up the tasks yourself (a rough sketch follows these comments).
@ScottChamberlain and it allows you to throttle the reader so it doesn't fill the buffer with all the lines if the writer is slow
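For reference, a rough Dataflow version of the same pipeline, assuming the System.Threading.Tasks.Dataflow package and the same field names as above; the BoundedCapacity of 10000 is again just an illustrative value:

using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

async Task ProcessFilesAsync()
{
    using (var writer = new StreamWriter(_outputFileName))
    {
        // The ActionBlock takes the place of the writer thread; its bounded
        // capacity throttles the reader so the whole file is never held in memory.
        var writeBlock = new ActionBlock<string>(
            line => writer.WriteLine(line),
            new ExecutionDataflowBlockOptions { BoundedCapacity = 10000 });

        foreach (var line in File.ReadLines(_inputFileName))
        {
            if (!_hshLineToRemove.Contains(line))
                await writeBlock.SendAsync(line);   // waits while the block is full
        }

        writeBlock.Complete();
        await writeBlock.Completion;   // wait for all queued lines to be written
    }
}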
0

Do not use buffered text routines. Use binary, unbuffered library routines and make your buffer size as big as possible. That's how to make it the fastest.
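One way to read this advice in C# while keeping the line-level filtering from the question is to give both streams much larger buffers than the defaults; the 4 MB size and the UTF-8 encoding are illustrative assumptions:

// Requires System.IO and System.Text.
const int BufferSize = 4 * 1024 * 1024;   // illustrative 4 MB; the defaults are only a few KB

using (var input = new FileStream(_inputFileName, FileMode.Open, FileAccess.Read,
                                  FileShare.Read, BufferSize, FileOptions.SequentialScan))
using (var output = new FileStream(_outputFileName, FileMode.Create, FileAccess.Write,
                                   FileShare.None, BufferSize))
using (var reader = new StreamReader(input, Encoding.UTF8, true, BufferSize))
using (var writer = new StreamWriter(output, Encoding.UTF8, BufferSize))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (!_hshLineToRemove.Contains(line))
            writer.WriteLine(line);
    }
}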


0

Have you considered using AWK?

AWK is a very powerful tool for processing text files; you can find more information about how to filter lines that match a certain criterion in Filter text with AWK.
