11

I have a .txt file (more than a million rows, around 1 GB) and a list of strings. I am trying to remove every row from the file that appears in the list of strings and write the remaining rows to a new file, but it is taking a very long time.

using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}

How can I enhance the performance of my code?

  • One simple way would be to convert _lstLineToRemove from List<string> to HashSet<string>, assuming it isn't a hash set already (a one-line sketch follows these comments). Commented Apr 20, 2016 at 12:48
  • The speed is roughly 1 MB/s, which seems pretty slow. Where do you read from and where do you write to? HDD? SSD? Flash? Reading and writing on the same physical drive reduces speed. What if you remove the check and let it write all lines, how fast is it then? If it is still the same 15 minutes, the bottleneck is the file system; if it is much less, there is a way to optimize the algorithm. Commented Apr 20, 2016 at 13:08
  • Another guess would be to replace your List<String> with a HashSet<String>. Commented Apr 20, 2016 at 13:09
  • You won't really be able to speed up the IO beyond what you have. But as others have pointed out, replacing List<string> with HashSet<string> is likely to help a lot, especially if _lstLineToRemove contains hundreds of lines. Commented Apr 20, 2016 at 13:12
  • Question: how long does it take to simply COPY this file? You cannot get any faster than that benchmark, so please note that first. Then tell me how long your application takes. Commented Apr 20, 2016 at 14:07
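A minimal sketch of the HashSet suggestion from the comments, assuming the list already exists as _lstLineToRemove; the field name _hshLineToRemove and the ordinal comparer are illustrative choices, not something from the question:

// Build the lookup set once; HashSet<string>.Contains is O(1) on average,
// while List<string>.Contains scans the whole list for every input line.
HashSet<string> _hshLineToRemove = new HashSet<string>(_lstLineToRemove, StringComparer.Ordinal);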

4 Answers

4

You may get some speedup by using PLINQ to do the work in parallel; switching from a list to a hash set will also greatly speed up the Contains() check. HashSet is thread safe for read-only operations.

private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines
        .AsParallel()
        .AsOrdered()
        .Where(line => !_hshLineToRemove.Contains(line));
    File.WriteAllLines(_outputFileName, filteredInputLines);
}

If it does not matter that the output file be in the same order as the input file, you can remove the .AsOrdered() and gain some additional speed.

Beyond this you are really just I/O bound; the only way to make it any faster is to run it on faster drives.

7 Comments

This is no different from the other answer, so the same comment applies: loading the data is an IO operation with enough lag to allow filtering while reading each line. Loading everything in memory will only slow filtering dramatically and increase CPU usage due to looping. Log processing code does not load everything in memory.
@PanagiotisKanavos This does not load everything into memory; I am using ReadLines, not ReadAllLines like the other answer, so the processing is streamed instead of loaded into memory all at once. Also, I posted an update as you were typing that does address the I/O.
It should be => !_hshLineToRemove.Contains(...
I suspect ReadLines has a greater effect than PLINQ. In fact, I'd use TPL Dataflow to split reading from writing. I'd also move the target file to a different drive. I suspect this would improve performance more than multithreading. Only if that wasn't enough would I add a filtering block.
@PanagiotisKanavos I originally was going to write a Dataflow answer but thought it was overkill for this problem. However, if you wrote up a Dataflow implementation of this I would upvote it.
0

The code is particularly slow because the reader and writer never execute in parallel. Each has to wait for the other.

You can almost double the speed of file operations like this by having a reader thread and a writer thread. Put a BlockingCollection between them so you can communicate between the threads and limit how many rows you buffer in memory.

If the computation is really expensive (it isn't in your case), a third thread with another BlockingCollection doing the processing can help too.
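A rough sketch of that two-thread split, assuming the field names from the question plus the hash set discussed in the comments above (_hshLineToRemove); the bounded capacity of 10000 is just an illustrative value:

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

void ProcessFiles()
{
    // Bounded so the reader cannot run arbitrarily far ahead of the writer;
    // Add blocks once the buffer holds 10000 lines.
    using (var buffer = new BlockingCollection<string>(boundedCapacity: 10000))
    {
        var readerTask = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(_inputFileName))
            {
                if (!_hshLineToRemove.Contains(line))
                    buffer.Add(line);
            }
            buffer.CompleteAdding();   // tell the writer no more lines are coming
        });

        // The writer runs on the current thread and drains the buffer as lines arrive.
        using (var writer = new StreamWriter(_outputFileName))
        {
            foreach (var line in buffer.GetConsumingEnumerable())
                writer.WriteLine(line);
        }

        readerTask.Wait();   // surface any exception thrown on the reader thread
    }
}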

3 Comments

That's already available out of the box with Dataflow's ActionBlock.
For new development I honestly don't use BlockingCollection pipelines anymore; I have switched over to TPL Dataflow. It gives you the same process as a manual BlockingCollection-based pipeline, with a separate Task for each stage of the work, but it wraps it all up in nice container classes so you don't need to deal with the collections or start up the tasks yourself (a rough sketch follows these comments).
@ScottChamberlain and it allows you to throttle the reader so it doesn't fill the buffer with all the lines if the writer is slow
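For reference, a rough Dataflow version of the same pipeline, assuming the System.Threading.Tasks.Dataflow package and the same field names as above; the BoundedCapacity of 10000 is again just an illustrative value:

using System.IO;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

async Task ProcessFilesAsync()
{
    using (var writer = new StreamWriter(_outputFileName))
    {
        // The ActionBlock takes the place of the writer thread; its bounded
        // capacity throttles the reader so the whole file is never held in memory.
        var writeBlock = new ActionBlock<string>(
            line => writer.WriteLine(line),
            new ExecutionDataflowBlockOptions { BoundedCapacity = 10000 });

        foreach (var line in File.ReadLines(_inputFileName))
        {
            if (!_hshLineToRemove.Contains(line))
                await writeBlock.SendAsync(line);   // waits while the block is full
        }

        writeBlock.Complete();
        await writeBlock.Completion;   // wait for all queued lines to be written
    }
}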
0

Do not use buffered text routines. Use binary, unbuffered library routines and make your buffer size as big as possible. That's how to make it the fastest.
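One way to read this advice in C# while keeping the line-level filtering from the question is to give both streams much larger buffers than the defaults; the 4 MB size and the UTF-8 encoding are illustrative assumptions:

// Requires System.IO and System.Text.
const int BufferSize = 4 * 1024 * 1024;   // illustrative 4 MB; the defaults are only a few KB

using (var input = new FileStream(_inputFileName, FileMode.Open, FileAccess.Read,
                                  FileShare.Read, BufferSize, FileOptions.SequentialScan))
using (var output = new FileStream(_outputFileName, FileMode.Create, FileAccess.Write,
                                   FileShare.None, BufferSize))
using (var reader = new StreamReader(input, Encoding.UTF8, true, BufferSize))
using (var writer = new StreamWriter(output, Encoding.UTF8, BufferSize))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (!_hshLineToRemove.Contains(line))
            writer.WriteLine(line);
    }
}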


0

Have you considered using AWK?

AWK is a very powerful tool for processing text files; you can find more information about how to filter lines that match a certain criterion in Filter text with AWK.
