101

I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.

Unfortunately, get-content | %{ whatever($_) } appears to keep the entire set of lines at this stage of the pipe in memory. It's also surprisingly slow, taking a very long time to actually read it all in.

So my question is two parts:

  1. How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
  2. How can I make it run faster? PowerShell iterating over a get-content appears to be 100x slower than a C# script.

I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...

6
  • 10
    To speed get-content up, set -ReadCount to 512. Note that at this point, $_ in the Foreach will be an array of strings (see the sketch just after these comments). Commented Nov 16, 2010 at 14:42
  • 1
    Still, I'd go with Roman's suggestion of using the .NET reader - much faster. Commented Nov 16, 2010 at 16:53
  • 9
    To minimize buffering avoid assigning the result of Get-Content to a variable as that will load the entire file into memory. By default, in a pipeline, Get-Content processes the file one line at a time. As long as you aren't accumulating the results or using a cmdlet which internally accumulates (like Sort-Object and Group-Object) then the memory hit shouldn't be too bad. Foreach-Object (%) is a safe way to process each line, one at a time. Commented Nov 16, 2010 at 23:52
  • 1
    Forget the buffering, it's more to do with the Foreach-Object/% block defaulting to using -End if no property is given. Try get-content | % -Process { whatever($_) } if you want it to execute on each line as they come in. Commented Mar 12, 2015 at 23:06
  • 3
    @dwarfsoft that doesn't make any sense. The -End block only runs once after all the processing is done. You can see that if you try to use get-content | % -End { } then it complains because you haven't provided a process block. So it can't be using -End by default, it must be using -Process by default. And try 1..5 | % -process { } -end { 'q' } and see that the end block only happens once, the usual gc | % { $_ } wouldn't work if the scriptblock defaulted to being -End... Commented Apr 21, 2017 at 17:22
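A minimal sketch of the -ReadCount approach mentioned in the first comment; my.log and Parse-Line are placeholders for the real file and the real per-line processing. With -ReadCount, each object coming down the pipeline is an array of lines, so an inner loop is needed:

# Read 512 lines at a time; $_ is then a string array
Get-Content my.log -ReadCount 512 | ForEach-Object {
    foreach ($line in $_) {
        Parse-Line $line   # hypothetical per-line processing
    }
}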

4 Answers

100

If you really have to work on multi-gigabyte text files, then do not use PowerShell. Even if you find a way to read the file faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid this. Even simple loops are expensive; for 10 million iterations (quite realistic in your case) we have:

# "empty" loop: takes 10 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) {} }

# "simple" job, just output: takes 20 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }

# "more real job": 107 seconds
measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } }

UPDATE: If you are still not scared, then try to use the .NET reader:

$reader = [System.IO.File]::OpenText("my.log")
try {
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}

UPDATE 2

There are comments about possibly better / shorter code. There is nothing wrong with the original code with for, and it is not pseudo-code. But the shorter (shortest?) variant of the reading loop is:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
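Note that this shorter variant, as written, does not close the reader when it finishes. A sketch that combines it with the cleanup from the first snippet:

$reader = [System.IO.File]::OpenText("my.log")
try {
    while($null -ne ($line = $reader.ReadLine())) {
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}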

13 Comments

FYI, script compilation in PowerShell V3 improves the situation a bit. The "real job" loop went from 117 seconds on V2 to 62 seconds on V3 typed at the console. When I put the loop into a script and measured script execution on V3, it drops to 34 seconds.
oops that's supposed to be -ne for not equal. That particular do..while loop has the problem that the null at the end of the file will be processed (in this case output). To work around that too you could have for ( $line = $reader.ReadLine(); $line -ne $null; $line = $reader.ReadLine() ) { $line }
@BeowulfNode42, we can do this even shorter: while($null -ne ($line = $reader.ReadLine())) {$line}. But the topic is not really about such things.
@RomanKuzmin +1 that while-loop snippet you commented, it's easy to understand and would make a nice answer. However your actual answer with the for(;;) leaves me puzzled, is it pseudo-code or actually legit powershell syntax? Thanks a bunch if you'd like to elaborate a bit.
for() means an infinite loop
53

System.IO.File.ReadLines() is perfect for this scenario. It returns all the lines of a file, but lets you begin iterating over the lines immediately, which means it does not have to store the entire contents in memory.

Requires .NET 4.0 or higher.

foreach ($line in [System.IO.File]::ReadLines($filename)) {
    # do something with $line
}

http://msdn.microsoft.com/en-us/library/dd383503.aspx
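Because ReadLines returns a lazy enumerable, it can also be fed straight into the pipeline if that style is preferred. A minimal sketch (the pattern and the processing block are placeholders):

# Enumerate the file lazily and filter/process line by line
[System.IO.File]::ReadLines($filename) |
    Where-Object { $_ -match 'ERROR' } |
    ForEach-Object { "matched: $_" }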

3 Comments

A note is needed: .NET Framework - Supported in: 4.5, 4. Thus, this may not work in V2 or V1 on some machines.
This gave me a "System.IO.File does not exist" error, but the code above by Roman worked for me.
This was just what I needed, and was easy to drop directly into an existing powershell script.
1

If you want to use straight PowerShell, check out the code below.

$content = Get-Content C:\Users\You\Documents\test.txt
foreach ($line in $content)
{
    Write-Host $line
}
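As noted in the comments on the question, assigning the result of Get-Content to a variable loads the whole file into memory first. If you stay with plain Get-Content, a lower-memory variant is to stream it through the pipeline instead (still slow on huge files, but it processes one line at a time):

Get-Content C:\Users\You\Documents\test.txt | ForEach-Object {
    Write-Host $_
}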

1 Comment

That is what the OP wanted to get rid of because Get-Content is very slow on large files.
0

For those interested...

A bit of perspective on this, since I had to work with very large files.

Below are the results on a 39 GB XML file containing 56 million lines/records. The lookup text is a 10-digit number.

1) GC -rc 1000 | % -match -> 183 seconds
2) GC -rc 100 | % -match  -> 182 seconds
3) GC -rc 1000 | % -like  -> 840 seconds
4) GC -rc 100 | % -like   -> 840 seconds
5) sls -simple            -> 730 seconds
6) sls                    -> 180 seconds (sls default uses regex, but pattern in my case is passed as literal text)
7) Switch -file -regex    -> 258 seconds
8) IO.File.Readline       -> 250 seconds

1 and 6 are clear winners, but I have gone with 1.

P.S. The test was conducted on a Windows Server 2012 R2 server with PS 5.1. The server has 16 vCPUs and 64 GB of memory, but only 1 CPU was utilised for this test, and the PS process memory footprint stayed minimal, since the tests above use very little memory.
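For reference, option 1 above, written out without aliases, corresponds to something like the following (the path and the 10-digit pattern are placeholders; with -ReadCount 1000 each pipeline object is an array of 1000 lines, and -match applied to an array returns the lines that match):

Get-Content .\big.xml -ReadCount 1000 | ForEach-Object {
    $_ -match '1234567890'   # outputs the matching lines from this 1000-line block
}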
