I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.
Unfortunately, `get-content | %{ whatever($_) }` appears to keep the entire set of lines at this stage of the pipe in memory. It's also surprisingly slow, taking a very long time to actually read it all in.
So my question is two parts:
- How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
- How can I make it run faster? PowerShell iterating over a `get-content` appears to be 100x slower than a C# script.
I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...
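
Spelled out, the pipeline is roughly this (`whatever` is a stand-in for my real parsing and database code, and the file name is made up):

```powershell
# whatever is a stand-in for my real per-line parsing + database insert
function whatever([string]$line) {
    # parse $line and write the extracted fields to the database
}

Get-Content .\huge.log | ForEach-Object {
    whatever $_
}
```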
If you want to speed up `get-content`, set `-ReadCount` to 512. Note that at this point, `$_` in the Foreach will be an array of strings.

Also, don't assign `Get-Content` to a variable, as that will load the entire file into memory. By default, in a pipeline, `Get-Content` processes the file one line at a time. As long as you aren't accumulating the results or using a cmdlet which internally accumulates (like `Sort-Object` and `Group-Object`), the memory hit shouldn't be too bad. `Foreach-Object` (`%`) is a safe way to process each line, one at a time.

Use `get-content | % -Process { whatever($_) }` if you want it to execute on each line as it comes in.

Try `get-content | % -End { }` and it complains because you haven't provided a process block. So it can't be using `-End` by default; it must be using `-Process` by default. And try `1..5 | % -process { } -end { 'q' }` and see that the end block only happens once; the usual `gc | % { $_ }` wouldn't work if the scriptblock defaulted to being `-End`...
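
If it helps, here is that `-Process`/`-End` experiment written out as a runnable snippet (the output strings are just placeholders to make the behaviour visible):

```powershell
# The bare scriptblock in `gc file | % { ... }` binds to -Process: every input
# object runs through the process block as it arrives, and -End runs only once.
1..5 | ForEach-Object -Process { "processing $_" } -End { "end block ran exactly once" }

# Supplying only -End (no process block) makes ForEach-Object complain that the
# Process parameter is missing, which is why the default cannot be -End:
# 1..5 | ForEach-Object -End { "only end" }
```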
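
And to tie the earlier `-ReadCount` suggestion back to the question's pipeline, a minimal sketch, assuming a placeholder `Process-Line` function and file name:

```powershell
# Process-Line is a placeholder for the question's whatever() parsing/DB code.
function Process-Line([string]$Line) {
    # parse $Line and write the result to the database
}

# With -ReadCount 512, Get-Content sends arrays of up to 512 lines down the
# pipeline instead of single strings, so $_ needs an inner loop.
Get-Content .\huge.log -ReadCount 512 | ForEach-Object {
    foreach ($line in $_) {
        Process-Line $line
    }
}
```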