
I am writing a simple script (or so I thought) to replace some strings in CSV files. Those strings are so-called "keys" of objects: I basically replace the "old key" in the files with a "new key".

function simpleStringReplacement {
    param (
        $sourceFiles,  # list of CSV files in which we need to replace contents
        $mappingList,  # a file that contains 2 columns: the old key and the new key
        $exportFolder, # folder where I expect the results
        $FieldsToSelectFromTargetFilesIntoMappingFile # as the names of the fields that contain the replacement values change, I pass them in via this array
    )
    $totalitems = $sourceFiles.count
    $currentrow = 0
    Write-Output "Importing mapper file $mappingList" | logText
    $findReplaceList = Import-Csv -Path $mappingList -Delimiter ';'
    foreach ($sourceFile in $sourceFiles) {
        $currentrow += 1
        Write-Output "Working on $currentrow : $sourceFile" | logText
        [string] $txtsourceFile = Get-Content $sourceFile.FullName | Out-String
        $IssueKey = $FieldsToSelectFromTargetFilesIntoMappingFile[0]
        $OldIssueKey = $FieldsToSelectFromTargetFilesIntoMappingFile[1]

        ForEach ($findReplaceItem in $findReplaceList) {
            $txtsourceFile = $txtsourceFile -replace $findReplaceItem.$OldIssueKey, $findReplaceItem.$IssueKey
        }
        $outputFileName = $sourceFile.Name.Substring(0, $sourceFile.Name.IndexOf('.csv')) + "_newIDs.csv"
        $outputFullFileName = Join-Path -Path $exportFolder -ChildPath $outputFileName
        Write-Output "Writing result to $currentrow : $outputFullFileName" | logText
        $txtsourceFile | Set-Content -Path $outputFullFileName
    }
}
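
For reference, a call looks roughly like this (the paths and the column-name array are placeholders, not my real values):

simpleStringReplacement `
    -sourceFiles (Get-ChildItem -Path 'C:\exports' -Filter '*.csv') `
    -mappingList 'C:\mapping\key-mapping.csv' `
    -exportFolder 'C:\exports\converted' `
    -FieldsToSelectFromTargetFilesIntoMappingFile @('NewKey', 'OldKey') # [0] = column holding the new key, [1] = column holding the old key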

The issue I have: already when the script is working on the first file (the first iteration of the outer loop) I get:

Insufficient memory to continue the execution of the program.

This error references the line in my code that does the replacement:

$txtsourceFile = $txtsourceFile -replace $findReplaceItem.$OldIssueKey, $findReplaceItem.$IssueKey

The CSV files are "big" but really not that big: the mappingList is 1.7 MB, and each source file is around 1.5 MB.

I can't really understand how I run into memory issues with these file sizes, and of course I have no idea how to avoid the problem.

I found some blog posts about memory issues in PowerShell. They all end up changing the PowerShell MaxMemoryPerShellMB quota default. That doesn't work for me at all, as I run into an error with

get-item WSMAN:\localhost\shell\MaxMemoryPerShellMB

Saying "get-item : Cannot find path 'WSMan:\localhost\Shell\MaxMemorPerShellMB' because it does not exist."

I am working in VS Code.

  • Short update: if I check the system memory consumption during execution, the Windows PowerShell process takes up to 3.2 GB before it is stopped with the exception. Commented Oct 30, 2019 at 18:28
  • How many issue keys might there be in the $mappingList file? And for a given $sourceFile how many of its keys might be remapped? Though both files are less than a mere 2 MB, every time the error line you referenced results in a change it will produce a slightly different but still entirely new [String] object representing the complete source file. If you have, say, 10,000 mappings defined and 1,000 of them are found in the source file, that's 1,000 × 1.7 MB = 1.7 GB of garbage to collect. The math gets worse if the mappings are shorter but greater in number. Commented Oct 30, 2019 at 19:38
  • @BACON is suggesting the same thing I was thinking, but I don't know enough about gc in PowerShell. Are you sure it was only in the copy? The same misspelling is in the error message? Commented Oct 30, 2019 at 19:40
  • Also, when you say you're replacing "keys", are you remapping entire column (cell) values, or is it arbitrary search text that could be a substring of a value (like a profanity filter)? This could be processed line-by-line using a [Hashtable]/[Dictionary] to perform the mappings, which should greatly reduce the run-time as well as memory usage, but it would require that entire values are being replaced. Commented Oct 30, 2019 at 19:57

2 Answers


As @BACON alludes to in the comments, the core issue here is caused by looping through (likely) several thousand replacements.

Every time the replacement line executes:

$txtsourceFile = $txtsourceFile -replace $findReplaceItem.$OldIssueKey, $findReplaceItem.$IssueKey

PowerShell already holds one chunk of memory for $txtsourceFile, and it allocates a new chunk of memory to store a copy of the data after the text replacement.

This is normally "ok": you end up with one valid chunk of memory holding the replacement text and an "invalid" copy holding the original text. Most people have (relatively) lots of memory, and .NET normally handles this "leaking" by periodically running a garbage collector in the background to clean up the invalid data.

The trouble is that when we loop several thousand times rapidly, we generate several thousand copies of the data just as rapidly. You eventually run out of available free memory (the 3.2 GB you observed) before the garbage collector has a chance to run and clean up the thousands of invalid copies. See: No garbage collection while PowerShell pipeline is executing
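
To make the copying visible in isolation, here is a small synthetic illustration (throwaway data, not your files): every matching -replace on a multi-megabyte string returns a complete new string, so the managed heap keeps growing until a collection finally runs:

# build a synthetic "CSV" of roughly a megabyte that contains many keys
$big = (0..30000 | ForEach-Object { "KEY-$_;field1;field2" }) -join "`r`n"

"Heap before: {0:N0} bytes" -f [System.GC]::GetTotalMemory($false)
for ($i = 0; $i -lt 500; $i++) {
    # each matching -replace allocates a brand-new copy of the whole string
    $big = $big -replace "KEY-$i;", "NEWKEY-$i;"
}
"Heap after:  {0:N0} bytes" -f [System.GC]::GetTotalMemory($false)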

There are a couple of ways to work around this:

Solution 1: The big, slow, and inefficient way

If you need to work with the whole file (i.e. across newlines) you can use the same code and manually run the Garbage Collector periodically during the execution to manage the memory "better":

$count = 0

ForEach ($findReplaceItem in $findReplaceList) {
    $txtsourceFile = $txtsourceFile -replace  $findReplaceitem.$OldIssueKey, $findReplaceitem.$IssueKey

    if(($count % 200) -eq 0)
    {
        [System.GC]::GetTotalMemory('forceFullCollection') | out-null
    }
    $count++
}

This does 2 things:

  1. Runs the garbage collection every 200 loops ($count modulo 200).
  2. Stops the current execution and forces the collection.

Note:

Normally you use:

[GC]::Collect()

But according to Addressing the PowerShell Garbage Collection bug at J House Consulting this doesn't always work when trying to force the collection inside a loop. Using:

[System.GC]::GetTotalMemory('forceFullCollection')

fully stops execution until the garbage collection is complete before resuming.

Solution 2: The faster, more memory-efficient way, one line at a time

If you can perform all the replacements one line at a time, you can use [System.IO.StreamReader] to stream the file in line by line and [System.IO.StreamWriter] to write the results.

try
{
    $SR = New-Object -TypeName System.IO.StreamReader -ArgumentList $sourceFile.FullName
    $SW = [System.IO.StreamWriter] $outputFullFileName

    while (($line = $SR.ReadLine()) -ne $null) {
        # Loop through replacements, accumulating them all on the current line
        ForEach ($findReplaceItem in $findReplaceList) {
            $line = $line -replace $findReplaceItem.$OldIssueKey, $findReplaceItem.$IssueKey
        }
        $SW.WriteLine($line)
    }

    $SR.Close()
    $SW.Close()
}
finally
{
    # Cleanup
    if ($SR -ne $null)
    {
        $SR.Dispose()
    }
    if ($SW -ne $null)
    {
        $SW.Dispose()
    }
}

This should run an order of magnitude faster because you will be working one line at a time and won't be creating thousands of copies of the entire file with every replacement.
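
If, as @BACON suggests in the comments, the keys always occupy entire cells rather than arbitrary substrings, you can go a step further and replace the inner ForEach with a single hashtable lookup per cell. This is only a sketch, assuming ';'-delimited source files and that $OldIssueKey / $IssueKey still hold the mapping column names:

# Build the lookup table once, before reading any source file
$map = @{}
foreach ($item in $findReplaceList) {
    if (-not [string]::IsNullOrEmpty($item.$OldIssueKey)) {
        $map[$item.$OldIssueKey] = $item.$IssueKey
    }
}

# Then, inside the same StreamReader/StreamWriter loop as above:
while (($line = $SR.ReadLine()) -ne $null) {
    $cells = $line -split ';'
    for ($i = 0; $i -lt $cells.Count; $i++) {
        if ($map.ContainsKey($cells[$i])) { $cells[$i] = $map[$cells[$i]] }
    }
    $SW.WriteLine(($cells -join ';'))
}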


1 Comment

I found the answer and comments above very helpful and implemented a solution that is close to the answer here: I split the $findReplaceList into multiple batches (it is around 37000 entries long; I started splitting into 1000) and work on it batch by batch with GC in between.

I found the answer and comments above very helpful and implemented a solution that is close to the answer here: I split the $findReplaceList into multiple batches (it is around 37000 entries long; I started splitting into 1000) and work on it batch by batch with GC in between. Now I can watch the memory usage climb during a batch and drop again when one is done.
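
Roughly, the batching looks like this (a sketch only, reusing the variable names from the question; 1000 is simply the batch size I started with):

$batchSize = 1000
for ($start = 0; $start -lt $findReplaceList.Count; $start += $batchSize) {
    $end = [Math]::Min($start + $batchSize, $findReplaceList.Count) - 1
    foreach ($findReplaceItem in $findReplaceList[$start..$end]) {
        $txtsourceFile = $txtsourceFile -replace $findReplaceItem.$OldIssueKey, $findReplaceItem.$IssueKey
    }
    # force a full, blocking garbage collection between batches
    [System.GC]::GetTotalMemory('forceFullCollection') | Out-Null
}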

With that I found an interesting behavior: the memory issue still came up in a few of the batches... So I analysed the $findReplaceList further, with the following result:

There are cases where there is NO $OldIssueKey in the file.

Can it be that PS then sees that as an empty string and tries to replace all those?

1 Comment

That produces a very interesting result! Matching on an empty string: "abc" -replace "","z" returns zazbzcz. It looks like it matches at every single character (including the end of the line) and replaces it with the replacement text plus the existing character. So if you have a huge file, it would definitely run into additional memory issues if it matches every character.
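
Given that behaviour, a simple guard (a sketch, using the column variable from the question, applied after $OldIssueKey has been assigned) is to drop mapping rows that have no old key before looping over them:

$findReplaceList = $findReplaceList | Where-Object {
    -not [string]::IsNullOrWhiteSpace($_.$OldIssueKey)
}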
