
I have a large CSV file in which some fields contain embedded newlines. Excel 2016 produces errors when importing a CSV whose rows contain fields with embedded newlines.

Based on this post, I wrote code to replace any newline in any field with a space. Below is a code block that reproduces the functionality and the issue. Option 1 works. Option 2, which is commented out, casts my object to a string. I was hoping Option 2 might run faster.

Question: Is there a better way to do this to optimize for performance processing very large files?

$array = @([PSCustomObject]@{"ID"="1"; "Name"="Joe`nSmith"},
           [PSCustomObject]@{"ID"="2"; "Name"="Jasmine Baker"})

$array = $array | ForEach-Object {
    #Option 1: produces an Object, but is code optimized?
    foreach ($n in $_.PSObject.Properties.Name) {
        $_.PSObject.Properties[$n].Value =
            $_.PSObject.Properties[$n].Value -replace "`n", " "
    }

    #Option 2: produces a string, not an object
    #$_ = $_ -replace "`n", " "

    $_
}
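For context, -replace is a string operator: a non-string left operand is first converted with ToString(), which is why Option 2 returns a string rather than an object. A minimal illustration:

    # -replace coerces a non-string operand to a string via ToString(),
    # so the PSCustomObject is lost and only its string form survives
    $obj = [PSCustomObject]@{ "ID" = "1"; "Name" = "Joe`nSmith" }
    $result = $obj -replace "`n", " "
    $result.GetType().FullName   # System.String
    $result                      # @{ID=1; Name=Joe Smith}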

Keep in mind that in my real-world use case, each row has more than 15 fields, and any combination of them may have one or more embedded newlines.

  • I read somewhere that the foreach statement is faster than ForEach-Object. You could do some testing of both scenarios you mention with the Stopwatch class: create it with $resultstime = [Diagnostics.Stopwatch]::StartNew() and read $resultstime.Elapsed when done. Remember to stop it :) (a cleaned-up sketch follows these comments) Commented May 29, 2017 at 14:39
  • Thanks Martin. I'll use the stopwatch class and time my original code compared with the suggestions from @wOxxOm. Commented May 29, 2017 at 21:57
  • Please do share your findings :) Commented May 30, 2017 at 7:39
  • Using my data set which is about 16.5 MB and 25,000 rows, the ForEach-Object loop ran in 36 seconds. The foreach loop ran in 11. Commented May 30, 2017 at 13:12
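A cleaned-up sketch of the Stopwatch timing suggested in the first comment (illustrative only; substitute whichever loop you want to measure):

    # Time a code block with the .NET Stopwatch class
    $sw = [Diagnostics.Stopwatch]::StartNew()

    # ... run the ForEach-Object or foreach version here ...

    $sw.Stop()      # remember to stop it
    $sw.Elapsed     # elapsed time as a TimeSpan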

1 Answer

  • Use the fast TextFieldParser to read, process, and build the CSV from the file (PowerShell 3+):

    # Load the assembly that contains TextFieldParser
    Add-Type -AssemblyName Microsoft.VisualBasic
    $parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser 'r:\1.csv'
    $parser.SetDelimiters(',')
    $header = $parser.ReadFields()   # the first row supplies the column names

    $CSV = while (!$parser.EndOfData) {
        $i = 0
        $row = [ordered]@{}
        foreach ($field in $parser.ReadFields()) {
            # replace embedded newlines as each field is read
            $row[$header[$i++]] = $field.Replace("`n", ' ')
        }
        [PSCustomObject]$row
    }
    $parser.Close()   # release the file handle
    
  • Or modify each field in-place in an already existing CSV array:

    # walk every property (column) of every row object
    foreach ($row in $CSV) {
        foreach ($field in $row.PSObject.Properties) {
            $field.Value = $field.Value.Replace("`n", ' ')
        }
    }
    
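Either way, the cleaned rows can be written back out with Export-Csv; a minimal sketch, assuming a hypothetical output path:

    # Hypothetical output path; Export-Csv re-quotes fields as needed
    $CSV | Export-Csv 'r:\1-clean.csv' -NoTypeInformation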

Notes:

  1. The foreach statement is much faster than piping to ForEach-Object (which, confusingly, is also aliased as foreach)
  2. $stringVariable.Replace() is faster than the -replace operator
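On note 2, keep in mind the two are not drop-in equivalents: .Replace() does a literal, case-sensitive substitution, while -replace matches a regular expression. For example:

    'a.b' -replace '.', 'X'    # XXX  (dot is a regex wildcard)
    'a.b'.Replace('.', 'X')    # aXb  (literal match only)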

4 Comments

Thanks for the detailed reply. I'll try both of these approaches and see which is faster.
foreach is definitely faster than ForEach-Object. The replace loop takes 36 seconds with ForEach-Object and only 11 seconds with a foreach loop.
I also gave up on trying to use -replace on an object; it just turns the result into a string. 11 seconds is plenty fast for a twice-weekly script.
Just to link questions, I've written a similar answer for a different question, where I wrapped TextFieldParser in a reusable function.
