
I need a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-Csv and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:

Add-Type -AssemblyName Microsoft.VisualBasic   # TextFieldParser lives in this assembly
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()
    $details = [ordered]@{
                            "Plugin ID" = $line[0]
                            CVE = $line[1]
                            CVSS = $line[2]
                            Risk = $line[3]     
                         }                        
    $export = New-Object PSObject -Property $details
    $export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"    
}

This little loop took nearly 2 minutes to process a 20 MB file. Scaling up at this speed would mean over an hour for the smallest CSV file I'm currently working with.

I've tried this as well:

$export = New-Object System.Collections.ArrayList   # collect rows here, export once at the end

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()  

    $details = [ordered]@{
                             # Same data as before
                         }

    $export.Add($details) | Out-Null        
}

$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"

This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:

"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"

So, two questions:

1) Can the first block of code be made faster?
2) How can I unwrap the ArrayList in the second example to get to the actual data? (A sketch follows below.)
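
For question 2, the underlying issue is that Export-Csv serializes the public properties of whatever object it receives; a raw OrderedDictionary only exposes Count, IsReadOnly, Keys, Values and so on, which is exactly what those rows show. A minimal sketch, assuming $export is the ArrayList built in the second loop, casts each dictionary to [pscustomobject] before exporting:

# Sketch only: convert each ordered dictionary into an object whose
# properties are the CSV field names, then export the whole collection once.
$export |
    ForEach-Object { [pscustomobject]$_ } |
    Export-Csv -Delimiter "|" -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"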

EDIT: Sample data found here - http://pastebin.com/6L98jGNg

  • Do the CSV files contain commas in the data? If not, reading the file line-by-line and replacing the commas with pipes is likely to be much faster. Commented Sep 16, 2016 at 17:12
  • Does "Data, removed to keep the post small" mean that you are processing the CSV as well as using pipes? Commented Sep 16, 2016 at 17:17
  • @AndrewMorton, yes. Commas and newlines. I've added a few more lines to see what's happening. I'm not piping anything, just adding data from the CSV into the $details variable. Commented Sep 16, 2016 at 17:20
  • 1) If there are newlines in the data (in addition to the newlines at the end of each row), then you probably need to set the TextFieldParser.HasFieldsEnclosedInQuotes Property to True. 2) Your second method will be storing all the data to be output in RAM, so it is likely to either run out of RAM or become very slow (due to paging to the hard drive) at larger file sizes. 3) Could you give us a few sample lines of the CSV file? Commented Sep 16, 2016 at 17:34
  • What do you know about your specific CSV format? Processing the raw text data without having to build PSCustomObjects or even arrays for every row would be an enormous reduction in overhead, but writing your own full CSV parser would take enough time to negate that. But if you can say anything with confidence like "all fields are quoted with double quotes", you might be able to build a good-enough parser to replace the separator and avoid changing values... Commented Sep 16, 2016 at 17:36

2 Answers


This is simple text processing, so the bottleneck should be disk read speed: roughly 1 second per 100 MB, or 10 seconds per 1 GB, for the OP's sample (repeated to the mentioned size) as measured here on an i7. The results would be worse for files with many/all small quoted fields.

The algorithm is simple:

  1. Read the file in big string chunks, e.g. 1 MB.
    This is much faster than reading millions of lines separated by CR/LF because:
    • fewer checks are performed, as we mostly look only for double quotes;
    • fewer iterations of our code are executed by the interpreter, which is slow.
  2. Find the next doublequote.
  3. Depending on the current $inQuotedField flag, decide whether the found double quote starts a quoted field (it should be preceded by , and, optionally, some spaces) or ends the current quoted field (it should be followed by any even number of double quotes, optionally spaces, then ,).
  4. Replace delimiters in the preceding span, or to the end of the 1 MB chunk if no quotes were found.

The code makes some reasonable assumptions, but it may fail to detect an escaped field if its double quote is followed or preceded by more than 3 spaces before/after the field delimiter. The checks wouldn't be too hard to add, and I might've missed some other edge case, but I'm not that interested.

$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM

$delim = [char]','
$newDelim = [char]'|'

$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
    $sourcePath,
    [IO.FileMode]::Open,
    [IO.FileAccess]::Read,
    [IO.FileShare]::Read,
    $buf.length,  # let OS prefetch the next chunk in background
    [IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)

$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)

while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
    $s = [string]::new($buf, 0, $nRead+$bufStart)
    $len = $s.length
    $pos = 0
    $out.Clear() >$null

    do {
        $iQuote = $s.IndexOf([char]'"', $pos)
        if ($inQuotedField) {
            $iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
            if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
                # no closing quote in buffer safezone
                $out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
                break
            }
            if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
                # even number of quotes are just quoted quotes
                $inQuotedField = $matches[1].length % 2 -eq 0
            }
            $out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
            $pos = $iDelim + 1
            continue
        }
        if ($iQuote -ge 0) {
            $iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
            if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
                $inQuotedField = $true
            }
            $replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
        } elseif ($pos -gt 0) {
            $replaced = $s.Substring($pos).Replace($delim, $newDelim)
        } else {
            $replaced = $s.Replace($delim, $newDelim)
        }
        $out.Append($replaced) >$null
        $pos = $iQuote + 1
    } while ($iQuote -ge 0)

    $target.Write($out)

    $bufStart = 0
    for ($i = $out.length; $i -lt $s.length; $i++) {
        $buf[$bufStart++] = $buf[$i]
    }
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
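
A quick way to sanity-check the result, assuming the converted file still follows normal CSV quoting rules, is to re-import a few rows with the new delimiter:

# Spot check: stream back the first few rows of the converted file.
Import-Csv -Path $targetPath -Delimiter '|' |
    Select-Object -First 5 |
    Format-Table -AutoSize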

1 Comment

Thanks for the example. I was able to use this, modify it a bit, and churn through my biggest files in seconds. :D

Still not what I would call fast, but this is considerably faster than what you have listed, using the -join operator:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    $line -join '|' | Add-Content C:\Temp\TestOutput.csv
}

That took a hair under 32 seconds to process a 20 MB file. At that rate, your 750 MB file would be done in under 20 minutes, and bigger files should go at about 26 minutes per gig.
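
Much of that remaining time is likely Add-Content reopening the output file for every row. A variant of the same loop that writes through a single StreamWriter instead (a sketch, not benchmarked against the timings above) would look like this:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")
$writer = New-Object System.IO.StreamWriter "C:\Temp\TestOutput.csv"

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    # Join the parsed fields with the new delimiter; StreamWriter buffers the writes.
    $writer.WriteLine(($line -join '|'))
}

$writer.Close()
$reader.Close()

Like the loop above, this drops any quotes around the fields, so it only applies if none of the values contain the new delimiter.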

