
I need a way to change the delimiter in a CSV file from a comma to a pipe. Because of the size of the CSV files (~750 MB to several GB), using Import-Csv and/or Get-Content is not an option. What I'm using (and what works, albeit slowly) is the following code:

Add-Type -AssemblyName Microsoft.VisualBasic   # TextFieldParser lives in this assembly
$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()
    $details = [ordered]@{
                            "Plugin ID" = $line[0]
                            CVE = $line[1]
                            CVSS = $line[2]
                            Risk = $line[3]     
                         }                        
    $export = New-Object PSObject -Property $details
    $export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"    
}

This little loop took nearly 2 minutes to process a 20 MB file. Scaling up at this speed would mean over an hour for the smallest CSV file I'm currently working with.

I've tried this as well:

$export = New-Object System.Collections.ArrayList   # collect rows here, export once at the end

While(!$reader.EndOfData)
{   
    $line = $reader.ReadFields()  

    $details = [ordered]@{
                             # Same data as before
                         }

    $export.Add($details) | Out-Null        
}

$export | Export-Csv -Append -Delimiter "|" -Force -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"

This is MUCH FASTER but doesn't provide the right information in the new CSV. Instead I get rows and rows of this:

"Count"|"IsReadOnly"|"Keys"|"Values"|"IsFixedSize"|"SyncRoot"|"IsSynchronized"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"
"13"|"False"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"System.Collections.Specialized.OrderedDictionary+OrderedDictionaryKeyValueCollection"|"False"|"System.Object"|"False"

So, two questions:

1) Can the first block of code be made faster?
2) How can I unwrap the ArrayList in the second example to get to the actual data? (A sketch follows below.)
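
For question 2, the underlying issue is that Export-Csv serializes the public properties of whatever object it receives; a raw OrderedDictionary only exposes Count, IsReadOnly, Keys, Values and so on, which is exactly what those rows show. A minimal sketch, assuming $export is the ArrayList built in the second loop, casts each dictionary to [pscustomobject] before exporting:

# Sketch only: convert each ordered dictionary into an object whose
# properties are the CSV field names, then export the whole collection once.
$export |
    ForEach-Object { [pscustomobject]$_ } |
    Export-Csv -Delimiter "|" -NoTypeInformation -Path "C:\MyFolder\Delimiter Change.csv"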

EDIT: Sample data found here - http://pastebin.com/6L98jGNg

  • Do the CSV files contain commas in the data? If not, reading the file line-by-line and replacing the commas with pipes is likely to be much faster. Commented Sep 16, 2016 at 17:12
  • Does "Data, removed to keep the post small" mean that you are processing the CSV as well as using pipes? Commented Sep 16, 2016 at 17:17
  • @AndrewMorton, yes. Commas and newlines. I've added a few more lines to see what's happening. I'm not piping anything, just adding data from the CSV into the $details variable. Commented Sep 16, 2016 at 17:20
  • 1) If there are newlines in the data (in addition to the newlines at the end of each row), then you probably need to set the TextFieldParser.HasFieldsEnclosedInQuotes Property to True. 2) Your second method will be storing all the data to be output in RAM, so it is likely to either run out of RAM or become very slow (due to paging to the hard drive) at larger file sizes. 3) Could you give us a few sample lines of the CSV file? Commented Sep 16, 2016 at 17:34
  • What do you know about your specific CSV format? Processing the raw text data without having to build PSCustomObjects or even arrays for every row would be an enormous reduction in overhead, but writing your own full CSV parser would take enough time to negate that. But if you can say anything with confidence like "all fields are quoted with double quotes", you might be able to build a good-enough parser to replace the separator and avoid changing values... Commented Sep 16, 2016 at 17:36

2 Answers


This is simple text processing, so the bottleneck should be disk read speed: roughly 1 second per 100 MB, or 10 seconds per 1 GB, for the OP's sample (repeated to the mentioned size) as measured here on an i7. The results would be worse for files with many/all small quoted fields.

The algorithm is simple:

  1. Read the file in big string chunks, e.g. 1 MB.
    This is much faster than reading millions of lines separated by CR/LF because:
    • fewer checks are performed, as we mostly look only for double quotes;
    • fewer iterations of our code are executed by the interpreter, which is slow.
  2. Find the next doublequote.
  3. Depending on the current $inQuotedField flag, decide whether the found double quote starts a quoted field (it should be preceded by , and, optionally, some spaces) or ends the current quoted field (it should be followed by any even number of double quotes, optionally spaces, then ,).
  4. Replace delimiters in the preceding span, or to the end of the 1 MB chunk if no quotes were found.

The code makes some reasonable assumptions, but it may fail to detect an escaped field if its double quote is followed or preceded by more than 3 spaces before/after the field delimiter. The checks wouldn't be too hard to add, and I might've missed some other edge case, but I'm not that interested.

$sourcePath = 'c:\path\file.csv'
$targetPath = 'd:\path\file2.csv'
$targetEncoding = [Text.UTF8Encoding]::new($false) # no BOM

$delim = [char]','
$newDelim = [char]'|'

$buf = [char[]]::new(1MB)
$sourceBase = [IO.FileStream]::new(
    $sourcePath,
    [IO.FileMode]::Open,
    [IO.FileAccess]::Read,
    [IO.FileShare]::Read,
    $buf.length,  # let OS prefetch the next chunk in background
    [IO.FileOptions]::SequentialScan)
$source = [IO.StreamReader]::new($sourceBase, $true) # autodetect encoding
$target = [IO.StreamWriter]::new($targetPath, $false, $targetEncoding, $buf.length)

$bufStart = 0
$bufPadding = 4
$inQuotedField = $false
$fieldBreak = [char[]]@($delim, "`r", "`n")
$out = [Text.StringBuilder]::new($buf.length)

while ($nRead = $source.Read($buf, $bufStart, $buf.length-$bufStart)) {
    $s = [string]::new($buf, 0, $nRead+$bufStart)
    $len = $s.length
    $pos = 0
    $out.Clear() >$null

    do {
        $iQuote = $s.IndexOf([char]'"', $pos)
        if ($inQuotedField) {
            $iDelim = if ($iQuote -ge 0) { $s.IndexOf($delim, $iQuote+1) }
            if ($iDelim -eq -1 -or $iQuote -le 0 -or $iQuote -ge $len - $bufPadding) {
                # no closing quote in buffer safezone
                $out.Append($s.Substring($pos, $len-$bufPadding-$pos)) >$null
                break
            }
            if ($s.Substring($iQuote, $iDelim-$iQuote+1) -match "^(""+)\s*$delim`$") {
                # even number of quotes are just quoted quotes
                $inQuotedField = $matches[1].length % 2 -eq 0
            }
            $out.Append($s.Substring($pos, $iDelim-$pos+1)) >$null
            $pos = $iDelim + 1
            continue
        }
        if ($iQuote -ge 0) {
            $iDelim = $s.LastIndexOfAny($fieldBreak, $iQuote)
            if (!$s.Substring($iDelim+1, $iQuote-$iDelim-1).Trim()) {
                $inQuotedField = $true
            }
            $replaced = $s.Substring($pos, $iQuote-$pos+1).Replace($delim, $newDelim)
        } elseif ($pos -gt 0) {
            $replaced = $s.Substring($pos).Replace($delim, $newDelim)
        } else {
            $replaced = $s.Replace($delim, $newDelim)
        }
        $out.Append($replaced) >$null
        $pos = $iQuote + 1
    } while ($iQuote -ge 0)

    $target.Write($out)

    $bufStart = 0
    for ($i = $out.length; $i -lt $s.length; $i++) {
        $buf[$bufStart++] = $buf[$i]
    }
}
if ($bufStart) { $target.Write($buf, 0, $bufStart) }
$source.Close()
$target.Close()
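
A quick way to sanity-check the result, assuming the converted file still follows normal CSV quoting rules, is to re-import a few rows with the new delimiter:

# Spot check: stream back the first few rows of the converted file.
Import-Csv -Path $targetPath -Delimiter '|' |
    Select-Object -First 5 |
    Format-Table -AutoSize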

1 Comment

Thanks for the example. I was able to use this, modify it a bit, and churn through my biggest files in seconds. :D

Still not what I would call fast, but this is considerably faster than what you have listed, using the -join operator:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    $line -join '|' | Add-Content C:\Temp\TestOutput.csv
}

That took a hair under 32 seconds to process a 20 MB file. At that rate, your 750 MB file would be done in under 20 minutes, and bigger files should go at about 26 minutes per gig.
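
Much of that remaining time is likely Add-Content reopening the output file for every row. A variant of the same loop that writes through a single StreamWriter instead (a sketch, not benchmarked against the timings above) would look like this:

$reader = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser $source
$reader.SetDelimiters(",")
$writer = New-Object System.IO.StreamWriter "C:\Temp\TestOutput.csv"

While(!$reader.EndOfData){
    $line = $reader.ReadFields()
    # Join the parsed fields with the new delimiter; StreamWriter buffers the writes.
    $writer.WriteLine(($line -join '|'))
}

$writer.Close()
$reader.Close()

Like the loop above, this drops any quotes around the fields, so it only applies if none of the values contain the new delimiter.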

