
I have large CSV files (50-500 MB each). Running complicated PowerShell commands on these takes forever and/or hits memory issues.

Processing the data requires grouping by common fields, say in ColumnA. So, assuming the data is already sorted by that column, if I split these files blindly (i.e. every x thousand lines), matching entries could still end up in different parts. There are thousands of different groups in ColumnA, so splitting every group into its own file would create too many files.

How can I split it into files of roughly 10,000 lines without breaking up the groups? E.g. rows 1-13 would be A1 in ColumnA, rows 14-17 would be A2, etc., and rows 9997-10012 would be A784. In this case I would want the first file to contain rows 1-10012 and the next one to start with row 10013.

Obviously I want to keep the entire rows (rather than just ColumnA), so that if I pasted all the resulting files back together, the result would be identical to the original file.

1 Comment
  • If it's easier to implement, then I don't mind 10,000 being the max rather than the min. So in the example above, the first file would be rows 1-9996. Commented Feb 7, 2013 at 16:32

4 Answers


Not tested. This assumes ColumnA is the first column and that it's plain comma-delimited data. You'll need to adjust the line that creates the regex to suit your data.

$count = 0
$tail = @()

# Grab the header line separately so it can be prepended to later parts
$header = Get-Content file.csv -TotalCount 1

Get-Content file.csv -ReadCount 1000 |
    ForEach-Object {
        # Add tail entries from the last batch to the beginning of this batch
        $newbatch = $tail + $_

        # Create a regex that matches the ColumnA value of the last entry in this batch
        $regex = '^' + [regex]::Escape(($newbatch[-1].Split(',')[0]))

        # Add the header if this is not the first part
        # (the first part already starts with the header line read from the file)
        if ($count) {
            $header | Set-Content "c:\somedir\filepart_$count"
        }

        # Extract everything that doesn't belong to the last (possibly incomplete) group
        $newbatch -notmatch $regex |
            Add-Content "c:\somedir\filepart_$count"

        # Extract tail entries (the last group) to add to the next batch
        $tail = @($newbatch -match $regex)

        # Increment the file counter
        $count++
    }

# Write whatever is left in the tail after the final batch
if ($tail) {
    $header | Set-Content "c:\somedir\filepart_$count"
    $tail | Add-Content "c:\somedir\filepart_$count"
}
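
As a rough sanity check (my addition, not part of the answer; it assumes the filepart naming and the c:\somedir location used above), the parts can be recombined and compared against the source file. Every part starts with the header line, so it is skipped when rebuilding:

# Hypothetical check: rebuild the data rows from the parts and compare counts with the original.
# Parts are sorted numerically, since filepart_10 would otherwise sort before filepart_2.
$originalRows = Get-Content file.csv | Select-Object -Skip 1
$rebuiltRows = Get-ChildItem 'c:\somedir\filepart_*' |
    Sort-Object { [int]($_.Name.Split('_')[-1]) } |
    ForEach-Object { Get-Content $_.FullName | Select-Object -Skip 1 }
"Original: $(@($originalRows).Count) data rows; rebuilt: $(@($rebuiltRows).Count) data rows"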

6 Comments

Very nice. You should, however, include the CSV header in all parts; at least I would've wanted that.
Agreed, it now works well with headers. Can you add (a pipe?) for sorting the input by ColumnA, which is actually the second column of the file? I have adjusted the [0] to [1], which works.
The problem states the assumption is that the data is already sorted by that column....
Thanks! ("I'm stealing that!" is high praise to a scripter)
That's neat! What strikes me as odd: having semicolon-delimited data, I would have expected to have to adjust this to $regex = '^' + [regex]::Escape(($newbatch[-1].split(';')[0])). However, I forgot that at first and it worked just fine with the comma... makes me wonder: what in God's name does the regex do in this case?

This is my attempt; it got messy. :-P It will load the whole file into memory while splitting it, but this is pure text. It should take less memory than imported objects, but still roughly the size of the file.

$filepath = "C:\Users\graimer\Desktop\file.csv"
$file = Get-Item $filepath
$content = Get-Content $file
$csvheader = $content[0]
$lines = $content.Count
$minlines = 10000
$filepart = 1

$start = 1

while ($start -le $lines - 1) {
    #Set minimum $end value (last line)
    if ($start + $minlines -le $lines - 1) { $end = $start + $minlines - 1 } else { $end = $lines - 1 }

    #Value to compare. ColA is the first column in my file = [0]. ColB is the second column = [1]
    $avalue = $content[$end].split(",")[0]
    #If not the last line in the file
    if ($end -ne $lines - 1) {
        #Increase $end by 1 while ColA is the same, without running past the end of the file
        while ($end -lt $lines -and $content[$end].split(",")[0] -eq $avalue) { $end++ }
        #Return to the last line with an equal ColA value
        $end--
    }
    #Create new csv-part
    $filename = $file.FullName.Replace($file.BaseName, ($file.BaseName + ".part$filepart"))
    @($csvheader, $content[$start..$end]) | Set-Content $filename

    #Fix counters
    $filepart++
    $start = $end + 1
}

file.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60
A5,1,10
A6,1,10
A7,1,10

Results (I used $minlines = 5):

file.part1.csv:

ColA,ColB,ColC
A1,1,10
A1,2,20
A1,3,30
A2,1,10
A2,2,20

file.part2.csv:

ColA,ColB,ColC
A3,1,10
A4,1,10
A4,2,20
A4,3,30
A4,4,40
A4,5,50
A4,6,60

file.part3.csv:

ColA,ColB,ColC
A5,1,10
A6,1,10
A7,1,10

2 Comments

Looks to be quite close, but the resulting files are twice as big as the original file and don't seem to be the same "CSV" as the original (when opening in Excel it doesn't pick up columns); it looks OK in Notepad, though. Also, could you amend the input so it sorts by ColumnA? Note that this isn't actually the first column of the file.
Fixed the file sizes. You have to replace the delimiter and column number to fit your needs; they are "," and 0 (first) in my sample (see both lines using split() and replace the values). You said the file would already be sorted. If you want to sort the data in THIS script, the best way is to import the objects from CSV. This will require more memory, which was your problem in the first place.
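
For illustration only, a minimal sketch of that Import-Csv pre-sort idea (not part of the answer above; the column name ColA and the file path are taken from the sample, and this does load the whole file as objects):

# Pre-sort the sample file by ColA before splitting (memory-hungry: imports everything as objects)
Import-Csv "C:\Users\graimer\Desktop\file.csv" |
    Sort-Object ColA |
    Export-Csv "C:\Users\graimer\Desktop\file.sorted.csv" -NoTypeInformation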

This requires PowerShell v3 (due to -Append on Export-Csv).

Also, I'm assuming that you have column headers and the first column is named col1. Adjust as necessary.

Import-Csv MYFILE.csv | ForEach-Object { $_ | Export-Csv -NoTypeInformation -NoClobber -Append ($_.col1 + ".csv") }

This will create one file for each distinct value in the first column, with that value as the file name.

3 Comments

Thanks, but as outlined above, this would create far too many files.
It's not clear to me how the data is set up. Are you saying you want multiple groups per file, but not splitting a single group across multiple files?
I want multiple groups per file, but no file to be more than 10,000-ish rows; i.e. once it reaches 10,000, keep going with that group, but then stop and start the next file with the next group.
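
A minimal sketch of that behaviour in the same Import-Csv / Export-Csv style (my illustration, not part of this answer; it assumes the group column is named col1, requires PowerShell v3+ for -Append, and the per-row Export-Csv calls make it slower than the Get-Content approach in the accepted answer):

# Keep appending to the current part; roll over to a new part only when the
# 10,000-row threshold has been passed AND the group value changes.
$maxRows = 10000
$rowsInPart = 0
$part = 1
$prevGroup = $null
Import-Csv MYFILE.csv | ForEach-Object {
    if ($rowsInPart -ge $maxRows -and $_.col1 -ne $prevGroup) {
        $part++
        $rowsInPart = 0
    }
    $_ | Export-Csv -NoTypeInformation -Append -Path "part$part.csv"
    $rowsInPart++
    $prevGroup = $_.col1
}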

To complement the helpful answer from mjolinor: a reusable function with a few additional parameters that uses the steppable pipeline, which is about a factor of 8 faster:

function Split-Content {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory=$true)][String]$Path,
        [ULong]$HeadSize,
        [ValidateRange(1, [ULong]::MaxValue)][ULong]$DataSize = [ULong]::MaxValue,
        [Parameter(Mandatory=$true, ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true)]$Value
    )
    begin {
        $Header = [Collections.Generic.List[String]]::new()
        $DataCount = 0
        $PartNr = 1
    }
    Process {
        $ReadCount = 0
        while ($ReadCount -lt @($_).Count -and $Header.Count -lt $HeadSize) {
            if (@($_)[$ReadCount]) { $Header.Add(@($_)[$ReadCount]) }
            $ReadCount++
        }
        if ($ReadCount -lt @($_).Count -and $Header.Count -ge $HeadSize) {
            do {
                if ($DataCount -le 0) { # Should never be less
                    $FileInfo = [System.IO.FileInfo]$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
                    $FileName = $FileInfo.BaseName + $PartNr++ + $FileInfo.Extension
                    $LiteralPath = [System.IO.Path]::Combine($FileInfo.DirectoryName, $FileName)
                    $steppablePipeline = { Set-Content -LiteralPath $LiteralPath }.GetSteppablePipeline()
                    $steppablePipeline.Begin($PSCmdlet)
                    $steppablePipeline.Process($Header)
                }
                $Next = [math]::min($ReadCount + $DataSize - $DataCount, @($_).Count) # remaining capacity of this part, offset by lines already consumed from this batch
                if ($Next -gt $ReadCount) { $steppablePipeline.Process(@($_)[$ReadCount..($Next - 1)]) }
                $DataCount = ($DataCount + $Next - $ReadCount) % $DataSize
                if ($DataCount -le 0) { $steppablePipeline.End() }
                $ReadCount = $Next % @($_).Count
            } while ($ReadCount)
        } 
    }
    End {
        if ($steppablePipeline) { $steppablePipeline.End() }
    }
}

Parameters

Value
Specifies the listed content lines to be broken into parts. Multiple lines sent through the pipeline at a time (i.e. sub-arrays like Object[]) will also be passed to the output file at a time (assuming this fits the -DataSize).

Path
Specifies a path to one or more locations. Each filename in the location is suffixed with a part number (starting with 1).

HeadSize
Specifies the number of header lines that will be taken from the input and prepended to each file part. The default is 0, meaning no header lines are copied.

DataSize
Specifies the number of lines that will be successively taken from the input (after the header) as data and written to each file part. The default is [ULong]::MaxValue, which basically means that all data is copied to a single file.

Example 1:

Get-Content -ReadCount 1000 .\Test.Csv |Split-Content -Path .\Part.Csv -HeadSize 1 -DataSize 10000

This will split the .\Test.Csv file into chunks of CSV files with 10000 rows each, each preceded by the header line.

Note that the performance of this Split-Content function depends heavily on the -ReadCount of the preceding Get-Content cmdlet.
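
To see that effect, something like the following could be used (an illustrative sketch; the file and sizes are those from Example 1, and timings are machine dependent):

# Compare a line-by-line feed with 1000-line batches
Measure-Command { Get-Content .\Test.Csv -ReadCount 1 | Split-Content -Path .\PartA.Csv -HeadSize 1 -DataSize 10000 }
Measure-Command { Get-Content .\Test.Csv -ReadCount 1000 | Split-Content -Path .\PartB.Csv -HeadSize 1 -DataSize 10000 }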

Example 2:

Get-Process |Out-String -Stream |Split-Content -Path .\Process.Txt -HeadSize 2 -DataSize 20

This will write chunks of 20 processes to the .\Process<PartNr>.Txt files, each preceded by the standard (2-line) header format:

 NPM(K)    PM(M)      WS(M)     CPU(s)      Id  SI ProcessName
 ------    -----      -----     ------      --  -- -----------
 ... # 20 rows following
