I have really large tab-delimited files (10GB-70GB) and need to read them, do some data manipulation, and write the results to a separate file. The files range from 100 to 10K columns and from 2 million to 5 million rows.
The first x columns are static and required for reference. Sample file format:
#ProductName Brand Customer1 Customer2 Customer3
Corolla Toyota Y N Y
Accord Honda Y Y N
Civic Honda 0 1 1
I need to use the first 2 columns to get a product id, then generate an output file similar to:
ProductID1 Customer1 Y
ProductID1 Customer2 N
ProductID1 Customer3 Y
ProductID2 Customer1 Y
ProductID2 Customer2 Y
ProductID2 Customer3 N
ProductID3 Customer1 N
ProductID3 Customer2 Y
ProductID3 Customer3 Y
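For reference, the per-cell normalization implied by the samples (a `0`/`1` in the input becomes `N`/`Y` in the output, while existing `Y`/`N` values pass through) could be sketched like this; `parser` is a hypothetical stand-in for the helper the sample code calls:

```scala
// Hypothetical sketch of the cell parser implied by the samples:
// "0" and "1" are normalized to "N"/"Y"; anything else passes through.
def parser(cell: String): String = cell match {
  case "0"   => "N"
  case "1"   => "Y"
  case other => other // already "Y" or "N"
}
```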
Current sample code:
val fileNameAbsPath = filePath + fileName
val outputFile = new PrintWriter(filePath + outputFileName)
var customerList = Array[String]()
for (line <- scala.io.Source.fromFile(fileNameAbsPath).getLines()) {
  if (line.startsWith("#")) {
    customerList = line.split("\t")
  } else {
    val cols = line.split("\t")
    val productid = getProductID(cols(0), cols(1))
    for (i <- 2 until cols.length) {
      val rowOutput = productid + "\t" + customerList(i) + "\t" + parser(cols(i))
      outputFile.println(rowOutput)
      outputFile.flush()
    }
  }
}
outputFile.close()
One of the tests I ran took about 12 hours to process a 70GB file with 3 million rows and 2500 columns. The generated output file was about 250GB with 800+ million rows.
My question is: is there anything in Scala other than what I'm already doing that can offer quicker performance?
First, move the `if` clause dealing with the header line out of the `for` loop. There is no need to perform that check for every line if you know the header line appears only once. Second, unless you really want to make sure you don't miss any writes, calling `flush` after every write will slow down performance; I'd begin by using a `BufferedWriter` along with the `PrintWriter` and let them take care of flushing the dirty bits.
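Putting both suggestions together, a sketch might look like the following. `getProductID` and `parser` are hypothetical stand-ins for the question's helpers, and the buffer size is an assumption; the point is that the header is consumed once before the loop and the `BufferedWriter` batches writes so no per-row `flush` is needed:

```scala
import java.io.{BufferedWriter, PrintWriter, Writer}

object WideToLong {
  // Hypothetical stand-ins for the helpers used in the question.
  def getProductID(name: String, brand: String): String = s"$name|$brand"
  def parser(cell: String): String =
    if (cell == "0") "N" else if (cell == "1") "Y" else cell

  // Reads the header once before the loop (no startsWith check per line)
  // and wraps the sink in a BufferedWriter so flushing happens in large
  // chunks instead of once per output row.
  def convert(lines: Iterator[String], sink: Writer): Unit = {
    val out = new PrintWriter(new BufferedWriter(sink, 1 << 20)) // 1MB buffer (assumption)
    try {
      val customerList = lines.next().split("\t") // header line, consumed once
      for (line <- lines) {
        val cols = line.split("\t")
        val productId = getProductID(cols(0), cols(1))
        var i = 2
        while (i < cols.length) { // while avoids allocating a Range per row
          out.print(productId); out.print('\t')
          out.print(customerList(i)); out.print('\t')
          out.println(parser(cols(i)))
          i += 1
        }
      }
    } finally out.close() // close() flushes whatever is still buffered
  }
}
```

For a real file the sink would be something like `new FileWriter(filePath + outputFileName)`; the try/finally guarantees the buffer is flushed exactly once at the end even if processing fails partway through.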