I have really large tab-delimited files (10GB-70GB) and need to read them, do some data manipulation, and write the results to a separate file. The files range from 100 to 10K columns and from 2 million to 5 million rows.
The first x columns are static and required for reference. Sample file format:
#ProductName Brand Customer1 Customer2 Customer3
Corolla Toyota Y N Y
Accord Honda Y Y N
Civic Honda 0 1 1
I need to use the first 2 columns to get a product id, then generate an output file similar to:
ProductID1 Customer1 Y
ProductID1 Customer2 N
ProductID1 Customer3 Y
ProductID2 Customer1 Y
ProductID2 Customer2 Y
ProductID2 Customer3 N
ProductID3 Customer1 N
ProductID3 Customer2 Y
ProductID3 Customer3 Y
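For reference, the per-cell normalization implied by the samples (a `0`/`1` in the input becomes `N`/`Y` in the output, while existing `Y`/`N` values pass through) could be sketched like this; `parser` is a hypothetical stand-in for the helper the sample code calls:

```scala
// Hypothetical sketch of the cell parser implied by the samples:
// "0" and "1" are normalized to "N"/"Y"; anything else passes through.
def parser(cell: String): String = cell match {
  case "0"   => "N"
  case "1"   => "Y"
  case other => other // already "Y" or "N"
}
```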
Current sample code:
val fileNameAbsPath = filePath + fileName
val outputFile = new PrintWriter(filePath + outputFileName)
var customerList = Array[String]()
for (line <- scala.io.Source.fromFile(fileNameAbsPath).getLines()) {
  if (line.startsWith("#")) {
    customerList = line.split("\t")
  } else {
    val cols = line.split("\t")
    val productid = getProductID(cols(0), cols(1))
    for (i <- 2 until cols.length) {
      val rowOutput = productid + "\t" + customerList(i) + "\t" + parser(cols(i))
      outputFile.println(rowOutput)
      outputFile.flush()
    }
  }
}
outputFile.close()
One of the tests I ran took about 12 hours to process a 70GB file with 3 million rows and 2500 columns. The generated output file was about 250GB with 800+ million rows.
My question is: is there anything in Scala other than what I'm already doing that can offer quicker performance?
First, move the `if` clause dealing with the header line out of the `for` loop. There is no need to perform that check for every line if you know the header line appears only once. Second, unless you really want to make sure you don't miss any writes, calling `flush` after every write will slow down performance; I'd begin by using a `BufferedWriter` along with the `PrintWriter` and let them take care of flushing the dirty bits.
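Putting both suggestions together, a sketch might look like the following. `getProductID` and `parser` are hypothetical stand-ins for the question's helpers, and the buffer size is an assumption; the point is that the header is consumed once before the loop and the `BufferedWriter` batches writes so no per-row `flush` is needed:

```scala
import java.io.{BufferedWriter, PrintWriter, Writer}

object WideToLong {
  // Hypothetical stand-ins for the helpers used in the question.
  def getProductID(name: String, brand: String): String = s"$name|$brand"
  def parser(cell: String): String =
    if (cell == "0") "N" else if (cell == "1") "Y" else cell

  // Reads the header once before the loop (no startsWith check per line)
  // and wraps the sink in a BufferedWriter so flushing happens in large
  // chunks instead of once per output row.
  def convert(lines: Iterator[String], sink: Writer): Unit = {
    val out = new PrintWriter(new BufferedWriter(sink, 1 << 20)) // 1MB buffer (assumption)
    try {
      val customerList = lines.next().split("\t") // header line, consumed once
      for (line <- lines) {
        val cols = line.split("\t")
        val productId = getProductID(cols(0), cols(1))
        var i = 2
        while (i < cols.length) { // while avoids allocating a Range per row
          out.print(productId); out.print('\t')
          out.print(customerList(i)); out.print('\t')
          out.println(parser(cols(i)))
          i += 1
        }
      }
    } finally out.close() // close() flushes whatever is still buffered
  }
}
```

For a real file the sink would be something like `new FileWriter(filePath + outputFileName)`; the try/finally guarantees the buffer is flushed exactly once at the end even if processing fails partway through.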