
I have a large log file with client-id as one of the fields in each log line. I would like to split this large log file into several files, grouped by client-id. So, if the original file has 10 lines with 10 unique client-ids, then at the end there will be 10 files with 1 line in each.

I am trying to do this in Scala and don't want to load the entire file into memory, so I load one line at a time using scala.io.Source.getLines(). That works nicely. But I don't have a good way to write the lines out into separate files one at a time. I can think of two options:

  1. Create a new PrintWriter backed by a BufferedWriter (Files.newBufferedWriter) for every line. This seems inefficient.

  2. Create a new PrintWriter backed by a BufferedWriter for every output file, hold on to these PrintWriters, keep writing to them until we have read all lines in the original log file, and then close them. This doesn't seem like a very functional way to do it in Scala.

Being new to Scala, I am not sure if there are other, better ways to accomplish something like this. Any thoughts or ideas are much appreciated.

  • Why is this tagged Java? Commented Apr 8, 2015 at 14:26
  • I think (independent of the language), you have to bite the bullet. Either you create one new writer per line, or you create one writer "per" output file - but that means that all those writers have to stay alive (I don't see a big problem with that, unless we are talking about so many output writers that your application runs out of OS file handles). Commented Apr 8, 2015 at 14:31
  • I agree with @Jägermeister. To add more out-of-the-box ideas, you could: 1.- sort the file by client-id first (maybe directly using linux sort), then you only need to read 1 file and write to 1 file each time. 2.- Use some sort of map-reduce platform like Spark to do this in 3 lines of code and forget about the details. Of course, it all depends on what your final goal is. Commented Apr 8, 2015 at 19:03
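Following the sort-first suggestion in the comment above: if the log has already been sorted by client-id (for example with Linux `sort`), only one output file ever needs to be open at a time. Here is a minimal sketch of that approach; the function name and the `sortedPath`-derived output file names are hypothetical, and it assumes client-id is the first space-separated field.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Assumes the input file is already sorted by client-id (first field),
// so lines for the same client-id are contiguous.
def splitSorted(sortedPath: String): Unit = {
  val source = Source.fromFile(sortedPath)
  var current: Option[(String, PrintWriter)] = None // the one open writer
  try {
    for (line <- source.getLines()) {
      val id = line.split(" ").head
      current match {
        case Some((`id`, w)) => w.println(line) // same client-id: keep writing
        case other =>
          other.foreach(_._2.close())           // client-id changed: close old writer
          val w = new PrintWriter(new File(s"${sortedPath}_$id"))
          w.println(line)
          current = Some((id, w))
      }
    }
  } finally {
    current.foreach(_._2.close())
    source.close()
  }
}
```

With this variant the number of open file handles stays constant regardless of how many distinct client-ids the log contains, at the cost of a sorting pass up front.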

1 Answer


You can do the second option in a pretty functional, idiomatic way in Scala. Keep track of all of your PrintWriters in a Map, and fold over the lines of the file:

import java.io._
import scala.io._

Source.fromFile(new File("/tmp/log")).getLines()
  .foldLeft(Map.empty[String, PrintWriter]) { case (printers, line) =>
    val id = line.split(" ").head
    // reuse the writer for this client-id, or open a new one the first time we see it
    val printer = printers.getOrElse(id, new PrintWriter(new File(s"/tmp/log_$id")))
    printer.println(line)
    printers.updated(id, printer)
  }
  .values.foreach(_.close())

In a production-level version you would probably want to wrap the I/O operations in a try (or Try) and keep track of failures that way, while still closing all the PrintWriters at the end.


2 Comments

This solution works for me. Is the "case" keyword required in the foldLeft? It seems to work without it too.
Yup -- you can omit the case here!
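To illustrate the point in this comment thread: foldLeft's operator is a plain two-argument function, so the pattern-matching case block is optional when you don't need to destructure anything. A minimal example with toy values:

```scala
// both forms are equivalent: foldLeft takes a (B, A) => B function,
// and `{ case (acc, s) => ... }` is just a pattern-matching anonymous function
val withCase    = List("a", "bb").foldLeft(0) { case (acc, s) => acc + s.length }
val withoutCase = List("a", "bb").foldLeft(0) { (acc, s) => acc + s.length }
// both yield 3
```

The case form only becomes necessary when you want to pattern-match on the accumulator or element, e.g. to destructure a tuple inside a tuple.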
