1: I'm running into a problem trying to process a large text file (10 GB+).
The single-threaded solution is the following:
val writer = new PrintWriter(new File(output.getOrElse("output.txt")))
for (line <- scala.io.Source.fromFile(file.getOrElse("data.txt")).getLines()) {
  writer.println(DigestUtils.HMAC_SHA_256(line))
}
writer.close()
2: I tried concurrent processing using:
val futures = scala.io.Source.fromFile(file.getOrElse("data.txt")).getLines
  .map { s => Future { DigestUtils.HMAC_SHA_256(s) } }.to[Vector]
val results = futures.map { Await.result(_, 10000 seconds) }
This fails with a GC overhead limit exceeded error (see Appendix A for the stack trace).
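The stack trace points at the call that collects the iterator into a Vector, which suggests that every line (and its Future) ends up in memory at once. A minimal sketch of a batched variant that might avoid this is shown here. It is only an illustration under assumptions: `BatchedHashing`, `hashInBatches`, the batch size, and the timeout are hypothetical names and values, and `hash` is a plain SHA-256 stand-in for the `DigestUtils.HMAC_SHA_256` call above, not the real thing.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BatchedHashing {
  // Hypothetical stand-in for DigestUtils.HMAC_SHA_256 (plain SHA-256, hex encoded).
  def hash(line: String): String = {
    val md = java.security.MessageDigest.getInstance("SHA-256")
    md.digest(line.getBytes("UTF-8")).map("%02x".format(_)).mkString
  }

  // Pulls batchSize lines at a time from the iterator, hashes them concurrently,
  // and emits the results before the next batch is read. Only one batch of
  // lines and futures is held in memory at any time.
  def hashInBatches(lines: Iterator[String], batchSize: Int)(emit: String => Unit): Unit =
    lines.grouped(batchSize).foreach { batch =>
      val futures = batch.map(line => Future(hash(line)))
      // Future.sequence keeps the input order, so output order matches the file.
      Await.result(Future.sequence(futures), 10.minutes).foreach(emit)
    }
}
```

With the snippets above, this would presumably be called as `BatchedHashing.hashInBatches(source.getLines(), 10000)(writer.println)`, where `source` is the opened `scala.io.Source`. Waiting for each batch trades some parallelism for a fixed memory ceiling.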
3: I tried using Akka IO in combination with AsynchronousFileChannel, following https://github.com/drexin/akka-io-file. I am able to read the file in byte chunks using FileSlurp, but I have not found a way to read the file line by line, which is a requirement.
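To illustrate what reading by lines on top of byte chunks would have to do, here is a rough sketch of a line assembler that carries a partial line over from one chunk to the next. It assumes the chunks arrive in file order as `Array[Byte]` and that the file is UTF-8; `LineAssembler`, `feed`, and `flush` are hypothetical names and are not part of akka-io-file.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

// Accumulates raw byte chunks and returns the lines completed by each chunk.
// Bytes after the last newline stay buffered until a later chunk (or flush)
// completes them. Splitting on the newline byte is safe for UTF-8 because
// multi-byte sequences never contain 0x0A.
final class LineAssembler {
  private val pending = new ByteArrayOutputStream()

  // Feed the next chunk (assumed to arrive in file order).
  def feed(chunk: Array[Byte]): Vector[String] = {
    val lines = Vector.newBuilder[String]
    chunk.foreach { b =>
      if (b == '\n'.toByte) lines += decode()
      else pending.write(b.toInt)
    }
    lines.result()
  }

  // Call once after the last chunk to get a final line without a trailing newline.
  def flush(): Option[String] =
    if (pending.size() > 0) Some(decode()) else None

  // Decodes the buffered bytes as one line and drops a Windows-style "\r".
  private def decode(): String = {
    val s = new String(pending.toByteArray, StandardCharsets.UTF_8)
    pending.reset()
    s.stripSuffix("\r")
  }
}
```

Decoding only complete lines means a multi-byte character split across two chunks is never decoded in halves.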
Any help would be greatly appreciated. Thank you.
APPENDIX A
[error] (run-main) java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.CharBuffer.wrap(Unknown Source)
at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
at sun.nio.cs.StreamDecoder.read(Unknown Source)
at java.io.InputStreamReader.read(Unknown Source)
at java.io.BufferedReader.fill(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at java.io.BufferedReader.readLine(Unknown Source)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:716)
at scala.collection.immutable.VectorBuilder.$plus$plus$eq(Vector.scala:692)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at com.test.Twitterhashconcurrentcli$.doConcurrent(Twitterhashconcurrentcli.scala:35)
at com.test.Twitterhashconcurrentcli$delayedInit$body.apply(Twitterhashconcurrentcli.scala:62)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)