
I have code that takes too long to compute an Array[Array[Array[Double]]]. I need to reuse that same array in other calculations with different parameters, so it would be more convenient to save it to a file and import it later.

I could build one long string, putting every row after another, but I guess there is a more sophisticated way to do this. Ideally, I would just import the file and use the result as an Array[Array[Array[Double]]].

So I need your help with this. Thanks.

I've tried converting it into a long string and exporting it, but it was a mess.

  • Write it as JSON, or any serializable format? With maybe some compression. Do you need the file to be small or rather fast to load? Or both? What are the requirements? Commented May 14, 2024 at 17:14

3 Answers


You can use java.io.ObjectOutputStream:

Welcome to Scala 3.3.0 (17.0.5, Java Java HotSpot(TM) 64-Bit Server VM).
Type in expressions for evaluation. Or try :help.

scala> import java.io.{ObjectInputStream as OIS, ObjectOutputStream as OOS, FileOutputStream as FOS, FileInputStream as FIS}

scala> import scala.util.Using

scala> val values = Array.tabulate(2, 3, 4)((i, j, k) => i + j + k + 0.0)
val values: Array[Array[Array[Double]]] = Array(
                                          Array(Array(0.0, 1.0, 2.0, 3.0), 
                                          Array(1.0, 2.0, 3.0, 4.0), 
                                          Array(2.0, 3.0, 4.0, 5.0)), 
                                          Array(Array(1.0, 2.0, 3.0, 4.0),
                                          Array(2.0, 3.0, 4.0, 5.0), 
                                          Array(3.0, 4.0, 5.0, 6.0)))

scala> Using(OOS(FOS("D:/test.bin")))(_.writeObject(values))
val res0: scala.util.Try[Unit] = Success(())

scala> val results = Using(OIS(FIS("D:/test.bin")))(_.readObject()).get.asInstanceOf[Array[Array[Array[Double]]]]
val results: Array[Array[Array[Double]]] = Array(Array(
                                           Array(0.0, 1.0, 2.0, 3.0), 
                                           Array(1.0, 2.0, 3.0, 4.0), 
                                           Array(2.0, 3.0, 4.0, 5.0)), 
                                           Array(Array(1.0, 2.0, 3.0, 4.0), 
                                           Array(2.0, 3.0, 4.0, 5.0), 
                                           Array(3.0, 4.0, 5.0, 6.0)))
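
Outside the REPL, the same idea could be wrapped in two small helpers. This is only a sketch: the names Cube, saveCube and loadCube are made up for illustration, and it assumes the file was produced by saveCube on a compatible JVM.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.{Try, Using}

// Hypothetical alias for the type from the question.
type Cube = Array[Array[Array[Double]]]

// Serializes the nested array to `path`; Using closes the stream even on failure.
def saveCube(path: String, cube: Cube): Try[Unit] =
  Using(ObjectOutputStream(FileOutputStream(path)))(_.writeObject(cube))

// Reads the array back. The cast assumes the file was written by saveCube.
def loadCube(path: String): Try[Cube] =
  Using(ObjectInputStream(FileInputStream(path)))(_.readObject().asInstanceOf[Cube])
```

Both helpers return a Try, so the caller decides how to handle a missing or corrupt file, for example with loadCube("cube.bin").getOrElse(recompute()) where recompute is your slow calculation.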

It depends on which languages you will use for the other calculations. I would suggest two options:

  1. The HDF5 file format. It is very convenient for scientific calculations and supports n-dimensional arrays. Scala/Java libraries for it are easy to find.
  2. A binary format like protobuf or Avro. They also support arrays.

If you are not planning to use languages other than Scala, the most direct way to store arrays is plain JVM serialization, since Scala Arrays are Java arrays and are already Serializable.


A simple solution would be to write a plain text file following some custom format.

This is an example for when your data is symmetric (i.e. all the inner Arrays at the same depth have the same size).
It first writes a header line with the three dimensions and then one line per element.

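import scala.collection.immutable.ArraySeq

// Assumed alias: the snippets do not define Data, but readData below builds this type.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

def writeData(data: Data)(writeLine: String => Unit): Unit =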
  val tensors = data.length
  val rows = if (tensors > 0) data.head.length else 0
  val columns = if (rows > 0) data.head.head.length else 0

  // Print header line.
  writeLine(s"${tensors},${rows},${columns}")

  // You may use while loops here if you prefer.
  data.foreach { row =>
    row.foreach { column =>
      column.foreach { value =>
        writeLine(value.toString)
      }
    }
  }
end writeData

You may use better names.

And then, the reading function is also pretty straightforward:

def readData(lines: Iterator[String]): Data =
  // We are assuming here the file has at least the header line.
  val header = lines.next().split(',')
  val tensors = header(0).toInt
  val rows = header(1).toInt
  val columns = header(2).toInt

  // You may use while loops here if you prefer.
  ArraySeq.fill(tensors, rows, columns) {
    // We are assuming here the file only has double values.
    lines.next().toDouble
  }
end readData

You may add error handling if you consider that important.


You can see the code running in Scastie
It uses a List as an in-memory file just to show that it works. Real code would use real I/O APIs, whichever you prefer: Java NIO, Java IO, Scala Source (reading only), better-files, Akka, fs2, etc.
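
As one possible way to plug in real I/O, here is a sketch of the same header-plus-one-value-per-line format using java.nio.file.Files. The function names writeDataFile and readDataFile and the Data alias are assumptions for illustration, and error handling is left out as above.

```scala
import java.nio.file.{Files, Path}
import scala.collection.immutable.ArraySeq
import scala.jdk.CollectionConverters.*
import scala.util.Using

// Assumed alias for the data type used in this answer.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

def writeDataFile(data: Data, path: Path): Unit =
  val tensors = data.length
  val rows = if tensors > 0 then data.head.length else 0
  val columns = if rows > 0 then data.head.head.length else 0
  Using.resource(Files.newBufferedWriter(path)) { writer =>
    def writeLine(line: String): Unit =
      writer.write(line)
      writer.newLine()
    // Header line with the three dimensions, then one value per line.
    writeLine(s"$tensors,$rows,$columns")
    for matrix <- data; row <- matrix; value <- row do writeLine(value.toString)
  }

def readDataFile(path: Path): Data =
  Using.resource(Files.newBufferedReader(path)) { reader =>
    val lines = reader.lines().iterator().asScala
    // Assumes the file has a well-formed header and enough value lines.
    val header = lines.next().split(',')
    val tensors = header(0).toInt
    val rows = header(1).toInt
    val columns = header(2).toInt
    ArraySeq.fill(tensors, rows, columns)(lines.next().toDouble)
  }
```

Double.toString followed by toDouble gives back the same value, so the text file loses no precision; it is just larger and slower to parse than a binary one.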


Another thing to consider is using bytes directly rather than Strings, which would save a lot of encoding and decoding time and reduce the file size, at the expense of the file no longer being human readable.
Since Ints and Doubles have fixed sizes (4 and 8 bytes), you can write the three Ints first and then all the Doubles into a binary file. Then, when reading, you would have an Iterator[Byte] rather than an Iterator[String], and you would use ByteBuffer to make the conversions:

import java.nio.ByteBuffer

def writeData(data: Data)(writeByte: Byte => Unit): Unit =
  val tensors = data.length
  val rows = if (tensors > 0) data.head.length else 0
  val columns = if (rows > 0) data.head.head.length else 0

  def writeInt(value: Int): Unit =
    ByteBuffer.allocate(4).putInt(value).array.foreach(writeByte)
  end writeInt

  writeInt(tensors)
  writeInt(rows)
  writeInt(columns)

  def writeDouble(value: Double): Unit =
    ByteBuffer.allocate(8).putDouble(value).array.foreach(writeByte)
  end writeDouble
  
  data.foreach { row =>
    row.foreach { column =>
      column.foreach(writeDouble)
    }
  }
end writeData

def readData(bytes: Iterator[Byte]): Data =
  // Collects the next `size` bytes into a buffer.
  // We are assuming here the input still has that many bytes left.
  def readBuffer(size: Int): ByteBuffer =
    val bb = ByteBuffer.allocate(size)
    for _ <- 0 until size do bb.put(bytes.next())
    bb
  end readBuffer

  def readInt(): Int = readBuffer(4).getInt(0)

  val tensors = readInt()
  val rows = readInt()
  val columns = readInt()

  def readDouble(): Double = readBuffer(8).getDouble(0)

  ArraySeq.fill(tensors, rows, columns)(readDouble())
end readData

You can see the code running in Scastie
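
If you write to an actual file, java.io.DataOutputStream and DataInputStream already do the Int and Double conversions, using the same big-endian layout as ByteBuffer's default. A minimal sketch, with writeBinary, readBinary and the Data alias being names assumed here rather than part of the code above:

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}
import java.io.{FileInputStream, FileOutputStream}
import scala.collection.immutable.ArraySeq
import scala.util.Using

// Assumed alias for the data type used in this answer.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

def writeBinary(data: Data, path: String): Unit =
  val tensors = data.length
  val rows = if tensors > 0 then data.head.length else 0
  val columns = if rows > 0 then data.head.head.length else 0
  Using.resource(DataOutputStream(BufferedOutputStream(FileOutputStream(path)))) { out =>
    // Header: three 4-byte Ints, followed by 8 bytes per Double.
    out.writeInt(tensors)
    out.writeInt(rows)
    out.writeInt(columns)
    for matrix <- data; row <- matrix; value <- row do out.writeDouble(value)
  }

def readBinary(path: String): Data =
  Using.resource(DataInputStream(BufferedInputStream(FileInputStream(path)))) { input =>
    // Assumes the file was produced by writeBinary; a short file throws EOFException.
    val tensors = input.readInt()
    val rows = input.readInt()
    val columns = input.readInt()
    ArraySeq.fill(tensors, rows, columns)(input.readDouble())
  }
```

The buffered wrappers matter here: without them every writeDouble and readDouble would hit the underlying file stream directly.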


If your data is not symmetric (the inner Arrays differ in size), you would need to adapt the code a little bit. Rather than a single header, you would need multiple "start section" lines, each saying how many elements come next. This version could even be made generic to write and read any number of nested Arrays.
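
A rough sketch of that length-prefixed idea for the three-level case, in the same text format as the first example. The names writeJagged and readJagged and the Data alias are made up for illustration, and there is no error handling.

```scala
import scala.collection.immutable.ArraySeq

// Assumed alias for the data type used in this answer.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

def writeJagged(data: Data)(writeLine: String => Unit): Unit =
  // Every collection is preceded by a line holding its own length.
  writeLine(data.length.toString)
  data.foreach { matrix =>
    writeLine(matrix.length.toString)
    matrix.foreach { row =>
      writeLine(row.length.toString)
      row.foreach(value => writeLine(value.toString))
    }
  }
end writeJagged

def readJagged(lines: Iterator[String]): Data =
  // Assumes the lines were produced by writeJagged.
  def readCount(): Int = lines.next().toInt

  // Each fill reads its length first and then that many elements, in file order.
  ArraySeq.fill(readCount()) {
    ArraySeq.fill(readCount()) {
      ArraySeq.fill(readCount())(lines.next().toDouble)
    }
  }
end readJagged
```

The cost compared with the single header is one extra line per inner Array, which is usually negligible next to the values themselves.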


Finally, you may consider using a library like scodec to handle the binary data.
