A simple solution would be to write a plain text file following some custom format.
This is an example for when your data is symmetric (i.e. all the inner Arrays at each level have the same size).
It first writes a header line with the size of each dimension and then one line per element.
def writeData(data: Data)(writeLine: String => Unit): Unit =
  val tensors = data.length
  val rows = if (tensors > 0) data.head.length else 0
  val columns = if (rows > 0) data.head.head.length else 0
  // Print header line.
  writeLine(s"${tensors},${rows},${columns}")
  // You may use while loops here if you prefer.
  data.foreach { row =>
    row.foreach { column =>
      column.foreach { value =>
        writeLine(value.toString)
      }
    }
  }
end writeData
You may use better names.
And then, the reading function is also pretty straightforward:
def readData(lines: Iterator[String]): Data =
  // We are assuming here the file has at least the header line.
  val header = lines.next().split(',')
  val tensors = header(0).toInt
  val rows = header(1).toInt
  val columns = header(2).toInt
  // You may use while loops here if you prefer.
  ArraySeq.fill(tensors, rows, columns) {
    // We are assuming here the file only has double values.
    lines.next().toDouble
  }
end readData
You may add error handling if you consider that important.
You can see the code running in Scastie.
It uses a List as an in-memory file just to show that it works; real code would use real I/O APIs, whichever you prefer: Java NIO, Java IO, Scala Source (which only works for reading), better-files, Akka, fs2, etc.
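For instance, a minimal sketch of plugging the text format into real files with Java NIO and Scala Source could look like this. Note that the Data alias is my assumption about your type, saveToFile and loadFromFile are hypothetical helper names, and the two core functions are repeated in a compact form only to keep the snippet self-contained:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.immutable.ArraySeq
import scala.io.Source
import scala.util.{Try, Using}

// Assumption: Data is a three-level immutable ArraySeq of Doubles.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

// Same text format as above: one header line, then one value per line.
def writeData(data: Data)(writeLine: String => Unit): Unit =
  val tensors = data.length
  val rows = if tensors > 0 then data.head.length else 0
  val columns = if rows > 0 then data.head.head.length else 0
  writeLine(s"$tensors,$rows,$columns")
  for matrix <- data; row <- matrix; value <- row do writeLine(value.toString)

def readData(lines: Iterator[String]): Data =
  val header = lines.next().split(',')
  ArraySeq.fill(header(0).toInt, header(1).toInt, header(2).toInt)(lines.next().toDouble)

// Hypothetical helper: connects writeData to a buffered file writer.
def saveToFile(data: Data, path: Path): Try[Unit] =
  Using(Files.newBufferedWriter(path, StandardCharsets.UTF_8)) { writer =>
    writeData(data) { line =>
      writer.write(line)
      writer.newLine()
    }
  }

// Hypothetical helper: feeds the lines of the file to readData.
// ArraySeq.fill is strict, so everything is read before the file is closed.
def loadFromFile(path: Path): Try[Data] =
  Using(Source.fromFile(path.toFile, "UTF-8")) { source =>
    readData(source.getLines())
  }
```

Using closes the file for you and wraps any exception in a Try, so a call like loadFromFile(path) gives you either the Data or the failure to inspect.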
Another thing to consider would be to use Bytes directly rather than Strings, which would save a lot of encoding and decoding time and usually file size, at the expense of the file not being human readable.
Since Ints and Doubles have fixed sizes (4 and 8 bytes, respectively), you may write the first three Ints and then all the Doubles into a binary file. Then, when reading, rather than an Iterator[String] you would have an Iterator[Byte], and you would use ByteBuffer to make the conversions:
import java.nio.ByteBuffer

def writeData(data: Data)(writeByte: Byte => Unit): Unit =
  val tensors = data.length
  val rows = if (tensors > 0) data.head.length else 0
  val columns = if (rows > 0) data.head.head.length else 0
  def writeInt(value: Int): Unit =
    ByteBuffer.allocate(4).putInt(value).array.foreach(writeByte)
  end writeInt
  // Header: the three dimensions as 4-byte Ints.
  writeInt(tensors)
  writeInt(rows)
  writeInt(columns)
  def writeDouble(value: Double): Unit =
    ByteBuffer.allocate(8).putDouble(value).array.foreach(writeByte)
  end writeDouble
  data.foreach { row =>
    row.foreach { column =>
      column.foreach(writeDouble)
    }
  }
end writeData

def readData(bytes: Iterator[Byte]): Data =
  def readInt(): Int =
    val bb = ByteBuffer.allocate(4)
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.getInt(0)
  end readInt
  val tensors = readInt()
  val rows = readInt()
  val columns = readInt()
  def readDouble(): Double =
    val bb = ByteBuffer.allocate(8)
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.put(bytes.next())
    bb.getDouble(0)
  end readDouble
  ArraySeq.fill(tensors, rows, columns)(readDouble())
end readData
You can see the code running in Scastie
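Because every value has a fixed size, you can also compute the total size up front and fill a single ByteBuffer instead of going byte by byte. This is only a sketch of that idea: the Data alias is my assumption about your type, toBytes, fromBytes and saveBinary are hypothetical names, and it keeps ByteBuffer's default big-endian order, so the layout is the same as in the code above:

```scala
import java.nio.ByteBuffer
import java.nio.file.{Files, Path}
import scala.collection.immutable.ArraySeq

// Assumption: Data is a three-level immutable ArraySeq of Doubles.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

// Hypothetical helper: encodes symmetric data into one byte array.
def toBytes(data: Data): Array[Byte] =
  val tensors = data.length
  val rows = if tensors > 0 then data.head.length else 0
  val columns = if rows > 0 then data.head.head.length else 0
  // 3 Ints of 4 bytes each, then one 8-byte Double per element.
  val buffer = ByteBuffer.allocate(12 + 8 * tensors * rows * columns)
  buffer.putInt(tensors).putInt(rows).putInt(columns)
  for matrix <- data; row <- matrix; value <- row do buffer.putDouble(value)
  buffer.array()

// Hypothetical helper: decodes what toBytes produced.
def fromBytes(bytes: Array[Byte]): Data =
  val buffer = ByteBuffer.wrap(bytes)
  val tensors = buffer.getInt()
  val rows = buffer.getInt()
  val columns = buffer.getInt()
  // Each getDouble() call advances the buffer position by 8 bytes.
  ArraySeq.fill(tensors, rows, columns)(buffer.getDouble())

// Hypothetical helper: writes the whole array to a file in one call.
def saveBinary(data: Data, path: Path): Unit =
  Files.write(path, toBytes(data))
  ()
```

Reading back is then fromBytes(Files.readAllBytes(path)). Keep in mind this holds the whole file in memory, so for very large data the streaming version above may be a better fit.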
If your data is not symmetric, you would need to adapt the code a little. Rather than a single header, you would need multiple "start section" lines, each saying how many elements come next. This version could even be made generic to write and read any number of nested Arrays.
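A rough sketch of that idea for the text format, fixed at three levels of nesting, could be the following. The Data alias is again my assumption, and writeJagged, readJagged and readSeq are hypothetical names; every sequence is preceded by a line holding its length:

```scala
import scala.collection.immutable.ArraySeq
import scala.reflect.ClassTag

// Assumption: Data is a three-level immutable ArraySeq of Doubles.
type Data = ArraySeq[ArraySeq[ArraySeq[Double]]]

// Writes a length line before each sequence, so inner sizes may differ.
def writeJagged(data: Data)(writeLine: String => Unit): Unit =
  writeLine(data.length.toString)
  data.foreach { matrix =>
    writeLine(matrix.length.toString)
    matrix.foreach { row =>
      writeLine(row.length.toString)
      row.foreach(value => writeLine(value.toString))
    }
  }

def readJagged(lines: Iterator[String]): Data =
  // Reads a length line, then evaluates readElement that many times.
  def readSeq[A: ClassTag](readElement: => A): ArraySeq[A] =
    ArraySeq.fill(lines.next().toInt)(readElement)
  readSeq(readSeq(readSeq(lines.next().toDouble)))
```

Since readSeq only needs to know how to read one element, nesting it deeper (or shallower) is just a matter of adding or removing calls, which is what would make a generic version possible.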
Finally, you may consider using a library like scodec to handle the binary data.