
I am looking to write to multiple files simultaneously. Together, the files will hold about 17 million lines of integers.

Currently, I am opening 5 files that can be written to (some will remain empty), and then I perform shifting calculations to get a multiplier for the integer and to decide which files to write to.

My code looks like:

//Make files directory 
File tempDir = new File("temp/test/txtFiles");
tempDir.mkdirs();

List<File> files = new ArrayList<>(); //Will hold the Files

List<FileWriter> writers = new ArrayList<>(); //Will hold fileWriter objects for all of the files 

File currTxtFile; //Used to create the files

//Create the files
//Unused files will be blank
for(int f = 0; f < 5; f++)
{
   currTxtFile = new File(tempDir, "file" + f + ".txt");
   currTxtFile.createNewFile();
   files.add(currTxtFile);
   FileWriter fw = new FileWriter(currTxtFile);
   writers.add(fw);
}

int[] multipliers = new int[5]; //will be used to calculate what to write to file
int[] fileNums = new int[5]; //will be used to know which file to write to
int start = 0; 

/**
An example of fileNums output would be {0,4,0,1,4} 
(i.e. write to file 0, then 4, then 0, then 1, then 4)

An example of multipliers output would be {100,10,5,1,2000} 
(i.e value uses 100 for file 0, then 10 for file 4, then 5 for file 0, then 1 for file 1, then 2000 for file 4)
*/


for(long c = 0; c < 16980000; c++)
{
  //Gets values for the multipliers and fileNums
  int numOfMultipliers = getMultiplier(start,multipliers,fileNums); 
  for(int j = 0; j < numOfMultipliers; j++) // NumOfMultipliers can range from 0-4 
  {
    int val = 30000000 * multipliers[j] + 20000000;
    writers.get(fileNums[j]).append(val + "\n");
  }
  start++;
}

for(FileWriter f: writers)
{
  f.close();
}

The code is currently taking quite a while to write to the files (more than 5 hours). This code was translated from C++, where the files would be written in about 10 minutes. How could I improve the code so that the output is written more quickly?

1 Answer

This is likely a flushing issue. In general, writing to multiple files is slower than writing to a single file, not faster. Think about it: a spinning disk doesn't have 5 separate write heads inside it. There's just the one, so writing to a spinning disk is fundamentally 'single threaded'. Trying to write to multiple files simultaneously can in fact be far slower, as the write head has to bounce around.

With modern SSDs it doesn't matter nearly as much, but there's still a bottleneck somewhere. It's either the disk or it isn't. There's nothing inherent about SSD design (for example, it doesn't have multiple pipelines or a whole bunch of CPUs to deal with incoming writes) that would make it faster if you write to multiple files simultaneously.

If the files exist each on a different volume, that's a different story, but from your code that's clearly not the case.

Thus, let's first get rid of this whole 'multiple files' thing. That either doesn't do anything, or makes things (significantly) slower.

So why is it slow in Java?

Because of block processing. You need to know how disks work, first.

SSDs

The memory in an SSD can't actually be written to. Instead, entire blocks can be wiped clean and only then can they be written to. That's the only way an SSD can store data: Obliterate an entire block, then write data to it.

If a single block is 64k and your code writes one integer at a time, each integer is about 10 bytes a pop. Your SSD will obliterate a block, write one integer and a newline to it, plus a lot of pointless further writes (it writes in blocks; it can't write anything smaller, that's just how it works), and then do the exact same thing about 6400 more times.

Instead, you'd want the SSD to wipe that block once and write 6400 integers into it in one go. The reason it doesn't work that way out of the box is that people trip over power cables. Trust me, the bank is not going to stand for this: if you pull some bills out of an ATM and then a crash happens, and the last couple of transactions were only being held in memory, waiting for a full block's worth of data before actually being written, oh dear. So if you WANT to flush that stuff to disk, the system will dutifully execute.

Spinning disks

The write head needs to move to the right position and wait for the right sector to spin round, and only then can it write. Even though CPUs are really fast, the disk keeps spinning; it can't stop on a dime. So in the very short time it takes for the Java code to supply the next integer, the disk spins past the write point, and it has to wait one full spin again. Much better to just send a much larger chunk of data to the disk controller so it can write it all in 'one spin', so to speak.

So how do I do that?

Simple. Use a BufferedWriter. This does exactly what you want: it buffers data for quite a while and only actually writes when it's convenient (the buffer is full), when you explicitly ask for it (call .flush() on it), or when you close the writer. The downside is that if someone trips over a power cable, the buffered data is gone, but presumably you don't mind: half of such a file is a problem no matter how much is there. Incomplete = useless.
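If you do need to keep the five separate files from the question, the same idea applies per file. What follows is only a minimal sketch under assumptions: the class name BufferedMultiFileDemo and the helper writeAll are hypothetical, and the fileNums/values arrays stand in for whatever getMultiplier produces. Each file gets its own BufferedWriter, so lines pile up in memory and reach the OS in large chunks.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BufferedMultiFileDemo {

    // Hypothetical helper: writes values[i] as one text line to file number fileNums[i].
    // Files are named file0.txt .. file<fileCount-1>.txt; unused files stay empty.
    static void writeAll(Path dir, int fileCount, int[] fileNums, int[] values) throws IOException {
        Files.createDirectories(dir);
        List<BufferedWriter> writers = new ArrayList<>();
        try {
            for (int f = 0; f < fileCount; f++) {
                // newBufferedWriter creates (or truncates) the file and buffers writes to it.
                writers.add(Files.newBufferedWriter(dir.resolve("file" + f + ".txt")));
            }
            for (int i = 0; i < values.length; i++) {
                BufferedWriter w = writers.get(fileNums[i]);
                w.write(Integer.toString(values[i]));
                w.write('\n');
            }
        } finally {
            // Closing flushes whatever is still sitting in each buffer.
            // Sketch only: if one close() throws, the remaining writers are not closed.
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("buffered-demo");
        writeAll(dir, 5, new int[] {0, 4, 0, 1, 4}, new int[] {100, 10, 5, 1, 2000});
        System.out.println(Files.readAllLines(dir.resolve("file0.txt"))); // prints [100, 5]
    }
}
```

The only change relative to the question's approach is the buffering wrapper; which file each value goes to is decided exactly as before.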

Can it be faster?

Certainly. You're storing a number like '123456789' in at least 10 bytes, and the CPU needs to do a conversion to turn it into the byte sequence [0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x0A]. It is much more efficient to store the bytes as they are in memory: that takes only 4 bytes, and no conversion is needed, or at least a simpler one. The downside is that you won't be able to make any sense of this file unless you use a hex editor.
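To make the size difference concrete, here is a small sketch that encodes the same number both ways in memory and compares the byte counts. The class name TextVsBinarySize and its two helpers are made up for illustration; only ByteArrayOutputStream, DataOutputStream.writeInt and String.getBytes are real JDK calls.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class TextVsBinarySize {

    // Text form: the decimal digits followed by a newline, one byte per character.
    static byte[] asText(int value) {
        return (value + "\n").getBytes(StandardCharsets.US_ASCII);
    }

    // Binary form: writeInt always emits exactly 4 bytes, high byte first.
    static byte[] asBinary(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("text bytes:   " + asText(123456789).length);   // 10
        System.out.println("binary bytes: " + asBinary(123456789).length); // 4
    }
}
```

Smaller numbers take fewer bytes in text form, so the saving depends on the values; the binary form is always 4 bytes per int.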

Example code - write integers in text form

  • Let's not use obsolete APIs.
  • Let's properly close resources.
  • Let's ditch this pointless 'multiple files' thing.
Path tgt = Paths.get("temp/test/txtFiles/out.txt");
try (var out = Files.newBufferedWriter(tgt)) {
  for (long c = 0; c < 16980000; c++) {
    //Gets values for the multipliers and fileNums
    int numOfMultipliers = getMultiplier(start, multipliers, fileNums); 
    for(int j = 0; j < numOfMultipliers; j++) { // NumOfMultipliers can range from 0-4 
      int val = 30000000 * multipliers[j] + 20000000;
      out.write(val + "\n");
    }
    start++;
  }
}

Example code - write ints directly

Path tgt = Paths.get("temp/test/txtFiles/out.txt");
try (var out = new DataOutputStream(
  new BufferedOutputStream(
  Files.newOutputStream(tgt)))) {

  for (long c = 0; c < 16980000; c++) {
    //Gets values for the multipliers and fileNums
    int numOfMultipliers = getMultiplier(start, multipliers, fileNums); 
    for(int j = 0; j < numOfMultipliers; j++) { // NumOfMultipliers can range from 0-4 
      int val = 30000000 * multipliers[j] + 20000000;
      out.writeInt(val);
    }
    start++;
  }
}
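
Since a file written this way is not human-readable, you will probably want a way to read it back. A rough sketch, assuming the file contains nothing but ints written with writeInt (the class BinaryIntRoundTrip and its methods are hypothetical names, not part of the code above):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class BinaryIntRoundTrip {

    // Writes each int as 4 raw bytes through a buffered stream.
    static void writeInts(Path file, int[] values) throws IOException {
        try (var out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int v : values) {
                out.writeInt(v);
            }
        }
    }

    // Reads the ints back. Assumes the file holds only 4-byte ints,
    // so the element count is simply the file size divided by 4.
    static int[] readInts(Path file) throws IOException {
        int count = (int) (Files.size(file) / Integer.BYTES);
        int[] result = new int[count];
        try (var in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < count; i++) {
                result[i] = in.readInt();
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("ints", ".bin");
        writeInts(file, new int[] {100, 10, 5, 1, 2000});
        System.out.println(Arrays.toString(readInts(file))); // prints [100, 10, 5, 1, 2000]
    }
}
```

Note that DataOutputStream and DataInputStream agree on byte order with each other; a reader written in another language would need to treat the data as big-endian.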