
I have a file of 400+ GB like:

ID        Data ... (4000+ columns)
001 dsa
002 Data
… …
17201297 asdfghjkl

I want to split the file into per-ID chunks for faster data retrieval, like this (a rough sketch of the mapping follows the list):

mylocation/0/0/1/data.json
mylocation/0/0/2/data.json
.....
mylocation/1/7/2/0/1/2/9/7/data.json
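
In other words, every digit of the ID becomes one directory level. Roughly like this (pathForId is just an illustrative name, not part of my real code):

static Path pathForId(String baseDir, String id) {
    StringBuilder sb = new StringBuilder(baseDir);
    for (char digit : id.toCharArray()) { // each digit becomes one directory level
        sb.append('/').append(digit);
    }
    return Paths.get(sb.toString(), "data.json");
}
// pathForId("mylocation", "001")      -> mylocation/0/0/1/data.json
// pathForId("mylocation", "17201297") -> mylocation/1/7/2/0/1/2/9/7/data.json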

My code works, but whichever writer I use (closing them all after the loop ends), it takes at least 159,206 milliseconds for every 0.001% of the file-creation work.

In that case, could multithreading be an option to reduce the running time (for example, writing 100 or 1,000 files at a time)?

My current code is:

// Needs: java.io.*, java.nio.file.*, java.util.LinkedHashMap,
// java.util.concurrent.TimeUnit. fileLocation, fileName, generatedFile,
// chrNo, directory, pos, stopwatch (e.g. a started Guava Stopwatch) and
// DBFileMaker(...) are defined elsewhere in my class.
int percent = 0;
File file = new File(fileLocation + fileName);
FileReader fileReader = new FileReader(file); // to read the input file

BufferedReader bufReader = new BufferedReader(fileReader);
BufferedWriter fw = null;
LinkedHashMap<String, BufferedWriter> fileMap = new LinkedHashMap<>();
int dataCounter = 0;
String theline;

while ((theline = bufReader.readLine()) != null) {
    String generatedFilename = generatedFile + chrNo + "/" + directory + "gnomeV3.json";
    Path generatedJsonFilePath = Paths.get(generatedFilename);
    if (!Files.exists(generatedJsonFilePath)) { // create the directory and file once
        Files.createDirectories(generatedJsonFilePath.getParent());
        Files.createFile(generatedJsonFilePath);
    }
    String jsonData = DBFileMaker(chrNo, theline, pos);
    if (fileMap.containsKey(generatedFilename)) { // reuse an already open writer
        fw = fileMap.get(generatedFilename);
        fw.write(jsonData);
    } else { // first write to this file: open a writer and cache it
        fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(generatedFilename)));
        fw.write(jsonData);
        fileMap.put(generatedFilename, fw);
    }
    if (dataCounter == 172 * percent) { // 172 rows ~ 0.001% of my known row count
        long millisec = stopwatch.elapsed(TimeUnit.MILLISECONDS);
        System.out.println("Upto: " + pos + " as " + (Double) (0.001 * percent)
                + "% completion successful." + " took: " + millisec + " milliseconds");
        percent++;
    }
    dataCounter++;
}
for (BufferedWriter generatedFiles : fileMap.values()) {
    generatedFiles.close();
}

1 Answer


That really depends on your storage. Magnetic disks really like sequential writes, so multithreading would probably hurt their performance. SSDs, however, may benefit from multithreaded writing.

What you should do is one of two things. Either split the work across two threads, where one thread creates the buffers of data to be written and the second thread only writes them to disk. That way the disk always stays busy instead of waiting for more data to be generated (sketched below).
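
A minimal sketch of that producer/consumer split, using a bounded BlockingQueue (the WriteTask class and all names here are illustrative, not taken from your code):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class WriteTask {
    final Path path;
    final String json;
    WriteTask(Path path, String json) { this.path = path; this.json = json; }
}

public class TwoThreadWriter {
    private static final WriteTask DONE = new WriteTask(null, null); // poison pill

    public static void main(String[] args) throws Exception {
        BlockingQueue<WriteTask> queue = new ArrayBlockingQueue<>(1024);

        // Consumer: the only thread that touches the disk.
        Thread writer = new Thread(() -> {
            try {
                WriteTask task;
                while ((task = queue.take()) != DONE) {
                    Files.createDirectories(task.path.getParent());
                    Files.write(task.path, task.json.getBytes(StandardCharsets.UTF_8),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                }
            } catch (InterruptedException | IOException e) {
                e.printStackTrace();
            }
        });
        writer.start();

        // Producer: stands in for the read loop that builds JSON per line.
        for (int id = 1; id <= 3; id++) {
            queue.put(new WriteTask(Paths.get("mylocation", id + ".json"),
                    "{\"id\":" + id + "}\n"));
        }
        queue.put(DONE); // signal the writer to finish
        writer.join();
    }
}

The bounded queue is the important part: if parsing outruns the disk, queue.put blocks instead of letting buffers pile up in memory.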

Or keep a single thread that generates the buffers, but use java.nio to write each buffer asynchronously while it goes on to generate the next one (also sketched below).
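
A sketch of that variant with AsynchronousFileChannel from java.nio (the file name and contents are made up for the demo):

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Future;

public class AsyncWriter {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("mylocation", "demo.json");
        Files.createDirectories(path.getParent());

        try (AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                path, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {

            ByteBuffer buf = ByteBuffer.wrap("{\"id\":1}".getBytes(StandardCharsets.UTF_8));

            // The write starts in the background; the Future completes when it is done.
            Future<Integer> pending = channel.write(buf, 0L);

            // ...generate the next buffer here while the previous write is in flight...

            System.out.println("Wrote " + pending.get() + " bytes");
        }
    }
}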
