
I have a large string in a file (it's encoded data, in my own custom encoding) and I want to read it and process it into my special format (decode it). I want to know what's the fastest way I can do it to get the final format. I thought of some ways but I'm not sure which would be best.

1) Read the entire string in one go and then process that string.

2) Read character by character from the file and process while I am reading.

Can anyone help? Thanks

  • Are you saying both methods are basically the same thing? But won't the first method take more memory? Commented Aug 15, 2015 at 19:12
  • I am going to turn my comment into an answer because it will get too complicated for the comment section. Commented Aug 15, 2015 at 19:15
  • I think that you can also check for the equivalent of mmap in java. Commented Aug 15, 2015 at 19:29
  • that depends a lot on your actual format. Commented Aug 15, 2015 at 19:34
  • also, define large file Commented Aug 15, 2015 at 19:35

3 Answers


Chances are the process will be IO-bound, not CPU-bound, so it probably won't matter much; if it does, it will be because of the decode function, which isn't given in the question.

In theory you have two trade-off situations, which will determine whether (1) or (2) is faster.

The assumption is that the decode is fast, so your process will be IO-bound.

If, by reading the whole file into memory at once, you are doing less context switching, then you will waste fewer CPU cycles on those context switches, and reading the whole file will be faster.

If, by reading the file char by char, you don't prematurely yield your time slice, then in theory you could use the CPU cycles spent waiting on IO to run the decode, and reading char by char will be faster.

Here are some timelines:

read char by char good case

TIME    -------------------------------------------->
IO:     READ CHAR --> wait -->   READ CHAR --> wait 
DECODE: wait ------> DECODE --> wait --->  DECODE ...

read char by char bad case

TIME    -------------------------------------------->
IO:     READ CHAR --> YIELD          -->  READ CHAR --> wait 
DECODE: wait ------>  YIELD          --> DECODE --->  wait DECODE ---> ...

read whole file

TIME    -------------------------------------------->
IO:     READ CHAR .....  READ CHAR --> FINISH
DECODE: -----------------------------> DECODE --->

If your decode were really slow, then a producer/consumer model would probably be faster. Your best bet is to use a BufferedReader, which will do as much IO as it can while wasting/yielding the fewest CPU cycles.
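
To illustrate the producer/consumer idea, here is a minimal sketch; the file name encoded.txt and the per-character decode() are placeholders for your actual file and encoding. One thread reads chunks through a BufferedReader while another decodes them, so the decode can run during IO waits.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedDecode {
    private static final char[] POISON = new char[0]; // end-of-input marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<char[]> queue = new ArrayBlockingQueue<>(16);

        // Producer: reads chunks from the file and hands them to the decoder.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("encoded.txt"))) {
                char[] buf = new char[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    char[] chunk = new char[n];
                    System.arraycopy(buf, 0, chunk, 0, n);
                    queue.put(chunk);
                }
                queue.put(POISON);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        // Consumer: decodes chunks while the reader keeps doing IO.
        StringBuilder decoded = new StringBuilder();
        char[] chunk;
        while ((chunk = queue.take()) != POISON) {
            for (char c : chunk) {
                decoded.append(decode(c)); // placeholder for your real decode step
            }
        }
        reader.join();
        System.out.println("Decoded length: " + decoded.length());
    }

    // Hypothetical per-character decode; substitute your actual logic.
    private static char decode(char c) {
        return c;
    }
}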


3 Comments

What if different Java programs try to read from the same file at the same time?
Depends on the file system of course, but generally speaking that won't speed up anything. You will still be IO bound.
I think the concurrent file reads are a functional requirement rather than an attempt to read faster - @omega can you confirm?

It's fine to use a BufferedReader or BufferedInputStream and then process character by character; the buffer will read in multiple characters at a time transparently. This should give good enough performance for typical requirements.

Reading the whole string at once is called "slurping" and, given the memory overhead, is generally considered a last resort for file processing. If you are processing the in-memory string character by character anyway, it may not even have a detectable speed benefit, since all you are doing is maintaining your own (very large) buffer.

With a BufferedReader or BufferedInputStream you can adjust the buffer size so it can be large if really necessary.
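
For example, a sketch of option (2) with an explicitly sized buffer might look like the following; encoded.txt and the per-character decode() are placeholders for your actual file and encoding.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferedCharDecode {
    public static void main(String[] args) throws IOException {
        StringBuilder decoded = new StringBuilder();
        // Second constructor argument is the buffer size; 1 MB here, tune as needed.
        try (BufferedReader in = new BufferedReader(new FileReader("encoded.txt"), 1 << 20)) {
            int c;
            while ((c = in.read()) != -1) {       // usually served from the buffer, not the disk
                decoded.append(decode((char) c)); // placeholder for your real decoding
            }
        }
        System.out.println("Decoded length: " + decoded.length());
    }

    // Hypothetical per-character decode; substitute your actual logic.
    private static char decode(char c) {
        return c;
    }
}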

Given your file size (20-30 MB), note also that a Java char is 16 bits, so depending on the encoding of the file, for an ASCII text file or a UTF-8 file with few extended characters you must allow for roughly double the file size in memory on typical JVM implementations (a 30 MB ASCII file becomes about 60 MB of char data).

2 Comments

But actually different instances of the java program will run and a bunch can try to access the same file at the same time. Does this change your opinion?
Access for read only? No; either way the OS will likely cache and you won't see a difference. I've added that you can adjust the buffer size, even to be equivalent to loading in the whole string. BufferedReader/BufferedInputStream lets you change your mind without a redesign; you just tune one number.

It depends on the decode processing.

If you can parallelize it, you might consider a map/reduce approach. Break the file contents into separate map steps and combine them to get the final result in the reduce step.

Most machines have multiple cores. If there's no communication required between processors, you can reduce the processing time to roughly 1/N of the original if you have N cores. You'll really have something if you have GPUs you can leverage.
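
If your encoding really can be decoded chunk by chunk with no state crossing chunk boundaries (which depends entirely on your format), a rough sketch of this idea with parallel streams might look like the following; encoded.txt and decodeChunk() are illustrative placeholders.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelDecode {
    public static void main(String[] args) throws IOException {
        String content = new String(Files.readAllBytes(Paths.get("encoded.txt")),
                                    StandardCharsets.UTF_8);

        // Map: split the content into fixed-size chunks.
        int chunkSize = 1 << 20;
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < content.length(); i += chunkSize) {
            chunks.add(content.substring(i, Math.min(i + chunkSize, content.length())));
        }

        // Decode chunks on all available cores, then reduce by concatenating in order.
        String decoded = chunks.parallelStream()
                               .map(ParallelDecode::decodeChunk)
                               .collect(Collectors.joining());

        System.out.println("Decoded length: " + decoded.length());
    }

    // Hypothetical chunk decoder; replace with your actual decoding logic.
    private static String decodeChunk(String chunk) {
        return chunk;
    }
}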

2 Comments

You can only reduce the time to 1/N if there are no "strictly serial" parts to the algorithm. You can't assume that the IO is not serial; in fact, on a file system like Windows's (not sure about 10) it is basically all serial even if you chunk it. See en.wikipedia.org/wiki/Amdahl%27s_law
I'd duplicate the file in something like the Hadoop file system and let each map step read its chunk.
