
I have a large string in a file (it's encoded data, in my own custom encoding) and I want to read it and process it into my special format (decode it). I want to know what's the fastest way I can do it to get the final format. I thought of some ways but I'm not sure which would be best.

1) Read the entire string in one go and then process that string.

2) Read character by character from the file and process while I am reading.

Can anyone help? Thanks

  • Are you saying both methods are basically the same thing? But won't the first method take more memory? Commented Aug 15, 2015 at 19:12
  • I am going to turn my comment into an answer because it will get too complicated for the comment section. Commented Aug 15, 2015 at 19:15
  • I think that you can also check for the equivalent of mmap in java. Commented Aug 15, 2015 at 19:29
  • that depends a lot on your actual format. Commented Aug 15, 2015 at 19:34
  • also, define large file Commented Aug 15, 2015 at 19:35

3 Answers


Chances are the process will be IO-bound, not CPU-bound, so it probably won't matter much; if it does, it will be because of the decode function, which isn't given in the question.

In theory you have two trade-off situations, which will determine whether (1) or (2) is faster.

The assumption is that the decode is fast, so your process will be IO-bound.

If, by reading the whole file into memory at once, you are doing less context switching, then you will waste fewer CPU cycles on those context switches, and reading the whole file will be faster.

If, by reading the file char by char, you don't prematurely yield your time slice, then in theory you could use the CPU cycles spent waiting on IO to run the decode, and reading char by char will be faster.

Here are some timelines:

read char by char good case

TIME    -------------------------------------------->
IO:     READ CHAR --> wait -->   READ CHAR --> wait 
DECODE: wait ------> DECODE --> wait --->  DECODE ...

read char by char bad case

TIME    -------------------------------------------->
IO:     READ CHAR --> YIELD          -->  READ CHAR --> wait 
DECODE: wait ------>  YIELD          --> DECODE --->  wait DECODE ---> ...

read whole file

TIME    -------------------------------------------->
IO:     READ CHAR .....  READ CHAR --> FINISH
DECODE: -----------------------------> DECODE --->

If your decode were really slow, then a producer/consumer model would probably be faster. Your best bet is to use a BufferedReader, which will do as much IO as it can while wasting/yielding the fewest CPU cycles.
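
To illustrate the producer/consumer idea, here is a minimal sketch; the file name encoded.txt and the per-character decode() are placeholders for your actual file and encoding. One thread reads chunks through a BufferedReader while another decodes them, so the decode can run during IO waits.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedDecode {
    private static final char[] POISON = new char[0]; // end-of-input marker

    public static void main(String[] args) throws Exception {
        BlockingQueue<char[]> queue = new ArrayBlockingQueue<>(16);

        // Producer: reads chunks from the file and hands them to the decoder.
        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("encoded.txt"))) {
                char[] buf = new char[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    char[] chunk = new char[n];
                    System.arraycopy(buf, 0, chunk, 0, n);
                    queue.put(chunk);
                }
                queue.put(POISON);
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        reader.start();

        // Consumer: decodes chunks while the reader keeps doing IO.
        StringBuilder decoded = new StringBuilder();
        char[] chunk;
        while ((chunk = queue.take()) != POISON) {
            for (char c : chunk) {
                decoded.append(decode(c)); // placeholder for your real decode step
            }
        }
        reader.join();
        System.out.println("Decoded length: " + decoded.length());
    }

    // Hypothetical per-character decode; substitute your actual logic.
    private static char decode(char c) {
        return c;
    }
}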


3 Comments

What if different Java programs try to read from the same file at the same time?
Depends on the file system of course, but generally speaking that won't speed up anything. You will still be IO bound.
I think the concurrent file reads are a functional requirement rather than an attempt to read faster - @omega can you confirm?

It's fine to use a BufferedReader or BufferedInputStream and then process character by character; the buffer will read in multiple characters at a time transparently. This should give good enough performance for typical requirements.

Reading the whole string at once is called "slurping" and, given the memory overhead, is generally considered a last resort for file processing. If you are processing the in-memory string character by character anyway, it may not even have a detectable speed benefit, since all you are doing is maintaining your own (very large) buffer.

With a BufferedReader or BufferedInputStream you can adjust the buffer size so it can be large if really necessary.
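
For example, a sketch of option (2) with an explicitly sized buffer might look like the following; encoded.txt and the per-character decode() are placeholders for your actual file and encoding.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BufferedCharDecode {
    public static void main(String[] args) throws IOException {
        StringBuilder decoded = new StringBuilder();
        // Second constructor argument is the buffer size; 1 MB here, tune as needed.
        try (BufferedReader in = new BufferedReader(new FileReader("encoded.txt"), 1 << 20)) {
            int c;
            while ((c = in.read()) != -1) {       // usually served from the buffer, not the disk
                decoded.append(decode((char) c)); // placeholder for your real decoding
            }
        }
        System.out.println("Decoded length: " + decoded.length());
    }

    // Hypothetical per-character decode; substitute your actual logic.
    private static char decode(char c) {
        return c;
    }
}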

Given your file size (20-30 MB), note also that a Java char is 16 bits, so depending on the encoding of the file, for an ASCII text file or a UTF-8 file with few extended characters you must allow for roughly double the file size in memory on typical JVM implementations (a 30 MB ASCII file becomes about 60 MB of char data).

2 Comments

But actually different instances of the java program will run and a bunch can try to access the same file at the same time. Does this change your opinion?
Access for read only? No; either way the OS will likely cache and you won't see a difference. I've added that you can adjust the buffer size, even to be equivalent to loading in the whole string. BufferedReader/BufferedInputStream lets you change your mind without a redesign; you just tune one number.

It depends on the decode processing.

If you can parallelize it, you might consider a map/reduce approach. Break the file contents into separate map steps and combine them to get the final result in the reduce step.

Most machines have multiple cores. If there's no communication required between processors, you can reduce the processing time to roughly 1/N of the original if you have N cores. You'll really have something if you have GPUs you can leverage.
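
If your encoding really can be decoded chunk by chunk with no state crossing chunk boundaries (which depends entirely on your format), a rough sketch of this idea with parallel streams might look like the following; encoded.txt and decodeChunk() are illustrative placeholders.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelDecode {
    public static void main(String[] args) throws IOException {
        String content = new String(Files.readAllBytes(Paths.get("encoded.txt")),
                                    StandardCharsets.UTF_8);

        // Map: split the content into fixed-size chunks.
        int chunkSize = 1 << 20;
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < content.length(); i += chunkSize) {
            chunks.add(content.substring(i, Math.min(i + chunkSize, content.length())));
        }

        // Decode chunks on all available cores, then reduce by concatenating in order.
        String decoded = chunks.parallelStream()
                               .map(ParallelDecode::decodeChunk)
                               .collect(Collectors.joining());

        System.out.println("Decoded length: " + decoded.length());
    }

    // Hypothetical chunk decoder; replace with your actual decoding logic.
    private static String decodeChunk(String chunk) {
        return chunk;
    }
}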

2 Comments

You can only reduce the time to 1/N if there are no "strictly serial" parts to the algorithm. You can't assume that the IO is not serial; in fact, on a file system like Windows's (not sure about 10) it is basically all serial even if you chunk it. See en.wikipedia.org/wiki/Amdahl%27s_law
I'd duplicate the file in something like the Hadoop file system and let each map step read its chunk.
