Java: create a simple byte offset index of a text file

Question

I would like to index every 100th line of a very large text file with its corresponding byte offset. As I read through the file with a BufferedReader to build my index, is it possible to figure out which byte position I am at?

Oscar Gomez · Accepted Answer · 2011-07-20 03:20:16Z

1

You can use:

public int read(char[] cbuf,
                int off,
                int len)
         throws IOException

Use the int return value, which is the number of characters read, and add it to a running counter, so that once you have seen 100 occurrences of this:

System.getProperty("line.separator");

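you can use this counter to get the position you are at. Note that the counter is in characters, so it equals the byte position only if every character in the file is encoded as a single byte.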
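A minimal sketch of this counting approach, under some assumptions that are not in the answer: the class name CharCountIndex and the method index are hypothetical, the code looks for '\n' rather than the platform line separator, and the Reader is assumed to decode a single-byte charset (for example new InputStreamReader(in, StandardCharsets.ISO_8859_1)) so that one character corresponds to one byte.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class CharCountIndex {

    /**
     * Returns the character offset at which lines 0, step, 2*step, ... start.
     * The offsets are byte offsets only if the reader decodes a single-byte charset.
     */
    public static List<Long> index(Reader reader, int step) {
        List<Long> offsets = new ArrayList<>();
        char[] cbuf = new char[8192];
        long count = 0;              // characters consumed before the current chunk
        long line = 0;               // number of the line currently being scanned
        boolean atLineStart = true;  // true until the first character of a line is seen
        try {
            int n;
            while ((n = reader.read(cbuf, 0, cbuf.length)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (atLineStart) {
                        if (line % step == 0) {
                            offsets.add(count + i);
                        }
                        atLineStart = false;
                    }
                    if (cbuf[i] == '\n') {
                        line++;
                        atLineStart = true;
                    }
                }
                count += n;  // the return value of read() drives the counter
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }
}
```

For the question's use case, step would be 100. With a multi-byte encoding such as UTF-8 the returned values are still valid character positions, but they can no longer be used as byte offsets for seeking.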

answered Jul 20, 2011 at 3:20

Oscar Gomez


5 Comments

lgp Over a year ago

Is System.getProperty("line.separator"); equivalent to '\n'?

pseudoramble Over a year ago

lgp: It's system dependent. The JVM on each kind of system can use a different delimiter. Windows, for example, uses \r\n instead of \n. Using this property lets you not worry about that detail across systems. On Linux systems it is \n, but this is certainly not the case on all systems, and there's no reason to assume \n (unless you can't use System.getProperty() for some reason).

Alvin Over a year ago

This is what I would do, but just using System.getProperty("line.separator"); is not enough, because you can have a Unix-style text file on Windows and vice versa. It would be better to check against "\r{0,1}\n{0,1}". But if you can do some pre-computation to convert every line separator to '\n', things will be much easier.

lgp Over a year ago

\r{0,1}\n{0,1} <- I've never seen notation like this before. What does {0,1} do?

Peter Lawrey Over a year ago

Unfortunately it matches empty string. ;) You need (\r|\r\n|\n)
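To make the regex discussion concrete, here is a small sketch that counts line-break matches for a given pattern. The class name LineBreakRegex and the helper countMatches are hypothetical names chosen for illustration. One detail worth checking when writing the alternation: java.util.regex tries alternatives from left to right, so listing \r\n before \r keeps a Windows line break from being counted as two breaks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineBreakRegex {

    /** Counts how many times the regex matches in the text. */
    public static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    /** True if the regex accepts the empty string, which makes it useless as a separator. */
    public static boolean matchesEmpty(String regex) {
        return Pattern.matches(regex, "");
    }
}
```

With the text "a\r\nb\nc" (one Windows break and one Unix break), countMatches("\\r\\n|\\r|\\n", text) returns 2, while countMatches("\\r|\\r\\n|\\n", text) returns 3 because \r and \n of the Windows break are matched separately. matchesEmpty("\\r{0,1}\\n{0,1}") returns true, which is the problem pointed out above.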

Perception · Accepted Answer · 2011-07-20 03:32:31Z

1

You could use a RandomAccessFile. Use the readLine method to get the next N lines of text, then determine your current position in the file using the getFilePointer method.

The one caveat is that readLine cannot handle Unicode text: it converts each byte into one character, so only single-byte text is read back correctly.
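A minimal sketch of this approach, assuming a hypothetical class RafLineIndex and method index that are not part of the answer. It only uses readLine to advance past each line and getFilePointer to record where the next line starts, so the decoded strings are discarded and the recorded byte offsets do not depend on how readLine decodes the text.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class RafLineIndex {

    /** Returns the byte offset at which lines 0, step, 2*step, ... start. */
    public static List<Long> index(String path, int step) {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long line = 0;
            long start = raf.getFilePointer();  // 0 for a freshly opened file
            // readLine() returns null once end of file is reached with no bytes read
            while (raf.readLine() != null) {
                if (line % step == 0) {
                    offsets.add(start);
                }
                line++;
                start = raf.getFilePointer();   // first byte of the next line
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return offsets;
    }
}
```

A later lookup can call seek(offset) on a RandomAccessFile with one of the stored values. Note that RandomAccessFile does no buffering of its own, so it may be worth measuring this against a buffered stream on a very large file.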

answered Jul 20, 2011 at 3:32

Perception


2 Comments

lgp Over a year ago

When reading through a 1 GB file, would a RandomAccessFile approach be appreciably slower than a BufferedReader approach?

Perception Over a year ago

The read speeds should be about the same for both. However, obtaining the byte offset will be much faster with a RandomAccessFile, with the important caveat that a RandomAccessFile reads non-Unicode strings only.

unkx80 · Accepted Answer · 2011-07-20 03:22:36Z

0

Using BufferedReader is no good unless you can be sure that your lines are all ASCII and the line breaks are consistent (either all CR+LF or all LF only). I suggest you use BufferedInputStream and search for '\n' instead.
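A minimal sketch of the BufferedInputStream idea, with hypothetical names (ByteOffsetIndex, index) that do not come from the answer. It counts raw bytes and treats the byte '\n' as the end of a line, which assumes an ASCII-compatible encoding such as ASCII, ISO-8859-1 or UTF-8; in UTF-16, for instance, the byte value 0x0A can occur inside other characters.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ByteOffsetIndex {

    /** Returns the byte offset at which lines 0, step, 2*step, ... start. */
    public static List<Long> index(InputStream raw, int step) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(raw)) {
            long pos = 0;                // bytes consumed so far
            long line = 0;               // number of the line currently being scanned
            boolean atLineStart = true;  // true until the first byte of a line is seen
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (line % step == 0) {
                        offsets.add(pos);
                    }
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') {         // works for both LF and CR+LF line endings
                    line++;
                    atLineStart = true;
                }
            }
        }
        return offsets;
    }

    /** Convenience overload for in-memory data, mainly useful for trying the code out. */
    public static List<Long> index(byte[] data, int step) {
        try {
            return index(new ByteArrayInputStream(data), step);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real file, the stream could come from new FileInputStream(path) or Files.newInputStream(path), with step set to 100 as in the question. Files that use a lone CR as the line break would need extra handling.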

answered Jul 20, 2011 at 3:22

unkx80
