2

I would like to index every 100th line of a very large text file with its corresponding byte offset. As I read through the file with a BufferedReader to build my index, is it possible to figure out which byte position I am at?

3 Answers

1

You can use:

public int read(char[] cbuf,
                int off,
                int len)
         throws IOException

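Use the int return value, which is the number of characters read, and add it to a running counter. Scan each chunk for line separators, and once you have seen 100 of these:

System.getProperty("line.separator");

the counter (adjusted to the index of that separator within the buffer) gives the position you are at. Note that it counts characters, not bytes, so it equals the byte offset only when the file uses a single-byte encoding such as ASCII or ISO-8859-1.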
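
A minimal sketch of this counting approach, under some assumptions: the class name CharOffsetIndex and the method index are made up for illustration, the file is taken to use a single-byte encoding (so character offsets match byte offsets), and only '\n' is counted as a line terminator (which also covers "\r\n", but not lone '\r').

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class CharOffsetIndex {
    // Returns the character offset at which lines step, 2*step, ... begin
    // (lines counted from 0). These are byte offsets only if the input
    // uses a single-byte encoding.
    static List<Long> index(Reader in, int step) throws IOException {
        List<Long> offsets = new ArrayList<>();
        char[] buf = new char[8192];
        long pos = 0;   // characters consumed so far
        long lines = 0; // '\n' terminators seen so far
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            for (int i = 0; i < n; i++) {
                pos++;
                if (buf[i] == '\n' && ++lines % step == 0) {
                    offsets.add(pos); // the next line starts right after '\n'
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Index every 2nd line of a four-line input.
        System.out.println(index(new StringReader("ab\ncd\nef\ngh\n"), 2)); // prints [6, 12]
    }
}
```

For the question's case, step would be 100 and the Reader would wrap the file with an explicit single-byte charset.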


5 Comments

Is System.getProperty("line.separator"); equivalent to '\n'?
lpg - It's system dependent. Each JVM for each kind of system could use a different delimiter. Windows, for example, uses \r\n instead of \n. Using this property allows you to not worry about that detail across systems. On Linux systems, it is \n but this is certainly not the case among all systems and there's no reason to assume \n (unless you can't use System.getProperty() for some reason).
This is what I would do, but just using System.getProperty("line.separator"); is not enough, because you can have a Unix-style text file on Windows and vice versa. It would be better to check against "\r{0,1}\n{0,1}". But if you can do some pre-computation to convert every line separator to '\n', things will be much easier.
\r{0,1}\n{0,1} <- I've never seen notation like this before. What does {0,1} do?
Unfortunately it also matches the empty string. ;) You need (\r\n|\r|\n), with \r\n listed first so that a CR+LF pair is matched as one separator rather than two.
1

You could use a RandomAccessFile. Use the readLine method to get the next N lines of text, then determine your current position in the file using the getFilePointer method.

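The one caveat is that readLine does not decode Unicode text: it turns each byte into one char, so any multi-byte encoding (UTF-8 beyond ASCII, UTF-16) comes out garbled. The byte offsets themselves are unaffected.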
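
A rough sketch of the RandomAccessFile approach, assuming a hypothetical helper class RafLineIndex (the name and the step parameter are illustrative only). It discards the line text and only records getFilePointer after every step-th line, so the decoding caveat does not affect the offsets.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class RafLineIndex {
    // Returns the byte offset at which lines step, 2*step, ... begin
    // (lines counted from 0).
    static List<Long> index(File file, int step) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long lines = 0;
            // readLine consumes "\n", "\r", or "\r\n" as one terminator.
            while (raf.readLine() != null) {
                if (++lines % step == 0) {
                    offsets.add(raf.getFilePointer()); // start of the next line
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        File tmp = Files.createTempFile("raf-index", ".txt").toFile();
        Files.write(tmp.toPath(), "one\ntwo\r\nthree\nfour\n".getBytes(StandardCharsets.US_ASCII));
        System.out.println(index(tmp, 2)); // prints [9, 20]
        tmp.delete();
    }
}
```

To jump back to an indexed line later, open the file again and call seek with the stored offset. RandomAccessFile does no buffering of its own, so it may be worth measuring this on a large file before settling on it.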

2 Comments

Reading through a 1 GB file, would a RandomAccessFile approach be appreciably slower than a BufferedReader approach?
The read speeds should be about the same for both, though RandomAccessFile is unbuffered, so it is worth measuring. Obtaining the byte offset, however, is much easier with a RandomAccessFile, with the important caveat that its readLine handles non-Unicode strings only.
0

Using BufferedReader is no good unless you can be sure that your lines are all ASCII and the line breaks are consistent (either all CR+LF or all LF only). I suggest you use BufferedInputStream and search for '\n' instead.
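
Here is one way the byte-level scan could look. ByteLineIndex and its index method are hypothetical names, and the sketch assumes an encoding in which the byte 0x0A only ever means a line feed (true for ASCII, ISO-8859-1, and UTF-8, but not for UTF-16).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class ByteLineIndex {
    // Returns the byte offset at which lines step, 2*step, ... begin
    // (lines counted from 0), counting '\n' bytes as line terminators.
    static List<Long> index(InputStream raw, int step) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(raw)) {
            long pos = 0;   // bytes consumed so far
            long lines = 0; // '\n' bytes seen so far
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n' && ++lines % step == 0) {
                    offsets.add(pos); // the next line starts right after '\n'
                }
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        // Two-byte UTF-8 characters (e-acute, u-umlaut) shift the byte offsets.
        byte[] data = "\u00e9\n\u00fc\nx\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(index(new ByteArrayInputStream(data), 1)); // prints [3, 6, 8]
    }
}
```

Because it never decodes characters, the scan gives true byte offsets for both LF and CR+LF files. For a real file, pass a FileInputStream and a step of 100.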
