I would like to index every 100th line of a very large text file with its corresponding byte offset. As I'm reading through the file with a BufferedReader to create my index, is it possible to figure out which byte position I am at?
3 Answers
You can use:
public int read(char[] cbuf, int off, int len) throws IOException
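The return value is the number of characters read. Add it to a running counter and scan the characters you read for line separators, so that when you have seen 100 of this:
System.getProperty("line.separator");
the counter tells you how far into the file you are. Note that the counter is in characters, not bytes: it equals the byte position only for a single-byte encoding such as ASCII or ISO-8859-1. For UTF-8 and other multi-byte encodings you have to count the encoded bytes instead.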
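To make the counting idea above concrete, here is a minimal sketch that counts bytes rather than characters, so the offsets can be used for seeking later. The class name LineOffsetIndex and the step parameter are made up for illustration. It assumes an encoding in which the bytes 0x0A and 0x0D only ever mean a line break (ASCII, ISO-8859-1, UTF-8; not UTF-16), and it treats \n, \r\n and a lone \r as line endings instead of relying on line.separator.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class LineOffsetIndex {

    // Returns the byte offsets at which lines 0, step, 2*step, ... begin.
    // Assumes 0x0A / 0x0D never occur inside a multi-byte character.
    static List<Long> build(Path file, int step) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;                  // offset of the byte being examined
            long lineNo = 0;               // index of the line containing that byte
            boolean lineStartsHere = true; // true if the next content byte opens a line
            boolean prevWasCr = false;     // true if the previous byte was '\r'
            int b;
            while ((b = in.read()) != -1) {
                if (prevWasCr) {
                    prevWasCr = false;
                    if (b == '\n') {       // second half of "\r\n": same terminator
                        pos++;
                        continue;
                    }
                }
                if (lineStartsHere) {
                    if (lineNo % step == 0) {
                        offsets.add(pos);
                    }
                    lineStartsHere = false;
                }
                if (b == '\n' || b == '\r') {
                    lineNo++;
                    lineStartsHere = true;
                    prevWasCr = (b == '\r');
                }
                pos++;
            }
        }
        return offsets;
    }
}
```

Something like LineOffsetIndex.build(Paths.get("big.txt"), 100) would then give the index the question asks for (the file name is a placeholder). Reading raw bytes sidesteps the problem that a BufferedReader reads ahead, so the position of the underlying stream says nothing about the line you just got back.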
5 Comments
lgp
Is System.getProperty("line.separator"); equivalent to '\n'?
pseudoramble
lgp - It's system dependent. Each JVM for each kind of system could use a different delimiter. Windows, for example, uses \r\n instead of \n. Using this property means you don't have to worry about that detail across systems. On Linux systems it is \n, but this is certainly not the case on all systems and there's no reason to assume \n (unless you can't use System.getProperty() for some reason).
Alvin
This is what I would do, but just using System.getProperty("line.separator"); is not enough, because you can have a Unix-style text file on Windows and vice versa. It would be better to check against "\r{0,1}\n{0,1}". But if you can do some pre-computation to convert every line separator to '\n', things will be much easier.
lgp
\r{0,1}\n{0,1} <- I've never seen notation like this before. What does {0,1} do?
Peter Lawrey
Unfortunately it matches the empty string. ;) You need
(\r\n|\r|\n)
with \r\n listed first, so that a Windows line ending is matched as one separator rather than two.
You could use a RandomAccessFile. Use the readLine method to get the next N lines of text, then determine your current position in the file using the getFilePointer method.
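The one caveat is that readLine does not decode Unicode: it turns each byte into one character, so text in a multi-byte encoding such as UTF-8 comes back garbled unless you re-decode it yourself.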
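A rough sketch of the RandomAccessFile approach, in case it helps. The class RafLineIndex and its two methods are invented names for illustration; only RandomAccessFile, readLine, getFilePointer and seek are real API. The lineAt method assumes the file is UTF-8 and works around the caveat by converting the returned string back to its original bytes (readLine maps each byte to one char, which is what ISO-8859-1 does) and decoding those bytes as UTF-8.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of any library.
final class RafLineIndex {

    // Returns the byte offsets at which lines 0, step, 2*step, ... begin.
    static List<Long> build(String path, int step) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long lineNo = 0;
            long start = raf.getFilePointer();   // 0 right after opening
            while (raf.readLine() != null) {     // consumes \n, \r or \r\n
                if (lineNo % step == 0) {
                    offsets.add(start);
                }
                lineNo++;
                start = raf.getFilePointer();    // where the next line begins
            }
        }
        return offsets;
    }

    // Seeks to a stored offset and returns the line there, decoded as UTF-8
    // (assumption: the file is UTF-8 encoded).
    static String lineAt(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            String raw = raf.readLine();
            if (raw == null) {
                return null;
            }
            // Recover the original bytes, then decode them properly.
            byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```

One thing to measure before committing to this on a very large file: as far as I know, RandomAccessFile.readLine does no buffering and pulls the file in one byte at a time, so building the index this way may be noticeably slower than going through a buffered stream. Using the stored offsets afterwards with seek is cheap either way.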
2 Comments
lgp
Reading through a 1 GB file, would a RandomAccessFile approach be appreciably slower than a BufferedReader approach?
Perception
The read speeds should be about the same for both. However, obtaining the byte offset will be much faster with a RandomAccessFile, with the important caveat that its readLine method does not decode Unicode strings.