
I'm using Java to extract the characters between specific indices from a text file. The file is large and I'm not allowed to load it into memory in full, so I can only read parts of it, namely the parts at those specific indices. How can I do this?

I might also be able to call the Linux terminal from within Java and use something like sed or awk, but in that case I would have to learn how to use those programs as well.

Either way, it has to be quick: the whole execution of the program must not take more than one second.

Grateful for any suggestions!

1
  • Read in a line at a time. Commented Sep 11, 2017 at 16:12

2 Answers

1

If each character index corresponds to a byte offset in the file (as it does with a single-byte encoding such as ASCII), then you can use RandomAccessFile to seek to a specific byte and read directly from there.

According to the documentation for RandomAccessFile#seek:

Sets the file-pointer offset, measured from the beginning of this file, at which the next read or write occurs.

You can do the following:

try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
    raf.seek(index); // move the file pointer; the next read starts here
}

Here, file is your text file, "r" is the mode (read-only), and index is the byte offset at which you want to begin reading.

Depending on how your text file is formatted, you can then read byte by byte until the next newline character \n. If your index does not count newline characters, you may also have to account for them when calling seek (add the number of preceding lines to your index).
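To make the seek-and-read idea concrete, here is a minimal sketch of reading the characters between two indices. The class name ByteRangeReader and the method readRange are made up for illustration, and the sketch assumes one byte per character (ASCII), so that a character index equals a byte offset. It also assumes the requested range is small enough to fit in a single byte array.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
public class ByteRangeReader {

    // Returns the characters in [start, end), assuming one byte per character.
    static String readRange(String path, long start, long end) throws IOException {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range: " + start + ".." + end);
        }
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            // Clamp the range to the file length so we never read past the end.
            long length = raf.length();
            long from = Math.min(start, length);
            long to = Math.min(end, length);

            // Only the requested range is allocated, not the whole file.
            byte[] buffer = new byte[Math.toIntExact(to - from)];

            raf.seek(from);        // jump straight to the first byte
            raf.readFully(buffer); // read exactly buffer.length bytes
            return new String(buffer, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java ByteRangeReader <file> <start> <end>");
            return;
        }
        long start = Long.parseLong(args[1]);
        long end = Long.parseLong(args[2]);
        System.out.println(readRange(args[0], start, end));
    }
}
```

Because seek jumps directly to the offset, the cost should depend on the size of the range rather than the size of the file. If the file uses a variable-width encoding such as UTF-8 with non-ASCII characters, character indices and byte offsets no longer line up and this sketch would not apply as written.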


0

You can stream the file and skip to whichever line you want. Once you have that line, you can extract a substring from it as you normally would.

Take a look at this example:

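import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

long start = System.currentTimeMillis();

// Files.lines reads lazily, so the whole file is never held in memory
try (Stream<String> lines = Files.lines(Paths.get("myfile.txt"))) {
    // Skip the first 500,000 lines and take the next one
    String line = lines.skip(500000).findFirst()
            .orElseThrow(() -> new IllegalStateException("file has too few lines"));
    String extracted = line.substring(10, 20);
    System.out.println(extracted);
} catch (IOException e) {
    e.printStackTrace();
}

System.out.println("Time taken: " + (System.currentTimeMillis() - start) / 1000.0);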

I've tested this with a 1 GB file that has 1,000,000 lines of text. It skips the first 500,000 lines and extracts a small substring from the next one.

Output: [image: test output]

3 Comments

The problem is I don't know which line I want. Only the index of the word at that line. Is there a way to discern the line number from the word index?
@ChristofferAB could you give an example of this word index?
It worked with the RandomAccessFile lib in Java. The index is the index of the first letter in the word. Each character in the text has this index.
