3

I need to implement a simple indexing scheme for a big text file. The text file contains key-value pairs, and I need to read back a specific key-value pair without loading the complete file into memory. The file is huge, containing millions of entries, and the keys are not sorted. Different key-value pairs need to be read depending on user input, so I don't want the complete file to be read every time. Please let me know the exact classes and methods in the Java file handling API that would help to implement this in a simple and efficient way. I want to do this without using an external library such as Lucene.

  • If the key-value pair you want happens to be the last one in the file, you're going to have to read the whole thing at some point. Commented Nov 22, 2011 at 18:18
  • If you try to find a key which is not there, you also have to read the whole file. To index the file you can read it once, avoiding the need to read it again. How big is the file? You might be able to load it all. Commented Nov 22, 2011 at 18:59
  • Right, he'd have to read the whole file at least once in order to index it. When you index it, you could keep track of the byte location of each key/value pair. Then, to retrieve a particular key/value pair, you would get its byte location and then skip to that location in the file to read the value. But if the values are small, all of this functionality might not be worth it because the index itself will take up a lot of memory. Commented Nov 22, 2011 at 19:21
  • @Michael: I need to know the exact classes and functions in Java that would do it in a simple way. The text file is huge, containing millions of entries, and the values are long strings. So I need to implement a small index and a way to skip to a particular entry in the text file. Commented Nov 22, 2011 at 19:48
  • @vjain27 I don't know how to do random access with a file, but I know it's possible. Try using the SeekableByteChannel class. In terms of the index, you could probably just use a Map, where the map's key is your key and the map's value is the byte location of the key/value pair in the file. Commented Nov 23, 2011 at 14:08

3 Answers

5

As the comments pointed out, you're going to need to do a linear search of the entire file in the worst case, and half of it on average. But fortunately there are some tricks you can do.

If the file doesn't change much, then create a copy of the file in which the entries are sorted. Ideally make records in the copy the same length, so that you can go straight to the Nth entry in the sorted file.

If you don't have the disk space for that, then create an index file, which has all the keys in the original file as the key and the offset into the original file as the value. Again, use fixed-length records. Or better, make this index file a database. Or load the original file into a database. In either case, disk storage is very cheap.
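To illustrate why fixed-length records help, here is a minimal sketch of an index file that can be binary-searched directly on disk, without loading it. The class name FixedIndex, the 32-byte key limit, and the record layout (zero-padded UTF-8 key followed by an 8-byte offset) are all assumptions made for this example, not something the question specifies.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;

class FixedIndex {
    static final int KEY_BYTES = 32;                        // assumed maximum key length in UTF-8 bytes
    static final int RECORD_BYTES = KEY_BYTES + Long.BYTES; // padded key + 8-byte offset

    // Writes one fixed-length record per key. The map must use natural String
    // ordering (e.g. a plain TreeMap) so that find() can compare the same way.
    static void write(String indexPath, SortedMap<String, Long> offsets) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(indexPath)))) {
            for (Map.Entry<String, Long> e : offsets.entrySet()) {
                byte[] key = e.getKey().getBytes(StandardCharsets.UTF_8);
                if (key.length > KEY_BYTES) {
                    throw new IllegalArgumentException("key too long: " + e.getKey());
                }
                out.write(Arrays.copyOf(key, KEY_BYTES)); // zero-padded to KEY_BYTES
                out.writeLong(e.getValue());
            }
        }
    }

    // Binary search over the records on disk; returns the stored offset, or -1 if absent.
    static long find(String indexPath, String key) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(indexPath, "r")) {
            long lo = 0;
            long hi = raf.length() / RECORD_BYTES - 1;
            byte[] buf = new byte[KEY_BYTES];
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                raf.seek(mid * RECORD_BYTES);   // jump straight to the mid-th record
                raf.readFully(buf);
                int cmp = key.compareTo(decodeKey(buf));
                if (cmp == 0) return raf.readLong();
                if (cmp < 0) hi = mid - 1; else lo = mid + 1;
            }
        }
        return -1;
    }

    // Strips the zero padding and decodes the key.
    private static String decodeKey(byte[] padded) {
        int len = padded.length;
        while (len > 0 && padded[len - 1] == 0) len--;
        return new String(padded, 0, len, StandardCharsets.UTF_8);
    }
}
```

Each lookup touches roughly log2(N) records, so a few dozen seeks are enough even for millions of keys. The offset returned by find() is whatever was stored for that key, typically the byte position of the entry in the original file.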

EDIT: To create the index file, open the main file using RandomAccessFile and read it sequentially. Use the 'getFilePointer()' method at the start of each entry to read the position in the file, and store that plus the key in the index file. When looking up something read the file pointer from the index file and use the 'seek(long)' method to jump to the point in the original file.
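A minimal sketch of the getFilePointer()/seek(long) steps described above. It assumes one key=value entry per line, encoded as UTF-8; the class name OffsetIndex and the '=' separator are made up for the example. For brevity it keeps the index in a HashMap rather than writing it to an index file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class OffsetIndex {
    // Reads the data file sequentially once and records the byte offset
    // at which each entry (line) starts.
    static Map<String, Long> build(String dataPath) throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(dataPath, "r")) {
            long offset = raf.getFilePointer();       // start of the first entry
            String line;
            while ((line = raf.readLine()) != null) {
                int sep = line.indexOf('=');          // assumed key/value separator
                if (sep > 0) {
                    index.put(decode(line.substring(0, sep)), offset);
                }
                offset = raf.getFilePointer();        // start of the next entry
            }
        }
        return index;
    }

    // Jumps straight to the entry for the key and returns its value, or null if absent.
    static String lookup(String dataPath, Map<String, Long> index, String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) return null;
        try (RandomAccessFile raf = new RandomAccessFile(dataPath, "r")) {
            raf.seek(offset);
            String line = decode(raf.readLine());
            return line.substring(line.indexOf('=') + 1);
        }
    }

    // RandomAccessFile.readLine() turns each byte into one char, so recover
    // the original bytes and decode them as UTF-8.
    private static String decode(String raw) {
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

Note that RandomAccessFile.readLine() is unbuffered, so the initial scan of a very large file can be slow. A buffered stream that counts bytes as it reads is a likely faster alternative for the scan, while seek(long) remains the way to do the lookup.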


1 Comment

Actually I wanted to ask how to create the index file that you mentioned using java file handling api and which classes/methods will be helpful in creating and reading the index.
4

I'd recommend building an index file. Scan the input file and write every key and its offset into a List, then sort the list and write it to the index file. Then, whenever you want to look up a key, you read in the index file and do a binary search on the list. Once you find the key you need, open the data file as a RandomAccessFile and seek to the position of the key. Then you can read the key and the value.
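The steps above could be sketched roughly as follows. This assumes one key=value entry per line in UTF-8, and a plain-text index file with one "offset<TAB>key" line per entry; the class name SortedIndex and both formats are choices made for the example only.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedIndex {
    static final class Entry implements Comparable<Entry> {
        final String key;
        final long offset;
        Entry(String key, long offset) { this.key = key; this.offset = offset; }
        @Override public int compareTo(Entry o) { return key.compareTo(o.key); }
    }

    // Scans the data file once, collecting each key with the byte offset of its line, then sorts.
    static List<Entry> scan(Path data) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(data))) {
            ByteArrayOutputStream key = new ByteArrayOutputStream();
            boolean inKey = true;            // true until the '=' of the current line is seen
            long pos = 0, lineStart = 0;
            for (int b; (b = in.read()) != -1; pos++) {
                if (b == '\n') {
                    if (!inKey) entries.add(new Entry(utf8(key), lineStart));
                    key.reset();
                    inKey = true;
                    lineStart = pos + 1;
                } else if (inKey && b == '=') {
                    inKey = false;
                } else if (inKey) {
                    key.write(b);
                }
            }
            if (!inKey) entries.add(new Entry(utf8(key), lineStart)); // last line without newline
        }
        Collections.sort(entries);
        return entries;
    }

    // Writes the sorted entries as "offset<TAB>key" lines.
    static void save(Path indexFile, List<Entry> sorted) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Entry e : sorted) lines.add(e.offset + "\t" + e.key);
        Files.write(indexFile, lines, StandardCharsets.UTF_8);
    }

    // Reads the index file back; the entries are already in sorted order.
    static List<Entry> load(Path indexFile) throws IOException {
        List<Entry> sorted = new ArrayList<>();
        for (String line : Files.readAllLines(indexFile, StandardCharsets.UTF_8)) {
            int tab = line.indexOf('\t');
            sorted.add(new Entry(line.substring(tab + 1), Long.parseLong(line.substring(0, tab))));
        }
        return sorted;
    }

    // Binary-searches the index, then seeks into the data file; returns the value or null.
    static String lookup(Path data, List<Entry> sorted, String key) throws IOException {
        int i = Collections.binarySearch(sorted, new Entry(key, -1));
        if (i < 0) return null;
        try (RandomAccessFile raf = new RandomAccessFile(data.toFile(), "r")) {
            raf.seek(sorted.get(i).offset);
            String raw = raf.readLine();     // one char per byte, re-decoded as UTF-8 below
            String line = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
            return line.substring(line.indexOf('=') + 1);
        }
    }

    private static String utf8(ByteArrayOutputStream bytes) {
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A typical flow would be to call save(indexFile, scan(data)) once, then load(indexFile) at startup and lookup(data, index, key) per user request. The loaded list still holds every key in memory, so this fits best when keys are short compared to the values.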


0

I'd recommend building an index: either in-memory or on-disc. Then, whenever you want to read back a specific key value pair, you can do that in pretty much constant time O(1)¹

Let's say we have file:

Rust ➡️ if it works, fix it anyway
Ruby ➡️ easy to understand, especially after the first five years
JavaScript ➡️ car is to carpet as Java is JavaScript
C ➡️ I don't care what you think, I'm faster than you

In-memory

Scan file once to build an index:

String pattern = "(.*) ➡️ (.*)";
var file = Files.newByteChannel(Paths.get("file"));

Map<String, Integer> index = new Scanner(file)
        .findAll(pattern)

and store it in RAM:

        // Note: start() is a char index, while position() below takes a byte offset.
        .collect(toMap(record -> record.group(1), MatchResult::start));

then, if the user enters JavaScript, look it up in the index:

int offset = index.get("JavaScript");
new Scanner(file.position(offset))
        .findAll(pattern)
        .map(MatchResult::group)
        .findFirst()
        .ifPresent(System.out::println);

― will print:

JavaScript ➡️ car is to carpet as Java is JavaScript

Persistent

We need a data structure optimized for disc. Luckily, it is already implemented for us by the file system. Given a file path, the file system finds the file on a disc.

Hence, if we encode our keys in terms of file paths and our offsets in terms of files, we can delegate the job to the file system.

Just store the index on a disc instead of in RAM:

        .forEach(record -> Files.writeString(Paths.get(record.group(1)),
                                        String.valueOf(record.start()))); // try catch

and adjust the lookup accordingly:

int offset = Integer.parseInt(Files.readString(Paths.get("JavaScript")));

¹ assuming entry size has a fixed upper bound; and depending on the file system and whether it's fragmented
