I have a program that creates two lists of numbers, sorted and stored on disk. The task is to merge the two lists using a linear merge; since both are already sorted, this is fast. One list contains 100,000 items and the other around 80,000. Each item is a 4-byte integer. Here is the file layout:
list1 length | list1 | list2 length | list2
There are two techniques to merge them:
- Use a buffer of, say, 1024*16 bytes and read from each list in turn: seek to the current position in list1, fetch a buffer, then do the same for list2. Some integers may not be consumable yet because their counterparts are not in the current buffer, so carry the unconsumed tail over to be merged when the next buffer is fetched. Repeat until both lists are exhausted. This may require seeking many times.
- Create two file pointers, one per list: seek the first to the start of list1 and the second to the start of list2, then read from each as the merge advances.
My question: is reading buffer by buffer better (you have to seek many times into each list), or is using multiple file pointers better? Is opening a single file multiple times good practice? I think at the hardware level both are the same, but is there any difference at the software level or from the operating system's side? The two lists are just an example; I have multiple lists to merge.
Are you using fread()? And are you aiming to omit duplicate entries? If so, do you omit them from list1 or list2 (or both)?