I have a program that creates two lists of numbers, sorted and stored on disk. The task is to merge the two lists using a linear merge; since both are already sorted, this is fast. One list contains 100,000 items and the other around 80,000. Each item is a 4-byte integer. Here is the file layout:
list1 length | list1 | list2 length | list2
There are two techniques to merge them:
- Use a buffer of, say, 1024*16 bytes and read from each list in turn: seek to the current position in list1, fetch a buffer, then do the same for list2. Some integers may not be consumable yet because their counterparts are not in the current buffer, so carry the unconsumed tail over to be merged when the next buffer is fetched. Repeat until both lists are exhausted. This may require seeking many times.
- Create two file pointers, one per list: seek the first to the start of list1 and the second to the start of list2, then read from each as the merge advances.
My question: is reading buffer by buffer better (you have to seek many times into each list), or is using multiple file pointers better? Is opening a single file multiple times good practice? I think at the hardware level both are the same, but is there any difference at the software level or from the operating system's side? The two lists are just an example; I have multiple lists to merge.
Are you using fread()? And are you aiming to omit duplicate entries? If so, do you omit them from list1 or list2 (or both)?