
I am writing a piece of code to read in several GB of data that spans multiple files using C++ IOStreams, which I've chosen over the C API for a number of design reasons that I won't bore you with. Since the data is produced by a separate program on the same machine where my code will run, I am confident that issues such as endianness can, for the most part, be ignored.

The files have a reasonably complicated structure. For example, there is a header that describes the number of records of a particular binary layout, and later in the file the code must conditionally read exactly that many records. This sort of pattern is repeated in a complicated, but well-documented, way.

My question is about how to do this efficiently. I'm sure my process is going to be IO-limited, so my instinct is that rather than reading in data in smallish blocks, such as the following approach

std::vector<int> buffer(500);  // note: the vector must actually be sized; reserve() alone leaves it empty
file.read(reinterpret_cast<char*>(buffer.data()), 500 * sizeof(int));

I should read in one file entirely at a time and try to process it in memory. So my interrelated questions:

  • Given that this seems to mean reading the data into a char* or std::vector<char> buffer, how would you best go about converting that buffer into the data structures needed to correctly represent the file's contents?
  • Are my assumptions incorrect?

I know the obvious answer is to try and then to profile later, and profile I certainly will. But this question is more about how to pick the right approach at the beginning - a sort of "pick the right algorithm" optimisation, rather than the sort of optimisations that I could envisage doing after identifying bottlenecks later on!

I'll be interested in the answers offered up - I tend to only be able to find answers for relatively simple binary files, for which the approach above is suitable. My problem is that the bulk of the binary data is structured conditionally on the numbers in the header to the file (even the header is formatted this way!) so I need to be able to process the file a little more carefully.

Thanks in advance.

EDIT: Some comments coming through about memory mapping - it looks good, but I'm not sure how to do it, and everything I've read tells me it isn't portable. I'm interested in trying mmap, but also in more portable solutions (if any!)

  • One option to consider is to mmap the whole file in memory and then go with chars or small blocks. Commented Nov 17, 2011 at 21:14
  • Very interesting.. :) Can you elaborate on "binary data is structured conditionally on the numbers in the header to the file", e.g. is it as simple as the header describes the number ints in this file? Commented Nov 17, 2011 at 21:14
  • 1
    @thekashyap: Not quite. The file is describing certain objects and will use differently structured binary records for doing so. As a simplified example, it might tell be there are ten cats and 5 catflaps. A "cat" record might be structured as (int numWhiskers, int numLives, double tailLength) and a "catflap" might be (double height, double width, int numberOfOpenings). The header then tells me how many of each record to read when I get to that bit of the file. I hope that makes it clearer! Commented Nov 17, 2011 at 21:19
  • Do you always need the objects in the same order? Why not just read the file in large chunks and pass them to a parser? Commented Nov 17, 2011 at 21:22
  • @DavidSchwartz: Not sure what you mean - the file presents the objects to me in a particular order, so that's how I've been reading them. Think I've misunderstood what you are saying :) Commented Nov 17, 2011 at 21:24

4 Answers


Use a 64-bit OS and memory map the file. If you need to support a 32-bit OS as well, use a compatibility layer that maps chunks of the file as needed.

Alternatively, if you always need the objects in file order, just write a sane parser to handle the objects in chunks. Like this:

1) Read in 512KB of file.

2) Extract as many objects as possible from the data we read.

3) Read in as many bytes as needed to fill the buffer back up to 512KB. If we read no bytes at all, stop.

4) Go to step 2.


7 Comments

Thanks - is there a way of doing this in a portable way?
Reading a raw binary file is non-portable from one architecture to another.
@Fritz There is no way of doing any of what you are suggesting portably.
@DavidSchwartz: Ah, ok - so if I want portability, I need to do it the (potentially) slower way. Otherwise I need to tie myself to a particular OS, or support multiple approaches using a compiler switch. Probably explains why I've struggled to find anything!
Right, the updated step-by-step approach is sort of what I was trying to get at above. My question is: if I pull my 512K into a char-like buffer, and in the middle of it there are (for example) 10 records of { int32, int32, float, double }, what is the most sane/efficient way of converting the char data from the array into the format of my record? For a simple example, what's the most efficient way of getting the 8 chars into the double?
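One minimal answer to that last comment is std::memcpy into the target type, which avoids alignment and strict-aliasing problems and which mainstream compilers turn into a single load:

```cpp
#include <cstring>

// Copy raw bytes into a typed value; safe regardless of the buffer's
// alignment, and optimised to a plain load by GCC/Clang/MSVC.
double bytesToDouble(const char* bytes)
{
    double d;
    std::memcpy(&d, bytes, sizeof d);
    return d;
}
```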

You could mmap some file segments (or the entire file, at least on a 64-bit machine). Perhaps also use madvise and, in a separate thread, readahead (note that madvise is POSIX and readahead is Linux-specific).

Comments


I guess you already have enough to start off. Memory mapping is certainly a neat idea as long as you have enough RAM; otherwise, read in big chunks.

Once the data is available in memory (the whole file or a big chunk), the simplest way to read it is to:

  • define an appropriate struct
  • create a pointer to appropriate offset in the memory where data is loaded
  • reinterpret_cast the pointer to a pointer to the appropriate struct, or to an array of such structs.

You can use #pragma pack to ensure the packing size/order etc. if needed. But again, this is OS/compiler-dependent.
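A sketch of that recipe, using the hypothetical "cat" record from the question's comments. Note the caveat in the comment: the cast is what this answer describes, but it is technically undefined behaviour, with memcpy as the fully conforming alternative.

```cpp
#include <cstdint>
#include <cstring>

// Packed so the in-memory layout matches the on-disk bytes exactly
// (the pragma is compiler-specific, but GCC, Clang and MSVC accept it).
#pragma pack(push, 1)
struct CatRecord
{
    std::int32_t numWhiskers;
    std::int32_t numLives;
    double       tailLength;
};
#pragma pack(pop)

// Overlay the struct on the loaded bytes, as the bullet list describes.
// Strictly speaking this cast violates aliasing/alignment rules;
// std::memcpy into a CatRecord is the standard-conforming alternative.
inline const CatRecord* viewRecordAt(const char* data, std::size_t offset)
{
    return reinterpret_cast<const CatRecord*>(data + offset);
}
```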

5 Comments

Ah, an offset pointer in the memory! Of course! That's what I'm after! Thank you all (when I can upvote, I will!)
That's all I could think of. Glad it's not my job :)
It's my job to make it work, but the old way takes about 6 hours and I need the new way to work faster! :)
@Fritz sorry, was the first comment sarcasm? If it was I would be happy to show more, though your question made me think you already know a lot. I did exactly this 5 years back (to improve performance obviously). If it wasn't happy to help.
@thekashyap: no, not sarcasm! more of a slapping my forehead moment!

Well, OK, the header is of variable length, but you have to start somewhere. If you have to read in the whole file first, it can get a bit messy. The whole file can be represented by a struct containing the header (up to some length descriptor) followed by a byte array; you can start there. Once you have the header length, you can set a pointer/length pair for an array of header entries, iterate them, and from those set a pointer/length pair for an array of file-content structs, and so on.

All the various arrays of structs would probably need to be packed?

Nasty. I don't really like my own design:(

Anyone got a better idea, other than rewriting the 'separate program' to use a database or XML or something?

Comments
