
I have a complex interpreter reading in commands from (sometimes) multiple files (the exact details are out of scope), and it requires iterating over these multiple files (some could be GB in size, preventing nice buffering) multiple times.

I am looking to increase the speed of reading in each command from a file.

I have used the RDTSC (time stamp counter) register to micro-benchmark the code enough to know that >80% of the time is spent reading in from the files.

Here is the thing: the program that generates the input file literally writes the file faster than my small interpreter can read it back in. i.e. instead of outputting the file I could (in theory) just link the generator of the data to the interpreter and skip the file, but that shouldn't be faster, right?

What am I doing wrong? Or is writing supposed to be 2x to 3x (at least) faster than reading from a file?

I have considered mmap, but some of the results at http://lemire.me/blog/archives/2012/06/26/which-is-fastest-read-fread-ifstream-or-mmap/ appear to indicate it is no faster than ifstream. Or would mmap help in this case?

details:

I have (so far) tried adding a buffer, tweaking parameters, and removing the ifstream buffer (that slowed it down by 6x in my test case); I am currently at a loss for ideas after searching around.

The important section of the code is below. It does the following:

  1. if data is left in the buffer, copy from the buffer to memblock (where it is then used)
  2. if the buffer is empty, check how much data is left in the file; if more than a buffer's worth remains, copy a buffer-sized chunk
  3. if less than a full buffer remains in the file, copy only what is left

    //if data in buffer
    if(leftInBuffer[activefile] > 0)
    {
        //cout <<bufferloc[activefile] <<"\n";
        memcpy(memblock,(buffer[activefile])+bufferloc[activefile],16);
        bufferloc[activefile]+=16;
        leftInBuffer[activefile]-=16;
    }
    else //buffers blank
    {
        //read in block
    
        long blockleft = (cfilemax - cfileplace) / 16;
        int read = 0;
    
    /* slow block starts here */
    
        if(blockleft >= MAXBUFELEMENTS)
        {
            currentFile->read((char *)(&(buffer[activefile][0])),16*MAXBUFELEMENTS);
            leftInBuffer[activefile] = 16*MAXBUFELEMENTS;
            bufferloc[activefile]=0;
            read =16*MAXBUFELEMENTS;
        }
        else //read in part of the block
        {
            currentFile->read((char *)(&(buffer[activefile][0])),16*(blockleft));
            leftInBuffer[activefile] = 16*blockleft;
            bufferloc[activefile]=0;
            read =16*blockleft;
        }
    
     /* slow block ends here */
    
        memcpy(memblock,(buffer[activefile])+bufferloc[activefile],16);
        bufferloc[activefile]+=16;
        leftInBuffer[activefile]-=16;
    }
    

edit: this is on a Mac, OS X 10.9.5, with an i7 and an SSD

Solution:

As was suggested below, mmap was able to increase the speed by about 10x.

(For anyone else who searches for this) specifically, open with:

    //needs <fcntl.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>, <cstdlib>
    uint8_t * openMMap(string name, long & size)
    {
        int m_fd;
        struct stat statbuf;
        uint8_t * m_ptr_begin;

        if ((m_fd = open(name.c_str(), O_RDONLY)) < 0)
        {
            perror("can't open file for reading");
            exit(EXIT_FAILURE);
        }

        if (fstat(m_fd, &statbuf) < 0)
        {
            perror("fstat in openMMap failed");
            exit(EXIT_FAILURE);
        }

        if ((m_ptr_begin = (uint8_t *)mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, m_fd, 0)) == MAP_FAILED)
        {
            perror("mmap in openMMap failed");
            exit(EXIT_FAILURE);
        }

        size = statbuf.st_size;
        return m_ptr_begin;
    }

read by:

    uint8_t * mmfile = openMMap("my_file", length);

    uint32_t * memblockmm;
    memblockmm = (uint32_t *)mmfile; //cast file to uint32 array
    uint32_t data = memblockmm[0]; //take one int
    mmfile += 4; //advance by 4: each entry read is 32 bits, and mmfile is a pointer to 8-bit bytes
  • On which operating system? What kind of interpreter are you coding? Do you represent some kind of abstract syntax trees in memory? Can't you use a pipe to connect the generator process to the parsing process? Commented Feb 12, 2015 at 18:04
  • Right, I forgot: Mac, OS X 10.9.5, on an i7 with an SSD. The interpreter takes in a program that composes instructions to represent a boolean circuit (+ function calls and copy instructions). Each is a uniform 16 bytes. In the compiler that generates the data there is a nice AST structure (I spent a good deal of time on that, actually), but in the output file there is no structure, only gate commands, copy commands, and function calls. Commented Feb 12, 2015 at 18:09
  • I have thought about using a pipe of some sort; at the moment that is the file. It is not supposed to stay connected, i.e. compile once and use the input file many times (i.e. my thought is it should be faster to read it back in). Commented Feb 12, 2015 at 18:12
  • Could it be that the write operation is going to a buffer (OS side) and thus returning faster than actually waiting for a read? Commented Feb 12, 2015 at 18:15
  • "instead of outputting the file i could (in theory) just link the generator of the data to the interpreter and skip the file but that shouldn't be faster, right?" - Why wouldn't that be faster? Writing to disk is always going to be a huge bottleneck. I would be tempted to have the generator write to standard output and the interpreter read from standard input, and pipe the two. Commented Feb 12, 2015 at 18:36

2 Answers


This should be a comment, but I don't have 50 reputation to make a comment.

What is the value of MAXBUFELEMENTS? From my experience, many smaller reads are far slower than one read of a larger size. I suggest reading the entire file in if possible; some files could be GBs, but even reading in 100 MB at once would perform better than reading 1 MB 100 times.

If that's still not good enough, the next thing you can try is to compress (zlib) the input files (you may have to break them into chunks due to size) and decompress them in memory. This method is usually faster than reading in uncompressed files.


2 Comments

I played with it and decided to use 512 KB as the buffer size after reading some information online, so 1024*512/16.
Yeah, I would try to max it out to the order of 100 MB, since some input files are GBs. Also try piping between the processes as people suggested. A last resort would be to double buffer, using another thread or async read operations to preload the next buffer.

As @Tony Jiang said, try experimenting with the buffer size to see if that helps.

Try mmap to see if that helps.

I assume that currentFile is a std::ifstream? There's going to be some overhead for using iostreams (for example, an istream will do its own buffering, adding an extra layer to what you're doing); although I wouldn't expect the overhead to be huge, you can test by using open(2) and read(2) directly.

You should be able to run your code through dtruss -e to verify how long the read system calls take. If those take the bulk of your time, then you're hitting OS and hardware limits, so you can address that by piping, mmap'ing, or adjusting your buffer size. If those take less time than you expect, then look for problems in your application logic (unnecessary work on each iteration, etc.).

1 Comment

After some pain I tried mmap, and now I have it working about 10x faster than it previously was. Thanks!
