
I have a piece of code that reads a file line by line and stores two corresponding binary representations in two vectors. But the sizes of the vectors and the total number of lines processed both come out as zero.

    int numLines = 319426908; // calculated before for loop
    char temp[100];
    vector<long long int> p1, p2;
    long long int c = 0;

    #pragma omp parallel for schedule(dynamic) shared(c, p1, p2, fp) private(temp)
    for(int i=0; i<numLines; i++){
      if(fgets(temp, 100, fp) != NULL){
        temp[strlen(temp)-1] = '\0';
        long long int *A = toLongLong(temp);
        p1.push_back(A[0]);
        p2.push_back(A[1]);
        c++;
      }
    }

    cout << "Completed ...c = " << c << endl;
    cout << "p1.size: " << p1.size() << " p2.size: " << p2.size() << endl;

This is the output:

    Completed ...c = 0
    p1.size: 0 p2.size: 0

Where am I going wrong in the above piece of code?

  • You have a race condition. The push_back operations, as well as the c++ (which should be ++c), are critical and have to go into a critical section. But reading from a file in parallel is usually not a good idea (there are file formats designed to be written and read in parallel, such as HDF5). Commented Apr 3, 2017 at 23:33
  • Doesn't the shared clause take care of that? Should I use #pragma omp critical for the block where I am updating p1, p2, and c? Commented Apr 3, 2017 at 23:34
  • No, shared merely states explicitly that a variable is shared between threads; synchronisation has to be handled by the user. Yes, you have to use a critical section, but this will kill all the speedup you are expecting to get. Commented Apr 3, 2017 at 23:39
  • Thanks!! I am now trying with #pragma omp critical. My input file is a txt file containing fixed-size strings, one per line. The number of lines is almost 300 million. Commented Apr 3, 2017 at 23:42

2 Answers


Read all input first, then parallelize the processing of that.

I'm not sure if fgets() is thread-safe (on POSIX systems, stdio streams lock internally, so it is there), but:

  • If it is, you won't get much benefit from parallelizing it, because of the locking needed to guarantee that no other thread is updating the stream (unless it uses a non-blocking approach, which I believe is not the case here).
  • If it isn't, you might end up reading the same data multiple times, or even corrupted data.

Another option is splitting the file and processing the pieces separately. Either way: if the file is not too big, try reading it all and then processing it. If it is too big, load a batch of records (say, two thousand lines), process those in parallel, and repeat, as in the sketch below. You may need to do some testing to see which approach suits you better.
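A minimal sketch of that batched approach (assumptions: the toLongLong helper from the question, given here the std::array signature used in the answer below; "input.txt" as a hypothetical file name; a batch size of 2000 picked arbitrarily as a starting point):

    #include <array>
    #include <cstddef>
    #include <fstream>
    #include <string>
    #include <vector>

    std::array<long long,2> toLongLong(std::string); // assumed helper, not shown in the question

    int main()
    {
      std::vector<long long> p1, p2;
      std::ifstream file("input.txt");        // hypothetical file name

      const std::size_t batch_size = 2000;    // tune experimentally
      std::vector<std::string> batch;
      batch.reserve(batch_size);

      std::string line;
      bool more = true;
      while (more)
      {
        // Sequential: fill one batch from the file
        batch.clear();
        while (batch.size() < batch_size && std::getline(file, line))
          batch.push_back(line);
        more = (batch.size() == batch_size);

        // Parallel: each iteration writes to its own slot, so no locking is needed
        std::vector<long long> t1(batch.size()), t2(batch.size());
        #pragma omp parallel for
        for (long i = 0; i < (long)batch.size(); ++i)
        {
          std::array<long long,2> A = toLongLong(batch[i]);
          t1[i] = A[0];
          t2[i] = A[1];
        }

        // Sequential: append the batch results, preserving file order
        p1.insert(p1.end(), t1.begin(), t1.end());
        p2.insert(p2.end(), t2.begin(), t2.end());
      }
    }

This keeps all synchronization out of the hot loop: only the file reading and the final appends run sequentially.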

Edit: As the others said, you might want to check the concurrent access to those variables too. To put it all together, maybe use a reduction to build the final result, as sketched below? See: C++ OpenMP Parallel For Loop - Alternatives to std::vector
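For the reduction route, OpenMP 4.0's declare reduction can merge thread-private vectors automatically. A sketch, again assuming the toLongLong helper (note the combiner order is unspecified, so the output is not guaranteed to be in file order):

    #include <array>
    #include <string>
    #include <vector>

    std::array<long long,2> toLongLong(std::string); // assumed helper

    // User-defined reduction: concatenate thread-private vectors (OpenMP 4.0+)
    #pragma omp declare reduction(merge : std::vector<long long> : \
        omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

    int main()
    {
      std::vector<std::string> lines;  // filled by reading the file first, as above
      std::vector<long long> p1, p2;

      #pragma omp parallel for reduction(merge : p1, p2)
      for (long i = 0; i < (long)lines.size(); ++i)
      {
        std::array<long long,2> A = toLongLong(lines[i]);
        p1.push_back(A[0]);  // each thread pushes into its own private copy
        p2.push_back(A[1]);
      }
    }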


To add an explicit example to MateusMP's answer: you should refactor your code to use more of the C++ standard library. There are many functions and data structures that will make your life easier and your code more readable.

#include <array>
#include <fstream>
#include <string>
#include <vector>

std::array<long long,2> toLongLong(std::string);

int main()
{
  int c = 0;
  std::vector<long long> p1;
  std::vector<long long> p2;

  int n_lines = 10;
  std::vector<std::string> lines(n_lines);

  // Read the file
  std::ifstream file("test.txt" , std::ios::in);
  for ( int i = 0; i < n_lines; ++i )
    std::getline(file, lines[i]);
  file.close();

  // Process the contents in parallel
  #pragma omp parallel
  {
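    // Thread-private accumulators: the loop body touches only these,
    // so no synchronization is needed inside the parallel for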
    int c_local = 0;
    std::vector<long long> p1_local;
    std::vector<long long> p2_local;

    #pragma omp for
    for ( int i = 0; i < n_lines; ++i )
    {
      std::array<long long,2> A = toLongLong(lines[i]);
      p1_local.push_back(A[0]);
      p2_local.push_back(A[1]);
      ++c_local;
    }

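    // Merge the per-thread results; the critical section serializes only these short appends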
    #pragma omp critical
    {
      p1.insert(p1.end(), p1_local.begin(), p1_local.end());
      p2.insert(p2.end(), p2_local.begin(), p2_local.end());
      c += c_local;
    }
  }
}

This compiles but doesn't link because toLongLong is not implemented.
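If you want it to link, here is one hypothetical stub; the question never shows the real line format, so this simply assumes two whitespace-separated integers per line:

    #include <array>
    #include <sstream>
    #include <string>

    // Hypothetical: assumes each line holds two whitespace-separated integers.
    // Replace with whatever conversion the original toLongLong performed.
    std::array<long long,2> toLongLong(std::string s)
    {
      std::istringstream in(s);
      std::array<long long,2> a{};
      in >> a[0] >> a[1];
      return a;
    }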

1 Comment

Yes, this is a safe approach, but I did not want to store the lines in memory.
