
A long time ago, inspired by "Numerical Recipes in C", I started using the following construct for storing matrices (2D arrays).

double **allocate_matrix(int NumRows, int NumCol)
{
  double **x;
  int i;

  x = (double **)malloc(NumRows * sizeof(double *));
  for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
  return x;
}

double **x = allocate_matrix(1000,2000);
x[m][n] = ...;

But I recently noticed that many people implement matrices as follows:

double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCols * m + n] = ...;

From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, with its auxiliary array of double* pointers, really bad, or will the compiler eventually optimize it to be more or less equivalent in performance to the second method? I am suspicious because I think that in the first method two jumps are made when accessing a value, x[m] and then x[m][n], so there is a chance that each time the CPU will first load the x array and then the x[m] array.

P.S. Do not worry about the extra memory for storing the double* pointers; for large matrices it is just a small percentage.

P.P.S. Since many people did not understand my question very well, I will try to rephrase it: do I understand correctly that the first method is a kind of locality hell, where each time x[m][n] is accessed, first the x array is loaded into the CPU cache and then the x[m] array, thus making each access run at the speed of talking to RAM? Or am I wrong, and the first method is also OK from the data-locality point of view?

  • "From the locality point of view the second method seems perfect, but has awful readability... " I would say both are awful, as they both use pointers and malloc IMO. Commented May 17, 2017 at 16:21
  • @John Smith It is the C approach to allocating arrays. And the compiler cannot optimize the memory allocation because it is performed at run-time. Commented May 17, 2017 at 16:24
  • @John Smith it is definitely worth it to ensure the 2D array is allocated as a contiguous block. As an answer points out, you can use a wrapper to do this while maintaining your "pretty" interface. But what would be even more beneficial in the long run is to use a good matrix library, as it will do all of this for you plus all kinds of other optimizations (expression templates, SIMD, etc.). Commented May 17, 2017 at 16:30
  • If this is a practical question rather than academic, do what @RyanP suggests. Download Eigen. Then your code becomes Eigen::MatrixXd x(NumRows, NumCols); and x(m, n) = ...; Commented May 17, 2017 at 16:50
  • @RyanP, I never understood why I need a good library for doing simple things. Will they do x[m][n] access using some super-secret method? Some people say "you have to use STL vectors", but they probably do not remember that a few years ago the Microsoft implementation of vectors was awful, and vector<vector<double> > was many times slower than a simple array... Commented May 17, 2017 at 16:54

5 Answers


For C-style allocations you can actually have the best of both worlds:

double **allocate_matrix(int NumRows, int NumCol)
{
  double **x;
  int i;

  x = (double **)malloc(NumRows * sizeof(double *));
  x[0] = (double *)calloc(NumRows * NumCol, sizeof(double)); // <<< single contiguous memory allocation for entire array
  for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
  return x;
}

This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or a flattened 2D array (array[i * NumCols + j]) interchangeably. You also have fewer calloc/free calls (2 versus NumRows + 1).
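The matching deallocation for this layout takes only two calls. A minimal self-contained sketch (the name free_matrix is ours, not from the answer):

```cpp
#include <cstdlib>

// Contiguous-layout allocation as in the answer above, plus its
// matching deallocator. All row data lives in one block at x[0].
double **allocate_matrix(int NumRows, int NumCol)
{
  double **x = (double **)malloc(NumRows * sizeof(double *));
  x[0] = (double *)calloc((size_t)NumRows * NumCol, sizeof(double));
  for (int i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCol;
  return x;
}

void free_matrix(double **x)
{
  free(x[0]); // one free for the whole contiguous data block
  free(x);    // one free for the row-pointer array
}
```

Because the rows are contiguous, &x[m][NumCol] == &x[m + 1][0], so the same buffer can also be handed to routines that expect a flat array.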


6 Comments

but will this method have a lot of cache misses or not?
No, that's the whole point of using a single large contiguous allocation for the underlying data.
then maybe I misunderstand the way the CPU works? I thought that in this method, when you access x[m][n], first x[m] will be read. In order to read x[m], the CPU will load part of the double **x array into the cache. Then it will read the value of x[m] and retrieve the actual data, which is of course in a different part of memory, thus making the CPU reload the cache line. Is my logic somehow deficient?
If you're iterating through the array sequentially then you only need to access the row pointer x[m] once per row. You also have the option to treat the array as flattened (see above), which might be more efficient for a more random access pattern (although such an access pattern is very cache inefficient to begin with of course).
but I do not want to access x[m] and store it each time I access a row. What happens if I have the loop for (int m=0; m<nr; ++m) for (int n=0; n<nr; ++n) x[m][n] = m+n;? Then x[m] will be accessed each time x[m][n] is accessed, right?

No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:

double& operator()(int x, int y);
double const& operator()(int x, int y) const;

... and access your objects like this:

arr(2, 3) = 5;

Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.

Note: Considering your usage of C style (a hardcoded non-generic double type, plain pointers, variable declarations at the beginning of functions, and malloc), you will probably need to get more comfortable with C++ constructs before you can implement either of the options I mentioned.
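The proxy-class idea can be sketched very compactly: when the data is stored row-major in one flat buffer, operator[] only has to return a pointer to the start of row i, and the built-in pointer indexing then serves as the second [] (the class name Matrix is ours, an illustrative assumption, not Boost.MultiArray's API):

```cpp
#include <cstddef>
#include <vector>

// Flat row-major storage with arr[i][j] syntax: operator[] returns a
// raw pointer into the buffer, which acts as the row "proxy".
class Matrix {
  std::vector<double> data_;
  int cols_;
public:
  Matrix(int rows, int cols)
      : data_(static_cast<size_t>(rows) * cols, 0.0), cols_(cols) {}

  double *operator[](int i) {
    return data_.data() + static_cast<size_t>(i) * cols_;
  }
  const double *operator[](int i) const {
    return data_.data() + static_cast<size_t>(i) * cols_;
  }
};
```

Usage then looks exactly like the original double** code, arr[2][3] = 5;, but with a single contiguous allocation underneath and no row-pointer array at all.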

1 Comment

I do not see a huge problem in using malloc. I am pretty sure that STL vectors use malloc inside. But anyway, my main question was: will the first method be hugely inefficient because of cache misses when accessing x[m][n], or is it more or less the same in that sense as the second method? I am not worried about the array being non-contiguous, since I can easily make it contiguous (see for example the answer by Paul R). What I am worried about is the extra array of double* and the two redirections: first from x to x[m] and then to x[m][n].

The two methods are quite different.

  • While the first method allows easier direct access to the values, it adds another indirection (the double** array) and hence needs 1+N allocation calls, ...
  • the second method guarantees that ALL values are stored contiguously and only requires one malloc.

I would argue that the second method is always superior. Malloc is an expensive operation and contiguous memory is a huge plus, depending on the application.

In C++, you'd just implement it like this:

std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value;  // Access

and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) for it.

You are also right that the first method is more expensive when accessing the values, because you need two memory lookups, as you described: x[m] and then x[m][n]. There is no way the compiler will "optimize this away". The first array, depending on its size, may stay cached, so the performance hit need not be that bad. In the second case, you need an extra multiplication for direct access.
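The suggested operator() wrapper over a flat std::vector can be sketched as follows (the class name Matrix2D is ours; indexing matches the answer's matrix[y * numCols + x] convention):

```cpp
#include <cstddef>
#include <vector>

// Flat row-major storage accessed as m(x, y): one multiply-add plus
// one load per access, and a single contiguous allocation.
class Matrix2D {
  std::vector<double> data_;
  int cols_;
public:
  Matrix2D(int rows, int cols)
      : data_(static_cast<size_t>(rows) * cols, 0.0), cols_(cols) {}

  double &operator()(int x, int y) {
    return data_[static_cast<size_t>(y) * cols_ + x];
  }
  const double &operator()(int x, int y) const {
    return data_[static_cast<size_t>(y) * cols_ + x];
  }
};
```

With this, m(1, 2) = 7.0; reads naturally while keeping the index arithmetic in one place.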



In the first method you use, the double* entries in the master array point to logical rows (arrays of size NumCol).

So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):

foreach(row in rows):
    foreach(elem in row):
        //Do something

If you tried the same thing with the second method, with element access done the way you specified (i.e. x[NumCols * m + n]), you would still get the same benefit. This is because you treat the array as being in row-major order. If you tried the same pseudocode while accessing the elements in column-major order, I assume you would get cache misses, given that the array size is large enough.

In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).

So, in conclusion, the second method should be much better in terms of performance.
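The row-major versus column-major contrast above can be made concrete (an illustrative sketch; the function names are ours). Both functions sum the same flat row-major matrix, but only the first walks memory sequentially:

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: stride-1 access, cache-friendly.
double sum_row_major(const std::vector<double> &x, int rows, int cols)
{
  double s = 0.0;
  for (int m = 0; m < rows; ++m)
    for (int n = 0; n < cols; ++n)
      s += x[static_cast<size_t>(m) * cols + n];
  return s;
}

// Column-major traversal of the same row-major data: each inner step
// jumps `cols` doubles ahead, so large matrices thrash the cache.
double sum_col_major(const std::vector<double> &x, int rows, int cols)
{
  double s = 0.0;
  for (int n = 0; n < cols; ++n)
    for (int m = 0; m < rows; ++m)
      s += x[static_cast<size_t>(m) * cols + n];
  return s;
}
```

Both return the same result; only the memory access pattern differs.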



If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:

double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));

and then use x as a 2D array and the compiler will do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you use GCC language extensions you won't be able to port your code to another compiler.
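The compile-time-constant variant is valid ISO C++ with no extensions. A minimal sketch (NumCol and make_rows are illustrative names of ours):

```cpp
#include <cstddef>
#include <cstdlib>

constexpr int NumCol = 4;  // must be a compile-time constant in ISO C++

// Returns a pointer to an array of NumRows rows, each double[NumCol],
// in one contiguous zero-initialized block.
auto make_rows(int NumRows) -> double (*)[NumCol]
{
  return static_cast<double (*)[NumCol]>(
      calloc(static_cast<size_t>(NumRows) * NumCol, sizeof(double)));
}
```

With double (*x)[NumCol] = make_rows(1000);, writing x[m][n] lets the compiler emit the row-stride arithmetic itself, and the whole matrix is released with a single free(x).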

