Okay, I've run benchmarks on your loop. I moved it into a separate function and created 10 similar functions, each with a slight variation. Your original is alg_0a, the reference to compare against.
The others are various combinations of removing something (e.g. using a fixed number instead of value[i]). The idea is that if you remove something [that is still necessary for the real algorithm] and the removal makes a dramatic performance improvement, the removed item is a performance bottleneck [a "hot spot"]. The algorithms to look at are the alg_0* ones. Look at the loop in each one to see what got left out.
There are alg_1* ones where I tried to get better cache performance/locality on the index and value arrays by combining them into a struct. This really had little to no effect because index/value are accessed sequentially, so they have good cache performance [even when separate] and the struct couldn't improve on this. So, you can skip the alg_1* data.
The worst performer [because it's doing the most work] is your original alg_0a. Considerably better is alg_0e, which still fetches from the index and value arrays but stores sequentially into the beta array. Better still is alg_0d, which does what alg_0e does except that it stores into beta at a fixed index.
The only thing that really mattered was the access pattern when writing to beta. Because the index array contains random indices into beta, the cache behavior of the beta accesses is poor.
The actual test data in the index array [being random] skews [and/or invalidates] the results. In the real program, if the array truly has semi-random indices in it, that is the "hot spot".
Is there a better model for generating the index array, one that is more representative and perhaps more cache friendly? That is, maybe the real access pattern is more like 1,2,3,4 99,100,101,102 6000,6001,6002 ...
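For example, here is a rough sketch [purely illustrative; clustered_indexes, runlen, and the run layout are my inventions, and unlike the datagen_* functions below it doesn't bother avoiding duplicate indices] of how such a clustered index array could be generated for the benchmark:

#include <stdlib.h>                     // drand48

// clustered_indexes -- hypothetical generator: fill index[] with short
// ascending runs that start at random positions, e.g. 1,2,3,4 99,100,101,...
void
clustered_indexes(int *index,int len,int betasize,int runlen)
{
    int i = 0;

    while (i < len) {
        // pick a random start that leaves room for a whole run
        int start = (int) (drand48() * (betasize - runlen));

        // emit one ascending run (truncated if we reach len)
        for (int j = 0; j < runlen && i < len; ++j, ++i)
            index[i] = start + j;
    }
}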
Without knowing how value and index are created and when, it's difficult to speculate. Are value and index always created at the same time?
In a sense, looping through index/value forms a "schedule" of updates to be applied to beta. If this schedule could be sorted by the beta index, this might provide more sequential beta access.
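A minimal sketch of that idea [apply_sorted, sortpair_t, and cmp_by_index are names I'm making up here, not part of your code]: copy the (index, value) pairs, sort them by beta index, then apply them in ascending order so beta is walked mostly front-to-back. Since the updates are simple additions, reordering them doesn't change the result:

#include <stdlib.h>

// sortpair_t -- one entry of the update "schedule"
typedef struct {
    int index;
    double value;
} sortpair_t;

// cmp_by_index -- qsort comparator: ascending beta index
static int
cmp_by_index(const void *a,const void *b)
{
    const sortpair_t *pa = a;
    const sortpair_t *pb = b;

    return (pa->index > pb->index) - (pa->index < pb->index);
}

// apply_sorted -- apply the schedule in ascending beta order
void
apply_sorted(double *beta,const int *index,const double *value,int len,double d)
{
    sortpair_t *sched = malloc(len * sizeof(sortpair_t));

    for (int i = 0; i < len; ++i) {
        sched[i].index = index[i];
        sched[i].value = value[i];
    }

    qsort(sched,len,sizeof(sortpair_t),cmp_by_index);

    for (int i = 0; i < len; ++i)
        beta[sched[i].index] += d * sched[i].value;

    free(sched);
}

The qsort itself is O(len log len), so this only pays off if the schedule is applied many times, or if it can be built in sorted order in the first place.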
The beta array is larger than index/value. Would a different implementation for beta work better (e.g. linked list or sparse matrix)? That is, we want to make beta access as cache friendly as we can.
Ordinarily, I might suggest using OpenMP to parallelize this. But, from the tests I've run, that wouldn't help here, because of the random access pattern and because the tests show the loop is memory/cache bound rather than CPU/computation bound.
Now that I think of it, this looks like a scatter/gather or Monte Carlo style algorithm.
It might be beneficial to "invert" the problem. Here's a snippet:
for (betaidx = 0; betaidx < betasize; ++betaidx) {
    if (beta_needs_change(betaidx))
        beta[betaidx] = beta_new_value(betaidx);
}
Note that beta_needs_change would need to be fast [probably inline]. Most of the time it would return false. It might seem counterintuitive that this [much larger] loop could be faster than the existing index/value loop, but it might be worth trying to see.
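As a rough sketch of one possible shape [touched and delta are arrays I'm inventing for illustration; they would have to be filled in when index/value are built, which is itself extra work and an extra betasize-sized array]:

#include <stdint.h>

// apply_inverted -- walk beta sequentially; touched[] is a bit vector with
// one bit per beta slot, delta[] holds the pending value for touched slots
void
apply_inverted(double *beta,long betasize,const uint8_t *touched,
    const double *delta,double d)
{
    for (long i = 0; i < betasize; ++i) {
        if (touched[i >> 3] & (1u << (i & 7)))      // cheap inline "needs change?"
            beta[i] += d * delta[i];
    }
}

Most iterations fail the test and do nothing, but both beta and the bit vector are read sequentially, so the hardware prefetcher gets a chance to help. Whether that beats the short random-access loop is exactly the kind of thing the benchmark harness below could answer.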
My best guess is the "right" sparse matrix implementation might help.
Or, depending upon how beta is used after the loop, it might be possible to index everything by the same index variable:
for (idx = 0; idx < len; ++idx)
    beta[idx] += d * value[idx];
Now, beta would have to be read/utilized in a completely different way, and value would probably have to be built up differently. But, recoding to do this might be worth it if you can reduce 40% of the program's CPU usage by a factor of 6x [roughly Amdahl's law: 0.60 + 0.40/6 ≈ 0.67 of the original run time, or about a 1.5x overall speedup]. As a positive side benefit, the rest of the program might benefit from the reorg as well.
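For instance [beta_read, beta_compact, and slot_of are all hypothetical names; slot_of would be a betasize-sized lookup table built once from index[]], the rest of the program might read beta through a small helper like:

// beta_read -- hedged sketch of the read side if beta is stored compactly:
// slot_of[betaidx] gives the position in beta_compact, or -1 if that beta
// entry was never assigned (i.e. it is still zero)
static inline double
beta_read(const double *beta_compact,const int *slot_of,int betaidx)
{
    int slot = slot_of[betaidx];

    return (slot >= 0) ? beta_compact[slot] : 0.0;
}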
Here is the [heavily] modified source code. It isn't buildable because it relies on my special tools and libraries. Ignore most of it and look at the loops inside the alg_0* functions.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define _BETALOOP_GLO_
#include <ovrlib/ovrlib.h>
#include <betaloop/bncdef.h>

#define FREEME(_ptr) \
    do { \
        if (_ptr != NULL) \
            free(_ptr); \
        _ptr = NULL; \
    } while (0)

typedef int betaidx_t;
typedef int validx_t;

typedef struct {
    double value;
    betaidx_t index;
    double *beta;
} pair_t;

typedef struct {
    betaidx_t betasize;                 // range value for beta
    validx_t vlen;                      // length of value array
    double *beta;
    double *value;
    betaidx_t *index;
    pair_t *pair;
} ctl_t;
// datagen_0a -- randomly pick len entries
void
datagen_0a(ctl_t *ctl,int betasize,int len)
{
    double *beta;
    double *value;
    betaidx_t *index;
    validx_t i;
    betaidx_t ind;

    memset(ctl,0,sizeof(ctl_t));
    ctl->betasize = betasize;
    ctl->vlen = len;

    beta = calloc(betasize,sizeof(double));
    ctl->beta = beta;

    value = malloc(len * sizeof(double));
    ctl->value = value;

    index = malloc(len * sizeof(int));
    ctl->index = index;

    BNCBEG(datagen_0a);
    for (i = 0; i < len; ++i) {
        while (1) {
            ind = floor(drand48() * betasize);
            ind %= betasize;
            if (beta[ind] == 0) {
                value[i] = drand48();
                index[i] = ind;
                beta[ind] = 1;
                break;
            }
        }
    }
    BNCEND(datagen_0a);
}
// datagen_0b -- randomly pick len entries
void
datagen_0b(ctl_t *ctl,betaidx_t betasize,int len)
{
    double *beta;
    double *value;
    double curval;
    betaidx_t *index;
    byte *btv;
    pair_t *pair;
    validx_t validx;
    betaidx_t betaidx;

    memset(ctl,0,sizeof(ctl_t));
    ctl->betasize = betasize;
    ctl->vlen = len;

    beta = calloc(betasize,sizeof(double));
    ctl->beta = beta;

    value = malloc(len * sizeof(double));
    ctl->value = value;

    index = malloc(len * sizeof(int));
    ctl->index = index;

    pair = malloc(len * sizeof(pair_t));
    ctl->pair = pair;

    btv = calloc(BTVSIZE(betasize),sizeof(byte));

    BNCBEG(datagen_0b);
    for (validx = 0; validx < len; ++validx) {
        while (1) {
            betaidx = floor(drand48() * betasize);
            betaidx %= betasize;
            if (! BTVTST(btv,betaidx)) {
                BTVSET(btv,betaidx);
                curval = drand48();
                value[validx] = drand48();
                index[validx] = betaidx;
                if (pair != NULL) {
                    pair[validx].value = curval;
                    pair[validx].index = betaidx;
                    pair[validx].beta = &beta[betaidx];
                }
                beta[betaidx] = 1;
                break;
            }
        }
    }
    BNCEND(datagen_0b);

    free(btv);
}
// datarls_0 -- release allocated memory
void
datarls_0(ctl_t *ctl)
{
    FREEME(ctl->beta);
    FREEME(ctl->value);
    FREEME(ctl->index);
    FREEME(ctl->pair);
}

// fixed_index -- get fixed beta index
betaidx_t
fixed_index(ctl_t *ctl)
{
    betaidx_t index;

    while (1) {
        index = floor(drand48() * ctl->betasize);
        index %= ctl->betasize;
        if ((index | 1) < ctl->betasize)
            break;
    }

    return index;
}
// alg_0a -- Now the loop to be optimized
void
alg_0a(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    BNCBEG(alg_0a);
    beta = ctl->beta;
    value = ctl->value;
    index = ctl->index;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx)
        beta[index[validx]] += d * value[validx];
    BNCEND(alg_0a);
}

// alg_0b -- null destination
double
alg_0b(ctl_t *ctl)
{
    double beta;
    double *value;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    BNCBEG(alg_0b -- betanull);
    beta = 0.0;
    value = ctl->value;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx)
        beta += d * value[validx];
    BNCEND(alg_0b);

    return beta;
}

// alg_0c -- fixed destination
void
alg_0c(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    index = fixed_index(ctl);

    BNCBEG(alg_0c -- betafixed);
    beta = ctl->beta;
    value = ctl->value;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx, index ^= 1)
        beta[index] += d * value[validx];
    BNCEND(alg_0c);
}

// alg_0d -- fixed destination with index array fetch
betaidx_t
alg_0d(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *idxptr;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;
    betaidx_t totidx;

    index = fixed_index(ctl);

    BNCBEG(alg_0d -- beta_fixed_index);
    beta = ctl->beta;
    value = ctl->value;
    idxptr = ctl->index;
    len = ctl->vlen;
    totidx = 0;

    for (validx = 0; validx < len; ++validx, index ^= 1) {
        totidx += idxptr[validx];
        beta[index] += d * value[validx];
    }
    BNCEND(alg_0d);

    return totidx;
}

// alg_0e -- sequential destination with index array fetch
betaidx_t
alg_0e(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *idxptr;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;
    betaidx_t totidx;

    BNCBEG(alg_0e -- beta_seq_index);
    index = 0;
    beta = ctl->beta;
    value = ctl->value;
    idxptr = ctl->index;
    len = ctl->vlen;
    totidx = 0;

    for (validx = 0; validx < len; ++validx) {
        totidx += idxptr[validx];
        beta[index] += d * value[validx];
        index = (index + 1) % ctl->betasize;
    }
    BNCEND(alg_0e);

    return totidx;
}

// alg_0f -- null source
void
alg_0f(ctl_t *ctl)
{
    double *beta;
    double value;
    betaidx_t *index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    value = drand48();

    BNCBEG(alg_0f -- nullsrc);
    beta = ctl->beta;
    index = ctl->index;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx)
        beta[index[validx]] += d * value;
    BNCEND(alg_0f);
}
// alg_1a -- use pair struct with index
void
alg_1a(ctl_t *ctl)
{
    double *beta;
    validx_t validx;
    validx_t len;
    const pair_t *pair;
    const double d = 2.5;

    BNCBEG(alg_1a -- pair);
    beta = ctl->beta;
    len = ctl->vlen;
    pair = ctl->pair;

    for (validx = 0; validx < len; ++validx, ++pair)
        beta[pair->index] += d * pair->value;
    BNCEND(alg_1a);
}

// alg_1b -- use pair struct with epair
void
alg_1b(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;

    BNCBEG(alg_1b -- epair);
    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;

    for (; pair < epair; ++pair)
        beta[pair->index] += d * pair->value;
    BNCEND(alg_1b);
}

// alg_1c -- use pair struct, epair, and beta pointer
void
alg_1c(ctl_t *ctl)
{
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;

    BNCBEG(alg_1c -- betap);
    pair = ctl->pair;
    epair = pair + ctl->vlen;

    for (; pair < epair; ++pair)
        *pair->beta += d * pair->value;
    BNCEND(alg_1c);
}

// alg_1d -- fixed destination with index array fetch
betaidx_t
alg_1d(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;
    betaidx_t index;
    betaidx_t totidx;

    index = fixed_index(ctl);

    BNCBEG(alg_1d -- beta_fixed_index);
    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;
    totidx = 0;

    for (; pair < epair; ++pair, index ^= 1) {
        totidx += pair->index;
        beta[index] += d * pair->value;
    }
    BNCEND(alg_1d);

    return totidx;
}

// alg_1e -- sequential destination with index array fetch
betaidx_t
alg_1e(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;
    betaidx_t index;
    betaidx_t totidx;

    BNCBEG(alg_1e -- beta_seq_index);
    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;
    totidx = 0;
    index = 0;

    for (; pair < epair; ++pair) {
        totidx += pair->index;
        beta[index] += d * pair->value;
        index = (index + 1) % ctl->betasize;
    }
    BNCEND(alg_1e);

    return totidx;
}
// dotest -- do test
void
dotest(int betasize,int len)
{
    ctl_t ctl;
    int tryidx;

    printf("\n");
    printf("dotest: %d %d\n",betasize,len);

    BNCBEG(dotest);

#if 0
    datagen_0a(&ctl,betasize,len);
#endif
#if 1
    datagen_0b(&ctl,betasize,len);
#endif

    for (tryidx = 1; tryidx <= 3; ++tryidx) {
        alg_0a(&ctl);
        alg_0b(&ctl);
        alg_0c(&ctl);
        alg_0d(&ctl);
        alg_0e(&ctl);
        alg_0f(&ctl);
        alg_1a(&ctl);
        alg_1b(&ctl);
        alg_1c(&ctl);
        alg_1d(&ctl);
        alg_1e(&ctl);
    }

    datarls_0(&ctl);

    BNCEND(dotest);
    bncdmpa("dotest",1);
}

// main -- main program
int
main(int argc,char **argv)
{
    --argc;
    ++argv;

    bncatt(betaloop_bnc);

    dotest(100000000,1000000);
    dotest(500000000,5000000);
    dotest(1000000000,10000000);

    return 0;
}
Here is the benchmark output. The min timing (in fractional seconds) is the number to compare.
17:39:35.550606012 NEWDAY 11/18/15
17:39:35.550606012 ph: starting 13162 ...
17:39:35.551221132 ph: ARGV ovrgo ...
17:39:36 ovrgo: SDIR /home/cae/preserve/ovrbnc/betaloop
17:39:37 ovrgo: /home/cae/preserve/ovrstk/gen/betaloop/betaloop
bnctst: BEST bncmin=21 bncmax=-1 skipcnt=1000
bnctst: AVG tot=0.000000000
bnctst: SKP tot=0.000023607 avg=0.000000023 cnt=1,000
bnctst: BNCDMP min=0.000000000 max=0.000000000
dotest: 100000000 1000000
dotest: BNCDMP alg_0a tot=0.087135098 avg=0.029045032 cnt=3
dotest: BNCDMP min=0.028753797 max=0.029378731
dotest: BNCDMP alg_0b -- betanull tot=0.003669541 avg=0.001223180 cnt=3
dotest: BNCDMP min=0.001210105 max=0.001242112
dotest: BNCDMP alg_0c -- betafixed tot=0.005472318 avg=0.001824106 cnt=3
dotest: BNCDMP min=0.001815115 max=0.001830939
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.005654055 avg=0.001884685 cnt=3
dotest: BNCDMP min=0.001883760 max=0.001885919
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.025247095 avg=0.008415698 cnt=3
dotest: BNCDMP min=0.008410631 max=0.008423921
dotest: BNCDMP alg_0f -- nullsrc tot=0.085769224 avg=0.028589741 cnt=3
dotest: BNCDMP min=0.028477846 max=0.028683057
dotest: BNCDMP alg_1a -- pair tot=0.090740003 avg=0.030246667 cnt=3
dotest: BNCDMP min=0.030003776 max=0.030385588
dotest: BNCDMP alg_1b -- epair tot=0.093591309 avg=0.031197103 cnt=3
dotest: BNCDMP min=0.030324733 max=0.032524565
dotest: BNCDMP alg_1c -- betap tot=0.091931228 avg=0.030643742 cnt=3
dotest: BNCDMP min=0.030357306 max=0.031191412
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.007939126 avg=0.002646375 cnt=3
dotest: BNCDMP min=0.002508210 max=0.002853244
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.025939159 avg=0.008646386 cnt=3
dotest: BNCDMP min=0.008606238 max=0.008683529
dotest: BNCDMP datagen_0b tot=0.136931619
dotest: BNCDMP dotest tot=0.956365745
dotest: 500000000 5000000
dotest: BNCDMP alg_0a tot=0.737332506 avg=0.245777502 cnt=3
dotest: BNCDMP min=0.244778177 max=0.247548555
dotest: BNCDMP alg_0b -- betanull tot=0.018095312 avg=0.006031770 cnt=3
dotest: BNCDMP min=0.005912708 max=0.006225743
dotest: BNCDMP alg_0c -- betafixed tot=0.028059365 avg=0.009353121 cnt=3
dotest: BNCDMP min=0.009303443 max=0.009407530
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.029024875 avg=0.009674958 cnt=3
dotest: BNCDMP min=0.009550901 max=0.009752188
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.127149609 avg=0.042383203 cnt=3
dotest: BNCDMP min=0.042218860 max=0.042529218
dotest: BNCDMP alg_0f -- nullsrc tot=0.724878907 avg=0.241626302 cnt=3
dotest: BNCDMP min=0.240794352 max=0.242174302
dotest: BNCDMP alg_1a -- pair tot=0.764044535 avg=0.254681511 cnt=3
dotest: BNCDMP min=0.253329522 max=0.256864373
dotest: BNCDMP alg_1b -- epair tot=0.769463084 avg=0.256487694 cnt=3
dotest: BNCDMP min=0.254830714 max=0.258763409
dotest: BNCDMP alg_1c -- betap tot=0.765345462 avg=0.255115154 cnt=3
dotest: BNCDMP min=0.254364352 max=0.256134647
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.039104441 avg=0.013034813 cnt=3
dotest: BNCDMP min=0.012103513 max=0.014354033
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.130221038 avg=0.043407012 cnt=3
dotest: BNCDMP min=0.043143231 max=0.043752516
dotest: BNCDMP datagen_0b tot=2.060880641
dotest: BNCDMP dotest tot=6.611719277
dotest: 1000000000 10000000
dotest: BNCDMP alg_0a tot=1.726930574 avg=0.575643524 cnt=3
dotest: BNCDMP min=0.575218786 max=0.576291884
dotest: BNCDMP alg_0b -- betanull tot=0.035615393 avg=0.011871797 cnt=3
dotest: BNCDMP min=0.011820026 max=0.011948646
dotest: BNCDMP alg_0c -- betafixed tot=0.056452922 avg=0.018817640 cnt=3
dotest: BNCDMP min=0.018590739 max=0.019195537
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.057788343 avg=0.019262781 cnt=3
dotest: BNCDMP min=0.019061426 max=0.019560949
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.253575597 avg=0.084525199 cnt=3
dotest: BNCDMP min=0.084169403 max=0.084902168
dotest: BNCDMP alg_0f -- nullsrc tot=1.718326633 avg=0.572775544 cnt=3
dotest: BNCDMP min=0.571082648 max=0.575134694
dotest: BNCDMP alg_1a -- pair tot=1.792905583 avg=0.597635194 cnt=3
dotest: BNCDMP min=0.590378177 max=0.603253947
dotest: BNCDMP alg_1b -- epair tot=1.797667694 avg=0.599222564 cnt=3
dotest: BNCDMP min=0.589916620 max=0.609757778
dotest: BNCDMP alg_1c -- betap tot=1.794606586 avg=0.598202195 cnt=3
dotest: BNCDMP min=0.593164605 max=0.604739478
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.073755595 avg=0.024585198 cnt=3
dotest: BNCDMP min=0.024126694 max=0.025124542
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.261664945 avg=0.087221648 cnt=3
dotest: BNCDMP min=0.086277263 max=0.087966703
dotest: BNCDMP datagen_0b tot=4.160519571
dotest: BNCDMP dotest tot=14.607990774
17:39:59.970197677 ph: complete (ELAPSED: 00:00:24.418215274)
beta is a sparse array that has (betasize - len) (~95%-99%) zeros and len non-zero elements, which are randomly chosen and will eventually equal 1 + 2.5*rand48()? Can anything be done about using a sparse matrix, or is this just a subset of the problem, where you need the rest of beta for other calculations?