Okay, I've run benchmarks on your loop. I moved it into a separate function and created 10 similar functions, each with a slight variation. Your original is alg_0a, the reference to compare against.
The others are various combinations of removing something (e.g. using a fixed number instead of value[i]). The idea is that if you remove something [that is still necessary for the real algorithm] and the removal makes a dramatic performance improvement, the removed item is a performance bottleneck [a "hot spot"]. The algorithms to look at are the alg_0* ones. Look at the loop in each one to see what got left out.
There are alg_1* ones where I tried to get better cache performance/locality on the index and value arrays by combining them into a struct. This really had little to no effect because index/value are accessed sequentially, so they have good cache performance [even when separate] and the struct couldn't improve on this. So, you can skip the alg_1* data.
The worst performer [because it's doing the most work] is your original alg_0a. Considerably better is alg_0e, which still fetches from the index and value arrays but stores sequentially into the beta array. Better still is alg_0d, which does what alg_0e does except that it stores into beta at a fixed index.
The only thing that really mattered was the access pattern when writing to beta. Because the index array contains random indices into beta, the cache behavior of the beta accesses is poor.
The actual test data in the index array [being random] skews [and/or invalidates] the results. In the real program, if the array truly has semi-random indices in it, that is the "hot spot".
Is there a better model for generating the index array, one that is more representative and perhaps more cache friendly? That is, maybe the real access pattern is more like 1,2,3,4 99,100,101,102 6000,6001,6002 ...
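For example, here is a rough sketch [purely illustrative; clustered_indexes, runlen, and the run layout are my inventions, and unlike the datagen_* functions below it doesn't bother avoiding duplicate indices] of how such a clustered index array could be generated for the benchmark:

#include <stdlib.h>                     // drand48

// clustered_indexes -- hypothetical generator: fill index[] with short
// ascending runs that start at random positions, e.g. 1,2,3,4 99,100,101,...
void
clustered_indexes(int *index,int len,int betasize,int runlen)
{
    int i = 0;

    while (i < len) {
        // pick a random start that leaves room for a whole run
        int start = (int) (drand48() * (betasize - runlen));

        // emit one ascending run (truncated if we reach len)
        for (int j = 0; j < runlen && i < len; ++j, ++i)
            index[i] = start + j;
    }
}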
Without knowing how value and index are created and when, it's difficult to speculate. Are value and index always created at the same time?
In a sense, looping through index/value forms a "schedule" of updates to be applied to beta. If this schedule could be sorted by the beta index, this might provide more sequential beta access.
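A minimal sketch of that idea [apply_sorted, sortpair_t, and cmp_by_index are names I'm making up here, not part of your code]: copy the (index, value) pairs, sort them by beta index, then apply them in ascending order so beta is walked mostly front-to-back. Since the updates are simple additions, reordering them doesn't change the result:

#include <stdlib.h>

// sortpair_t -- one entry of the update "schedule"
typedef struct {
    int index;
    double value;
} sortpair_t;

// cmp_by_index -- qsort comparator: ascending beta index
static int
cmp_by_index(const void *a,const void *b)
{
    const sortpair_t *pa = a;
    const sortpair_t *pb = b;

    return (pa->index > pb->index) - (pa->index < pb->index);
}

// apply_sorted -- apply the schedule in ascending beta order
void
apply_sorted(double *beta,const int *index,const double *value,int len,double d)
{
    sortpair_t *sched = malloc(len * sizeof(sortpair_t));

    for (int i = 0; i < len; ++i) {
        sched[i].index = index[i];
        sched[i].value = value[i];
    }

    qsort(sched,len,sizeof(sortpair_t),cmp_by_index);

    for (int i = 0; i < len; ++i)
        beta[sched[i].index] += d * sched[i].value;

    free(sched);
}

The qsort itself is O(len log len), so this only pays off if the schedule is applied many times, or if it can be built in sorted order in the first place.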
The beta array is larger than index/value. Would a different implementation for beta work better (e.g. linked list or sparse matrix)? That is, we want to make beta access as cache friendly as we can.
Ordinarily, I might suggest using OpenMP to parallelize this. But, from the tests I've run, that wouldn't help here, because of the random access pattern and because the tests show the loop is memory/cache bound rather than CPU/computation bound.
Now that I think of it, this looks like a scatter/gather or Monte Carlo style algorithm.
It might be beneficial to "invert" the problem. Here's a snippet:
for (betaidx = 0; betaidx < betasize; ++betaidx) {
    if (beta_needs_change(betaidx))
        beta[betaidx] = beta_new_value(betaidx);
}
Note that beta_needs_change would need to be fast [probably inline]. Most of the time it would return false. It might seem counterintuitive that this [much larger] loop could be faster than the existing index/value loop, but it might be worth trying to see.
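As a rough sketch of one possible shape [touched and delta are arrays I'm inventing for illustration; they would have to be filled in when index/value are built, which is itself extra work and an extra betasize-sized array]:

#include <stdint.h>

// apply_inverted -- walk beta sequentially; touched[] is a bit vector with
// one bit per beta slot, delta[] holds the pending value for touched slots
void
apply_inverted(double *beta,long betasize,const uint8_t *touched,
    const double *delta,double d)
{
    for (long i = 0; i < betasize; ++i) {
        if (touched[i >> 3] & (1u << (i & 7)))      // cheap inline "needs change?"
            beta[i] += d * delta[i];
    }
}

Most iterations fail the test and do nothing, but both beta and the bit vector are read sequentially, so the hardware prefetcher gets a chance to help. Whether that beats the short random-access loop is exactly the kind of thing the benchmark harness below could answer.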
My best guess is the "right" sparse matrix implementation might help.
Or, depending upon how beta is used after the loop, it might be possible to index everything by the same index variable:
for (idx = 0; idx < len; ++idx)
    beta[idx] += d * value[idx];
Now, beta would have to be read/utilized in a completely different way, and value would probably have to be built up differently. But, recoding to do this might be worth it if you can reduce 40% of the program's CPU usage by a factor of 6x [roughly Amdahl's law: 0.60 + 0.40/6 ≈ 0.67 of the original run time, or about a 1.5x overall speedup]. As a positive side benefit, the rest of the program might benefit from the reorg as well.
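For instance [beta_read, beta_compact, and slot_of are all hypothetical names; slot_of would be a betasize-sized lookup table built once from index[]], the rest of the program might read beta through a small helper like:

// beta_read -- hedged sketch of the read side if beta is stored compactly:
// slot_of[betaidx] gives the position in beta_compact, or -1 if that beta
// entry was never assigned (i.e. it is still zero)
static inline double
beta_read(const double *beta_compact,const int *slot_of,int betaidx)
{
    int slot = slot_of[betaidx];

    return (slot >= 0) ? beta_compact[slot] : 0.0;
}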
Here is the [heavily] modified source code. It isn't buildable because it relies on my special tools and libraries. Ignore most of it and look at the loops inside the alg_0* functions.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define _BETALOOP_GLO_
#include <ovrlib/ovrlib.h>
#include <betaloop/bncdef.h>

#define FREEME(_ptr) \
    do { \
        if (_ptr != NULL) \
            free(_ptr); \
        _ptr = NULL; \
    } while (0)

typedef int betaidx_t;
typedef int validx_t;

typedef struct {
    double value;
    betaidx_t index;
    double *beta;
} pair_t;

typedef struct {
    betaidx_t betasize;                 // range value for beta
    validx_t vlen;                      // length of value array
    double *beta;
    double *value;
    betaidx_t *index;
    pair_t *pair;
} ctl_t;
// datagen_0a -- randomly pick len entries
void
datagen_0a(ctl_t *ctl,int betasize,int len)
{
    double *beta;
    double *value;
    betaidx_t *index;
    validx_t i;
    betaidx_t ind;

    memset(ctl,0,sizeof(ctl_t));
    ctl->betasize = betasize;
    ctl->vlen = len;

    beta = calloc(betasize,sizeof(double));
    ctl->beta = beta;

    value = malloc(len * sizeof(double));
    ctl->value = value;

    index = malloc(len * sizeof(int));
    ctl->index = index;

    BNCBEG(datagen_0a);
    for (i = 0; i < len; ++i) {
        while (1) {
            ind = floor(drand48() * betasize);
            ind %= betasize;
            if (beta[ind] == 0) {
                value[i] = drand48();
                index[i] = ind;
                beta[ind] = 1;
                break;
            }
        }
    }
    BNCEND(datagen_0a);
}
// datagen_0b -- randomly pick len entries
void
datagen_0b(ctl_t *ctl,betaidx_t betasize,int len)
{
    double *beta;
    double *value;
    double curval;
    betaidx_t *index;
    byte *btv;
    pair_t *pair;
    validx_t validx;
    betaidx_t betaidx;

    memset(ctl,0,sizeof(ctl_t));
    ctl->betasize = betasize;
    ctl->vlen = len;

    beta = calloc(betasize,sizeof(double));
    ctl->beta = beta;

    value = malloc(len * sizeof(double));
    ctl->value = value;

    index = malloc(len * sizeof(int));
    ctl->index = index;

    pair = malloc(len * sizeof(pair_t));
    ctl->pair = pair;

    btv = calloc(BTVSIZE(betasize),sizeof(byte));

    BNCBEG(datagen_0b);
    for (validx = 0; validx < len; ++validx) {
        while (1) {
            betaidx = floor(drand48() * betasize);
            betaidx %= betasize;
            if (! BTVTST(btv,betaidx)) {
                BTVSET(btv,betaidx);
                curval = drand48();
                value[validx] = drand48();
                index[validx] = betaidx;
                if (pair != NULL) {
                    pair[validx].value = curval;
                    pair[validx].index = betaidx;
                    pair[validx].beta = &beta[betaidx];
                }
                beta[betaidx] = 1;
                break;
            }
        }
    }
    BNCEND(datagen_0b);

    free(btv);
}
// datarls_0 -- release allocated memory
void
datarls_0(ctl_t *ctl)
{
    FREEME(ctl->beta);
    FREEME(ctl->value);
    FREEME(ctl->index);
    FREEME(ctl->pair);
}

// fixed_index -- get fixed beta index
betaidx_t
fixed_index(ctl_t *ctl)
{
    betaidx_t index;

    while (1) {
        index = floor(drand48() * ctl->betasize);
        index %= ctl->betasize;
        if ((index | 1) < ctl->betasize)
            break;
    }

    return index;
}
// alg_0a -- Now the loop to be optimized
void
alg_0a(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    BNCBEG(alg_0a);
    beta = ctl->beta;
    value = ctl->value;
    index = ctl->index;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx)
        beta[index[validx]] += d * value[validx];
    BNCEND(alg_0a);
}

// alg_0b -- null destination
double
alg_0b(ctl_t *ctl)
{
    double beta;
    double *value;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    BNCBEG(alg_0b -- betanull);
    beta = 0.0;
    value = ctl->value;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx)
        beta += d * value[validx];
    BNCEND(alg_0b);

    return beta;
}

// alg_0c -- fixed destination
void
alg_0c(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    index = fixed_index(ctl);

    BNCBEG(alg_0c -- betafixed);
    beta = ctl->beta;
    value = ctl->value;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx, index ^= 1)
        beta[index] += d * value[validx];
    BNCEND(alg_0c);
}

// alg_0d -- fixed destination with index array fetch
betaidx_t
alg_0d(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *idxptr;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;
    betaidx_t totidx;

    index = fixed_index(ctl);

    BNCBEG(alg_0d -- beta_fixed_index);
    beta = ctl->beta;
    value = ctl->value;
    idxptr = ctl->index;
    len = ctl->vlen;
    totidx = 0;

    for (validx = 0; validx < len; ++validx, index ^= 1) {
        totidx += idxptr[validx];
        beta[index] += d * value[validx];
    }
    BNCEND(alg_0d);

    return totidx;
}

// alg_0e -- sequential destination with index array fetch
betaidx_t
alg_0e(ctl_t *ctl)
{
    double *beta;
    double *value;
    betaidx_t *idxptr;
    betaidx_t index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;
    betaidx_t totidx;

    BNCBEG(alg_0e -- beta_seq_index);
    index = 0;
    beta = ctl->beta;
    value = ctl->value;
    idxptr = ctl->index;
    len = ctl->vlen;
    totidx = 0;

    for (validx = 0; validx < len; ++validx) {
        totidx += idxptr[validx];
        beta[index] += d * value[validx];
        index = (index + 1) % ctl->betasize;
    }
    BNCEND(alg_0e);

    return totidx;
}

// alg_0f -- null source
void
alg_0f(ctl_t *ctl)
{
    double *beta;
    double value;
    betaidx_t *index;
    validx_t validx;
    validx_t len;
    const double d = 2.5;

    value = drand48();

    BNCBEG(alg_0f -- nullsrc);
    beta = ctl->beta;
    index = ctl->index;
    len = ctl->vlen;

    for (validx = 0; validx < len; ++validx)
        beta[index[validx]] += d * value;
    BNCEND(alg_0f);
}
// alg_1a -- use pair struct with index
void
alg_1a(ctl_t *ctl)
{
    double *beta;
    validx_t validx;
    validx_t len;
    const pair_t *pair;
    const double d = 2.5;

    BNCBEG(alg_1a -- pair);
    beta = ctl->beta;
    len = ctl->vlen;
    pair = ctl->pair;

    for (validx = 0; validx < len; ++validx, ++pair)
        beta[pair->index] += d * pair->value;
    BNCEND(alg_1a);
}

// alg_1b -- use pair struct with epair
void
alg_1b(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;

    BNCBEG(alg_1b -- epair);
    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;

    for (; pair < epair; ++pair)
        beta[pair->index] += d * pair->value;
    BNCEND(alg_1b);
}

// alg_1c -- use pair struct, epair, and beta pointer
void
alg_1c(ctl_t *ctl)
{
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;

    BNCBEG(alg_1c -- betap);
    pair = ctl->pair;
    epair = pair + ctl->vlen;

    for (; pair < epair; ++pair)
        *pair->beta += d * pair->value;
    BNCEND(alg_1c);
}

// alg_1d -- fixed destination with index array fetch
betaidx_t
alg_1d(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;
    betaidx_t index;
    betaidx_t totidx;

    index = fixed_index(ctl);

    BNCBEG(alg_1d -- beta_fixed_index);
    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;
    totidx = 0;

    for (; pair < epair; ++pair, index ^= 1) {
        totidx += pair->index;
        beta[index] += d * pair->value;
    }
    BNCEND(alg_1d);

    return totidx;
}

// alg_1e -- sequential destination with index array fetch
betaidx_t
alg_1e(ctl_t *ctl)
{
    double *beta;
    const pair_t *pair;
    const pair_t *epair;
    const double d = 2.5;
    betaidx_t index;
    betaidx_t totidx;

    BNCBEG(alg_1e -- beta_seq_index);
    beta = ctl->beta;
    pair = ctl->pair;
    epair = pair + ctl->vlen;
    totidx = 0;
    index = 0;

    for (; pair < epair; ++pair) {
        totidx += pair->index;
        beta[index] += d * pair->value;
        index = (index + 1) % ctl->betasize;
    }
    BNCEND(alg_1e);

    return totidx;
}
// dotest -- do test
void
dotest(int betasize,int len)
{
    ctl_t ctl;
    int tryidx;

    printf("\n");
    printf("dotest: %d %d\n",betasize,len);

    BNCBEG(dotest);

#if 0
    datagen_0a(&ctl,betasize,len);
#endif
#if 1
    datagen_0b(&ctl,betasize,len);
#endif

    for (tryidx = 1; tryidx <= 3; ++tryidx) {
        alg_0a(&ctl);
        alg_0b(&ctl);
        alg_0c(&ctl);
        alg_0d(&ctl);
        alg_0e(&ctl);
        alg_0f(&ctl);
        alg_1a(&ctl);
        alg_1b(&ctl);
        alg_1c(&ctl);
        alg_1d(&ctl);
        alg_1e(&ctl);
    }

    datarls_0(&ctl);

    BNCEND(dotest);
    bncdmpa("dotest",1);
}

// main -- main program
int
main(int argc,char **argv)
{
    --argc;
    ++argv;

    bncatt(betaloop_bnc);

    dotest(100000000,1000000);
    dotest(500000000,5000000);
    dotest(1000000000,10000000);

    return 0;
}
Here is the benchmark output. The min timing (in fractional seconds) is the number to compare.
17:39:35.550606012 NEWDAY 11/18/15
17:39:35.550606012 ph: starting 13162 ...
17:39:35.551221132 ph: ARGV ovrgo ...
17:39:36 ovrgo: SDIR /home/cae/preserve/ovrbnc/betaloop
17:39:37 ovrgo: /home/cae/preserve/ovrstk/gen/betaloop/betaloop
bnctst: BEST bncmin=21 bncmax=-1 skipcnt=1000
bnctst: AVG tot=0.000000000
bnctst: SKP tot=0.000023607 avg=0.000000023 cnt=1,000
bnctst: BNCDMP min=0.000000000 max=0.000000000
dotest: 100000000 1000000
dotest: BNCDMP alg_0a tot=0.087135098 avg=0.029045032 cnt=3
dotest: BNCDMP min=0.028753797 max=0.029378731
dotest: BNCDMP alg_0b -- betanull tot=0.003669541 avg=0.001223180 cnt=3
dotest: BNCDMP min=0.001210105 max=0.001242112
dotest: BNCDMP alg_0c -- betafixed tot=0.005472318 avg=0.001824106 cnt=3
dotest: BNCDMP min=0.001815115 max=0.001830939
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.005654055 avg=0.001884685 cnt=3
dotest: BNCDMP min=0.001883760 max=0.001885919
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.025247095 avg=0.008415698 cnt=3
dotest: BNCDMP min=0.008410631 max=0.008423921
dotest: BNCDMP alg_0f -- nullsrc tot=0.085769224 avg=0.028589741 cnt=3
dotest: BNCDMP min=0.028477846 max=0.028683057
dotest: BNCDMP alg_1a -- pair tot=0.090740003 avg=0.030246667 cnt=3
dotest: BNCDMP min=0.030003776 max=0.030385588
dotest: BNCDMP alg_1b -- epair tot=0.093591309 avg=0.031197103 cnt=3
dotest: BNCDMP min=0.030324733 max=0.032524565
dotest: BNCDMP alg_1c -- betap tot=0.091931228 avg=0.030643742 cnt=3
dotest: BNCDMP min=0.030357306 max=0.031191412
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.007939126 avg=0.002646375 cnt=3
dotest: BNCDMP min=0.002508210 max=0.002853244
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.025939159 avg=0.008646386 cnt=3
dotest: BNCDMP min=0.008606238 max=0.008683529
dotest: BNCDMP datagen_0b tot=0.136931619
dotest: BNCDMP dotest tot=0.956365745
dotest: 500000000 5000000
dotest: BNCDMP alg_0a tot=0.737332506 avg=0.245777502 cnt=3
dotest: BNCDMP min=0.244778177 max=0.247548555
dotest: BNCDMP alg_0b -- betanull tot=0.018095312 avg=0.006031770 cnt=3
dotest: BNCDMP min=0.005912708 max=0.006225743
dotest: BNCDMP alg_0c -- betafixed tot=0.028059365 avg=0.009353121 cnt=3
dotest: BNCDMP min=0.009303443 max=0.009407530
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.029024875 avg=0.009674958 cnt=3
dotest: BNCDMP min=0.009550901 max=0.009752188
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.127149609 avg=0.042383203 cnt=3
dotest: BNCDMP min=0.042218860 max=0.042529218
dotest: BNCDMP alg_0f -- nullsrc tot=0.724878907 avg=0.241626302 cnt=3
dotest: BNCDMP min=0.240794352 max=0.242174302
dotest: BNCDMP alg_1a -- pair tot=0.764044535 avg=0.254681511 cnt=3
dotest: BNCDMP min=0.253329522 max=0.256864373
dotest: BNCDMP alg_1b -- epair tot=0.769463084 avg=0.256487694 cnt=3
dotest: BNCDMP min=0.254830714 max=0.258763409
dotest: BNCDMP alg_1c -- betap tot=0.765345462 avg=0.255115154 cnt=3
dotest: BNCDMP min=0.254364352 max=0.256134647
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.039104441 avg=0.013034813 cnt=3
dotest: BNCDMP min=0.012103513 max=0.014354033
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.130221038 avg=0.043407012 cnt=3
dotest: BNCDMP min=0.043143231 max=0.043752516
dotest: BNCDMP datagen_0b tot=2.060880641
dotest: BNCDMP dotest tot=6.611719277
dotest: 1000000000 10000000
dotest: BNCDMP alg_0a tot=1.726930574 avg=0.575643524 cnt=3
dotest: BNCDMP min=0.575218786 max=0.576291884
dotest: BNCDMP alg_0b -- betanull tot=0.035615393 avg=0.011871797 cnt=3
dotest: BNCDMP min=0.011820026 max=0.011948646
dotest: BNCDMP alg_0c -- betafixed tot=0.056452922 avg=0.018817640 cnt=3
dotest: BNCDMP min=0.018590739 max=0.019195537
dotest: BNCDMP alg_0d -- beta_fixed_index tot=0.057788343 avg=0.019262781 cnt=3
dotest: BNCDMP min=0.019061426 max=0.019560949
dotest: BNCDMP alg_0e -- beta_seq_index tot=0.253575597 avg=0.084525199 cnt=3
dotest: BNCDMP min=0.084169403 max=0.084902168
dotest: BNCDMP alg_0f -- nullsrc tot=1.718326633 avg=0.572775544 cnt=3
dotest: BNCDMP min=0.571082648 max=0.575134694
dotest: BNCDMP alg_1a -- pair tot=1.792905583 avg=0.597635194 cnt=3
dotest: BNCDMP min=0.590378177 max=0.603253947
dotest: BNCDMP alg_1b -- epair tot=1.797667694 avg=0.599222564 cnt=3
dotest: BNCDMP min=0.589916620 max=0.609757778
dotest: BNCDMP alg_1c -- betap tot=1.794606586 avg=0.598202195 cnt=3
dotest: BNCDMP min=0.593164605 max=0.604739478
dotest: BNCDMP alg_1d -- beta_fixed_index tot=0.073755595 avg=0.024585198 cnt=3
dotest: BNCDMP min=0.024126694 max=0.025124542
dotest: BNCDMP alg_1e -- beta_seq_index tot=0.261664945 avg=0.087221648 cnt=3
dotest: BNCDMP min=0.086277263 max=0.087966703
dotest: BNCDMP datagen_0b tot=4.160519571
dotest: BNCDMP dotest tot=14.607990774
17:39:59.970197677 ph: complete (ELAPSED: 00:00:24.418215274)
beta is a sparse array that has (betasize - len) (~95%-99%) zeros and len non-zero elements, which are randomly chosen and will eventually equal 1 + 2.5*rand48()? Can anything be done about using a sparse matrix, or is this just a subset of the problem, where you need the rest of beta for other calculations?