implement SIMD in C++

Question

I'm working on a bit of code and I'm trying to optimize it as much as possible, basically get it running under a certain time limit.

The following makes the call...

static affinity_partitioner ap;
parallel_for(blocked_range<size_t>(0, T), LoopBody(score), ap);

... and the following is what is executed.

void operator()(const blocked_range<size_t> &r) const {

    int temp;
    int i;
    int j;
    size_t k;
    size_t begin = r.begin();
    size_t end = r.end();

    for(k = begin; k != end; ++k) { // for each trainee
        temp = 0;
        for(i = 0; i < N; ++i) { // for each sample
            int trr = trRating[k][i];
            int ei = E[i];              
            for(j = 0; j < ei; ++j) { // for each expert
                temp += delta(i, trr, exRating[j][i]);
            }
        }           
        myscore[k] = temp;
    }
}

I'm using Intel's TBB to optimize this. But I've also been reading about SIMD and SSE2 and things of that nature. So my question is, how do I store the variables (i, j, k) in registers so that they can be accessed faster by the CPU? I think the answer has to do with implementing SSE2 or some variation of it, but I have no idea how to do that. Any ideas?

Edit: This will be run on a Linux box, but using Intel's compiler, I believe. If it helps, I have to run the following commands before I do anything to make sure the compiler works:

source /opt/intel/Compiler/11.1/064/bin/intel64/iccvars_intel64.csh
source /opt/intel/tbb/2.2/bin/intel64/tbbvars.csh

and then to compile I do:

icc -ltbb test.cxx -o test

If there's no easy way to implement SSE2, any advice on how to further optimize the code?

Thanks, Hristo

@zdav: The semantics of C++ preclude vectorization, as pointers by default may be unaligned or aliased.
– Potatoswatter, Commented Apr 29, 2010 at 16:58
ICC allows you to provide hints embedded in the code to help it do a better job of vectorisation. Of course, if you have no control over the alignment of the supplied data etc., then this isn't going to help much.
– Paul R, Commented Apr 29, 2010 at 18:20

Jack Lloyd · Accepted Answer · 2010-04-29 17:08:50Z

1

Your question reflects some confusion about what is going on. The i, j, k variables are almost certainly held in registers already, assuming you are compiling with optimizations on (which you should do: add "-O2" to your icc invocation).

You can use an asm block, but an easier method considering you're already using ICC is to use the SSE intrinsics. Intel's documentation for them is here - http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1019.htm

It looks like you can SIMD-ize the top-level loop, though it's going to depend greatly on what your delta function is.
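As a rough illustration of the intrinsics approach, here is a minimal sketch that sums a delta over a contiguous run of ratings four ints at a time. The real delta function is not shown in the question, so delta_scalar below is a hypothetical stand-in (1 when the ratings differ, 0 otherwise), and sum_delta_sse2 is an illustrative name, not part of any library. It also assumes the expert ratings being compared are contiguous in memory, which is not how exRating[j][i] is walked in the original inner loop.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Hypothetical stand-in for delta(): 1 when two ratings differ, 0 otherwise.
static inline int delta_scalar(int a, int b) { return a != b ? 1 : 0; }

// Sums delta_scalar(trr, ex[j]) for j in [0, n), processing four ints per step.
int sum_delta_sse2(int trr, const int* ex, std::size_t n) {
    const __m128i vtrr = _mm_set1_epi32(trr);   // broadcast trr to all 4 lanes
    __m128i equal_count = _mm_setzero_si128();  // per-lane count of equal pairs
    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {
        // Unaligned load, since nothing is assumed about the alignment of ex.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ex + j));
        __m128i eq = _mm_cmpeq_epi32(v, vtrr);         // lane is -1 where equal
        equal_count = _mm_sub_epi32(equal_count, eq);  // subtracting -1 adds 1
    }
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), equal_count);
    int equal = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    int total = static_cast<int>(j) - equal;  // mismatches in the vector part
    for (; j < n; ++j) {                      // scalar tail for leftovers
        total += delta_scalar(trr, ex[j]);
    }
    return total;
}
```

Whether anything like this applies to the real code depends on what delta actually computes and on whether the data can be laid out so that the innermost loop reads consecutive elements.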

answered Apr 29, 2010 at 17:08

Jack Lloyd


2 Comments

Hristo Over a year ago

I have no control over the icc invocation. This is a homework assignment so I'm very limited in what I can do. I can't even edit the delta function which can totally be optimized. I'll fiddle around with the asm block idea. Thanks.

Potatoswatter Over a year ago

@Hristo: Intrinsics should give you less trouble than the asm block. But do look into auto-vectorization. You should be able to find #pragma commands that emulate control over command-line flags.

jwismar · Accepted Answer · 2010-04-29 16:59:01Z

1

When you want to use assembly language within a C++ module, you can just put it inside an asm block, and continue to use your variable names from outside the block. The assembly instructions you use within the asm block will specify which register etc. is being operated on, but they will vary by platform.
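For reference, a minimal sketch of what such a block can look like, using the GCC-style extended asm syntax (which ICC on Linux also accepts). It assumes an x86 or x86-64 target and falls back to plain C++ elsewhere; add_asm is an illustrative name only. The compiler picks the registers for the "r" operands, so the code names operands rather than hard-coding a register.

```cpp
// Adds b to a using a single x86 instruction inside an asm block.
int add_asm(int a, int b) {
#if defined(__x86_64__) || defined(__i386__)
    // AT&T syntax: "addl src, dst". %0 is a (read and written), %1 is b.
    __asm__("addl %1, %0"
            : "+r"(a)   // output: a, kept in a register, read-write
            : "r"(b)    // input: b, in any general-purpose register
            : "cc");    // the add modifies the condition flags
#else
    a += b;             // portable fallback on other platforms
#endif
    return a;
}
```

Calling add_asm(2, 3) returns 5. For a simple loop counter this buys nothing over what the optimizer already does, which is the point made in the other answers.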

answered Apr 29, 2010 at 16:59

jwismar


Potatoswatter · Accepted Answer · 2010-04-29 16:56:56Z

0

If you're using GCC, see http://gcc.gnu.org/projects/tree-ssa/vectorization.html for how to help the compiler auto-vectorize your code, and examples.

Otherwise, you need to let us know what platform you are using.
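As a sketch of the kind of loop an auto-vectorizer tends to handle well: a single inner loop over contiguous data, an inlinable body, and pointers marked as non-aliasing. The names score_row and delta_inline are hypothetical, the squared-difference delta is only a placeholder for the real function, and __restrict is a compiler extension (accepted by GCC and ICC) rather than standard C++.

```cpp
#include <cstddef>

// Placeholder for delta(): visible in the same file so it can be inlined.
static inline int delta_inline(int a, int b) {
    int d = a - b;
    return d * d;
}

// Sums delta over n contiguous samples. __restrict promises the compiler
// that tr and ex do not overlap, which removes one obstacle to vectorizing.
int score_row(const int* __restrict tr, const int* __restrict ex,
              std::size_t n) {
    int sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += delta_inline(tr[i], ex[i]);  // unit-stride reads, simple reduction
    }
    return sum;
}
```

In the question's code the innermost loop runs over j while the arrays are indexed [j][i], so consecutive iterations touch different rows. Getting to a shape like the one above would mean making i the innermost index, and that is only straightforward if the per-sample expert count E[i] can be handled separately.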

answered Apr 29, 2010 at 16:56

Potatoswatter


2 Comments

Hristo Over a year ago

This will be run on a Linux box, but using Intel's compiler I believe. If it helps, I have to run the following commands before I do anything to make sure the compiler works... source /opt/intel/Compiler/11.1/064/bin/intel64/iccvars_intel64.csh; source /opt/intel/tbb/2.2/bin/intel64/tbbvars.csh ... and then to compile I do: icc -ltbb test.cxx -o test

Potatoswatter Over a year ago

advogato.org/article/871.html is old but looks quite relevant. Try the flags -xW -O2 -vec-report3, and see man icc and search for "vector".

Puppy · Accepted Answer · 2010-04-29 17:42:49Z

0

The compiler should be doing this for you. For example, in VC++ you can simply turn on SSE2.

answered Apr 29, 2010 at 17:42

Puppy


1 Comment

tim18 Over a year ago

SSE2 auto-vectorization is a default option for icpc and an entirely normal one for g++. From the looks of what little you show, it may require swapping the inner loops as well as permitting inline expansion of the function. This goes even more so for the question about explicit SIMD. Why
