0

I have a code I'm trying to parallelize, but I'm finding that the number of threads I use gives me an approximately 0.5x slow down. For instance, I use 4 threads and it runs twice as slow.

-edit: sorry, had the wrong portion of the program in here before.

 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

!Compute Dynamic Structure Factor of Q,T=const
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
SUBROUTINE COMPUTE_SQ(i_vmax,ION_COUNT,ions,t2,lines,lf,rx,ry,rz,r,x,y,z,s,simtype,vmax,q)

IMPLICIT NONE

Integer*4, INTENT(IN)                   ::i_vmax,ION_COUNT,ions,t2,lines
Real*8,    INTENT(IN)                   ::lf,vmax
Integer*4, INTENT(IN), DIMENSION(1:51200)       ::rx,ry,rz
Real*4,    INTENT(IN), DIMENSION(1:51200,0:1000)    ::x,y,z
Real*8,    INTENT(INOUT),DIMENSION(1:i_vmax)        ::s
Complex*16,INTENT(INOUT),DIMENSION(1:i_vmax,2)      ::r
Complex*16,INTENT(INOUT),DIMENSION(1:i_vmax,2)      ::q
Real*8,            DIMENSION(1:i_vmax)      ::si,co
Integer*4                       ::k,i,p_start,p_end
Real*8                          ::dotprod,co_temp,si_temp
Character*5,INTENT(IN)                  ::simtype

!!!!!!RE-INITIALIZE VARIABLES

Do 300 k=1,i_vmax
    !if (mag(k).gt.vmax) then
    !cycle
    !endif
    co(k)=0
    si(k)=0
    co_temp=0
    si_temp=0
    write(*,*) vmax 
300 continue

!!!!!!!!!!!!!!!!!!!!

if (simtype.eq.'pfrac') then
    p_start=30721
    p_end=51200
else if (simtype.eq.'nfrac') then
    p_start=0
    p_end=30720
else
    write (*,*) 'simtype not specified'
endif

!!!!!!!!!!!!!!!!!!!!!!

Do 31 k=1,i_vmax
!    if (mag(k).gt.vmax) then
!    cycle
!   endif
    co_temp=0
    si_temp=0

    !$OMP PARALLEL DO PRIVATE(dotprod,Qcur,co_temp,si_temp)
        Do 41 i=p_start,p_end
         dotprod=(rx(k)*x(i,t2)+ry(k)*y(i,t2)+rz(k)*z(i,t2))*lf
         co_temp=co_temp+COS(dotprod) !Qcur/Qavg
         si_temp=si_temp+SIN(dotprod) !Qcur/Qavg
    41 continue
    !$OMP END PARALLEL DO
     q(k,2)= cmplx(co_temp,si_temp)
     r(k,2)=r(k,2)+q(k,2)
     s(k)=s(k) +(q(k,1) * conjg(q(k,2)))
     s(k)=s(k)/(p_end-p_start+1)**2
     !r(k,2)=r(k,2)/(p_end-p_start+1)
31 continue
RETURN
END SUBROUTINE COMPUTE_SQ

Here is the portion of the code that's relevant. At first I had the OMP part over the entire subroutine, but I thought they all might be trying to read the same values and it was slowing it down, but that doesn't seem to be the case since its the same speed regardless of which loop it goes over.

For reference, the inner loop is over about 20,000 iterations and the outer about 1000.

I'm using the Intel compiler 4.1.40 with the flags -mcmodel=medium -shared-intel (since it uses >2GB of mem) and -openmp of course. I've tried it with 1,2,4,8,16 cores and each successive doubling of cores gives me ~ 1.5x the time to run.

Any ideas appreciated!

5
  • 2
    Too many empty lines, too much commented out code, not enough indentation. I don't think that's slowing your code down but it is definitely making it unappealing to read. Commented Nov 22, 2013 at 20:33
  • can you post actual timings, and also the timing for the inner loop versus outer loop for a 1-thread run? Commented Nov 22, 2013 at 21:47
  • sorry, I use the first 4 lines, so everything that did have strucutred got shifted to the fourth line when I posted it here. Commented Nov 22, 2013 at 23:43
  • System won't let me answer my own question until 3 AM, but I found the solution: It was a problem with Cray's silly APLS system not the actual code. Seems the default on aprun is to run separate versions of your code identically on as many cores as you specify, need to use -d 16 (or whatever # cores) flag after the aprun. Commented Nov 22, 2013 at 23:45
  • Do you still believe your program is correct? I do not, see my answer. Commented Nov 23, 2013 at 8:00

1 Answer 1

2

Are you sure you are getting correct results at all? The values of the private variables are undefined on entering the parallel region and after exiting it. In your case the problematic ones are si_temp and co_temp. You should use the REDUCTION clause instead.

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks so much! I changed some things a while back and forgot to make those variables a reduction! Results changed substantially and make much more sense!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.