
Problem: For two computed arrays I get two different kinds of output:

  1. The correctly calculated result
  2. Random numbers, stale values, or numbers from the other array

I'm using MATLAB R2016b with this CUDA version and GPU:

 CUDADevice with properties:

                  Name: 'GeForce GT 525M'
                 Index: 1
     ComputeCapability: '2.1'
        SupportsDouble: 1
         DriverVersion: 8
        ToolkitVersion: 7.5000
    MaxThreadsPerBlock: 1024
      MaxShmemPerBlock: 49152
    MaxThreadBlockSize: [1024 1024 64]
           MaxGridSize: [65535 65535 65535]
             SIMDWidth: 32
           TotalMemory: 1.0737e+09
       AvailableMemory: 947929088
   MultiprocessorCount: 2
          ClockRateKHz: 1200000
           ComputeMode: 'Default'
  GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
      CanMapHostMemory: 1
       DeviceSupported: 1
        DeviceSelected: 1

I am trying to add and subtract two different arrays on the GPU and return the results to MATLAB.

MATLAB code:

n = 10;
as = [1,1,1];
bs = [10,10,10];

for i = 2:n+1
  as(end+1,:) = [i,i,i];
  bs(end+1,:) = [10,10,10];
end
as = as *1;

% Load the kernel
cudaFilename = 'add2.cu';
ptxFilename = 'add2.ptx';

% Check that the files are available
if exist(cudaFilename, 'file') ~= 2 || exist(ptxFilename, 'file') ~= 2
  error('CUDA FILES ARE NOT HERE');
end
kernel = parallel.gpu.CUDAKernel( ptxFilename, cudaFilename );

% Make sure we have sufficient blocks to cover all of the locations
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1];
kernel.GridSize = [ceil(n/kernel.MaxThreadsPerBlock),1];

% Call the kernel
outadd = zeros(n,1, 'single' );
outminus = zeros(n,1, 'single' );
[outadd, outminus] = feval( kernel, outadd,outminus, as, bs );

CUDA snippet:

#include "cuda_runtime.h"
#include "add_wrapper.hpp"
#include <stdio.h>

__device__ size_t calculateGlobalIndex() {
    // Which block are we?
    size_t const globalBlockIndex = blockIdx.x + blockIdx.y * gridDim.x;
    // Which thread are we within the block?
    size_t const localThreadIdx = threadIdx.x + blockDim.x * threadIdx.y;
    // How big is each block?
    size_t const threadsPerBlock = blockDim.x*blockDim.y;
    // Which thread are we overall?
    return localThreadIdx + globalBlockIndex*threadsPerBlock;
}

__global__ void addKernel(float *c, float *d, const float *a, const float *b)
{
    int i = calculateGlobalIndex();
    c[i] = a[i] + b[i];
    d[i] = a[i] - b[i];
}

// C = A + B
// D = A - B
void addWithCUDA(float *cpuC,float *cpuD, const float *cpuA, const float *cpuB, const size_t sz)
{
    // TODO: add error checking

    // choose which GPU to run on
    cudaSetDevice(0);

    // allocate GPU buffers
    float *gpuA, *gpuB, *gpuC, *gpuD;
    cudaMalloc((void**)&gpuA, sz*sizeof(float));
    cudaMalloc((void**)&gpuB, sz*sizeof(float));
    cudaMalloc((void**)&gpuC, sz*sizeof(float));
    cudaMalloc((void**)&gpuD, sz*sizeof(float));
    cudaCheckErrors("cudaMalloc fail");  // error-checking macro defined elsewhere

    // copy input vectors from host memory to GPU buffers
    cudaMemcpy(gpuA, cpuA, sz*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gpuB, cpuB, sz*sizeof(float), cudaMemcpyHostToDevice);

    // launch kernel on the GPU with one thread per element
    addKernel<<<1,sz>>>(gpuC, gpuD, gpuA, gpuB);

    // wait for the kernel to finish
    cudaDeviceSynchronize();

    // copy output vectors from GPU buffers to host memory
    cudaMemcpy(cpuC, gpuC, sz*sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(cpuD, gpuD, sz*sizeof(float), cudaMemcpyDeviceToHost);

    // cleanup
    cudaFree(gpuA);
    cudaFree(gpuB);
    cudaFree(gpuC);
    cudaFree(gpuD);
}

void resetDevice()
{
    cudaDeviceReset();
}

After running the code, [outadd, outminus] are two gpuArray objects in MATLAB.

outadd is always calculated correctly; outminus is rarely correct, mostly containing random integers or floats, zeros, or sometimes even the values of outadd.

If I swap the order of the two arithmetic operations, the other output becomes the correct one instead. So why isn't outminus calculated correctly?

  • kernel.MaxThreadsPerBlock is 1024. Since n is 10, your kernel will launch 1 block of 1024 threads, even though you only need 10. Those extra threads may access your arrays out-of-bounds, so you should pass n as a scalar parameter to the kernel, and in your kernel you should test i against n. You might want to study this matlab example here. Commented Dec 22, 2016 at 18:36
  • @Robert Crovella I think that's exactly where I got stuck; I'll redo it with a limit on the threads. Thank you! Commented Dec 22, 2016 at 18:48
  • also, aren't most matlab variables double type? I think you are passing as and bs as double type, but your kernel is expecting float (ie. single) type there. Also, since you are using the PTX interface, I'm fairly certain that your CUDA file need not have any host code in it, i.e. your addWithCUDA function is completely extraneous/superfluous. Commented Dec 22, 2016 at 18:52
  • @Robert Crovella I now only let threads with index i < n do the work, and it works! Thank you!! Commented Dec 22, 2016 at 19:06

1 Answer


Following @Robert Crovella's hint that the unnecessary extra threads cause out-of-bounds accesses, I simply added a limit on the threads.

MATLAB

[outadd, outminus] = feval( kernel, outadd,outminus, as, bs, n);

CUDA kernel:

__global__ void addKernel(float *c, float *d, const float *a, const float *b, const int n)
{
    int i = calculateGlobalIndex();
    if ( i < n ){
        c[i] = a[i] + b[i];
        d[i] = a[i] - b[i];
    }
}

I think it's still not the best solution, because the GPU still launches all 1024 threads even though most of them do no work.

Once I have reworked it properly, I'll upload it here.
