Problem: For two calculated arrays I get two kinds of expected output:
- the correctly calculated output
- random numbers, stale values, or numbers from the other array
I'm using MATLAB R2016b with this CUDA version and GPU:
CUDADevice with properties:
Name: 'GeForce GT 525M'
Index: 1
ComputeCapability: '2.1'
SupportsDouble: 1
DriverVersion: 8
ToolkitVersion: 7.5000
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [65535 65535 65535]
SIMDWidth: 32
TotalMemory: 1.0737e+09
AvailableMemory: 947929088
MultiprocessorCount: 2
ClockRateKHz: 1200000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceSelected: 1
I now try to add and subtract two different arrays on the GPU and return the results to MATLAB.
MATLAB code:
n = 10;
as = [1,1,1];
bs = [10,10,10];
for i = 2:n+1
as(end+1,:) = [i,i,i];
bs(end+1,:) = [10,10,10];
end
as = as *1;
% Load the kernel
cudaFilename = 'add2.cu';
ptxFilename = 'add2.ptx';
% Check that the kernel files exist
if exist(cudaFilename, 'file') ~= 2 || exist(ptxFilename, 'file') ~= 2
error('CUDA FILES ARE NOT HERE');
end
kernel = parallel.gpu.CUDAKernel( ptxFilename, cudaFilename );
% Make sure we have sufficient blocks to cover all of the locations
kernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1];
kernel.GridSize = [ceil(n/kernel.MaxThreadsPerBlock),1];
% Call the kernel
outadd = zeros(n,1, 'single' );
outminus = zeros(n,1, 'single' );
[outadd, outminus] = feval( kernel, outadd,outminus, as, bs );
CUDA snippet:
#include "cuda_runtime.h"
#include "add_wrapper.hpp"
#include <stdio.h>
__device__ size_t calculateGlobalIndex() {
// Which block are we?
size_t const globalBlockIndex = blockIdx.x + blockIdx.y * gridDim.x;
// Which thread are we within the block?
size_t const localThreadIdx = threadIdx.x + blockDim.x * threadIdx.y;
// How big is each block?
size_t const threadsPerBlock = blockDim.x*blockDim.y;
// Which thread are we overall?
return localThreadIdx + globalBlockIndex*threadsPerBlock;
}
__global__ void addKernel(float *c, float *d, const float *a, const float *b)
{
int i = calculateGlobalIndex();
c[i] = a[i] + b[i];
d[i] = a[i] - b[i];
}
// C = A + B
// D = A - B
void addWithCUDA(float *cpuC,float *cpuD, const float *cpuA, const float *cpuB, const size_t sz)
{
//TODO: add error checking
// choose which GPU to run on
cudaSetDevice(0);
// allocate GPU buffers
float *gpuA, *gpuB, *gpuC, *gpuD;
cudaMalloc((void**)&gpuA, sz*sizeof(float));
cudaMalloc((void**)&gpuB, sz*sizeof(float));
cudaMalloc((void**)&gpuC, sz*sizeof(float));
cudaMalloc((void**)&gpuD, sz*sizeof(float));
cudaCheckErrors("cudaMalloc fail");
// copy input vectors from host memory to GPU buffers
cudaMemcpy(gpuA, cpuA, sz*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(gpuB, cpuB, sz*sizeof(float), cudaMemcpyHostToDevice);
// launch kernel on the GPU with one thread per element
addKernel<<<1,sz>>>(gpuC, gpuD, gpuA, gpuB);
// wait for the kernel to finish
cudaDeviceSynchronize();
// copy output vector from GPU buffer to host memory
cudaMemcpy(cpuC, gpuC, sz*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(cpuD, gpuD, sz*sizeof(float), cudaMemcpyDeviceToHost);
// cleanup
cudaFree(gpuA);
cudaFree(gpuB);
cudaFree(gpuC);
cudaFree(gpuD);
}
void resetDevice()
{
cudaDeviceReset();
}
outadd and outminus are two gpuArray objects in MATLAB after running the code.
outadd is always calculated correctly; outminus is rarely correct, mostly containing random integers or floats, zeros, or sometimes even the values of outadd.
If I swap the order of the two arithmetic operations, the other output becomes the correct one, so shouldn't outminus be calculated correctly as well?
kernel.MaxThreadsPerBlock is 1024. Since n is 10, your kernel launches one block of 1024 threads, even though you only need 10. The extra threads may access your arrays out of bounds, so you should pass n as a scalar parameter to the kernel, and inside the kernel you should test i against n. You might want to study the MATLAB example here. Also check the type: I think you are passing as and bs as double, but your kernel expects float (i.e. single). Finally, since you are using the PTX interface, your CUDA file need not contain any host code; your addWithCUDA function is completely extraneous.
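A minimal sketch of the bounds-checked kernel described above, assuming you add n as a trailing int argument and cast the MATLAB inputs to single before the call (the MATLAB call shown in the comment is an illustrative assumption, not tested code):

```cuda
#include <cstddef>

__device__ size_t calculateGlobalIndex() {
    // Same index computation as in the question
    size_t const globalBlockIndex = blockIdx.x + blockIdx.y * gridDim.x;
    size_t const localThreadIdx   = threadIdx.x + blockDim.x * threadIdx.y;
    size_t const threadsPerBlock  = blockDim.x * blockDim.y;
    return localThreadIdx + globalBlockIndex * threadsPerBlock;
}

// MATLAB side (sketch):
//   [outadd, outminus] = feval(kernel, outadd, outminus, single(as), single(bs), n);
__global__ void addKernel(float *c, float *d,
                          const float *a, const float *b, const int n)
{
    size_t const i = calculateGlobalIndex();
    // Guard: with n = 10 and a 1024-thread block, threads 10..1023
    // would otherwise read and write past the end of the arrays.
    if (i >= (size_t)n) return;
    c[i] = a[i] + b[i];
    d[i] = a[i] - b[i];
}
```

With the guard in place, the out-of-range threads simply return, so outminus is no longer overwritten by out-of-bounds accesses regardless of the order of the two operations.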