
For practice, I'm working on a simple matrix-initialization program in CUDA. I wrote a small sequential version for reference as a starting point: it just creates an n-by-m array and fills it with doubles. I've been reading through other posts and documentation, but I'm still confused, and I was hoping someone could explain how to initialize a 2D array of n-by-m size in CUDA, similar to what I did below. I would also appreciate insight on how to fill that CUDA matrix, if anyone is willing to explain.

Hi again. Regarding this being a possible duplicate, I should elaborate: the linked post doesn't really explain anything. It's just sample code, and it's one of the posts I had already viewed but don't understand, because nothing in it is explained. Thank you.

Sequential version:

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <assert.h>

int n,m, i, j;
double count;

void update(int n, int m, double arr[][m]){
  for(i=0; i<n; i++){
    for(j=0; j<m; j++){
      count++;
      arr[i][j] = count;
    }
  }
}


int main(int argc, char * argv[]) {
  assert(argc==3);
  n = atoi(argv[1]);
  m = atoi(argv[2]);

  double (*arr)[n][m] = malloc(sizeof *arr);
  update(n, m, *arr);   /* *arr decays to double (*)[m], matching update's parameter */
  free(arr);
  return 0;
}
  • double (*arr)[m] = malloc(n * sizeof *arr); – I don't think that does what you think it does. And why are all the variables global? Commented Nov 28, 2018 at 6:27
  • Detail: in C, arr in double (*arr)[m] is not a 2D array, but a pointer to an array m of doubles. double arr[n][m] is a 2D array. Commented Nov 28, 2018 at 6:27
  • @chux pssst ... pointer to array of m doubles. Commented Nov 28, 2018 at 6:29
  • To create a pointer to a 2D array, C code could use double (*arr)[n][m] = malloc(sizeof *arr); Commented Nov 28, 2018 at 6:31
  • @Swordfish On second thought, perhaps OP does want double (*arr)[m] = malloc(n * sizeof *arr); even if it is not a proper 2D array. Hmmm, it's late. Commented Nov 28, 2018 at 6:38

1 Answer

You can simulate a 2D array with a 1D array by storing the data row by row, so the 2D array [a,b][c,d] becomes [a,b,c,d]. To keep things simple, you can write a wrapper class that provides this indexing.

Here is a demo of this idea (not 100% disaster-proof, but working):

#pragma once
#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

typedef int TYPE;

// NOTE: a copy constructor and operator= would need to be overloaded as well
template<class T>
struct Matrix
{
    Matrix(int r, int c) : rows(r), cols(c) {
        data = new T[r*c];
    }
    ~Matrix() {
        // As we allocated memory it needs to be freed upon destruction
        delete[] data;
        data = nullptr;
    }
    int rows, cols;
    T* data;
    T* operator[](int row) {
        // Returns pointer to "ROW", further call to [] on result will retrieve item at column in this row
        return data + (row*cols);
    }
};

// Simple cuda kernel 
__global__ void add(TYPE *a, TYPE *b, TYPE *c, int rows, int cols) {
    // Get element row and col
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // If kernel block/grid is not sized perfectly make sure not to step outside data bounds
    if(row < rows && col < cols)
    {
        int idx = row*cols + col;
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    // m3 = m1 + m2 using cuda
    int rows = 5, cols = 5, total = rows * cols;
    Matrix<TYPE> m1{ rows,cols }, m2{ rows,cols }, m3{ rows,cols };

    // Initialization as 1D array
    for(int i = 0; i < total; i++)  {
        m1.data[i] = i;
    }

    // Or initialization as 2D array
    for(int r = 0; r < rows; r++)
        for(int c = 0; c < cols; c++)
            m2[r][c] = r*cols + c + 100;

    for(int i = 0; i < total; i++)  std::cout << m1.data[i] << ", ";
    std::cout << "\n";

    for(int r = 0; r < rows; r++) {
        for(int c = 0; c < cols; c++) 
            std::cout << m2[r][c] << ", ";
        std::cout << "\n";
    }

    // CUDA part
    TYPE *d_m1, *d_m2, *d_m3;

    // Allocation
    cudaMalloc((void **) &d_m1, total * sizeof(TYPE));
    cudaMalloc((void **) &d_m2, total * sizeof(TYPE));
    cudaMalloc((void **) &d_m3, total * sizeof(TYPE));

    // Copy m1 and m2 to GPU
    cudaMemcpy(d_m1, m1.data, total * sizeof(TYPE), cudaMemcpyHostToDevice);
    cudaMemcpy(d_m2, m2.data, total * sizeof(TYPE), cudaMemcpyHostToDevice);

    // Oversized on purpose to show row/col guard on add kernel
    dim3 grid(5, 5);
    dim3 block(5, 5);
    add <<< grid, block >>> (d_m1, d_m2, d_m3, rows, cols);

    // Copy result to m3
    cudaMemcpy(m3.data, d_m3, total * sizeof(TYPE), cudaMemcpyDeviceToHost);

    cudaFree(d_m1);
    cudaFree(d_m2);
    cudaFree(d_m3);

    for(int r = 0; r < rows; r++) {
        for(int c = 0; c < cols; c++)
            std::cout << m3[r][c] << ", ";
        std::cout << "\n";
    }

    system("pause"); // Windows-only; remove on other platforms
    return 0;
}

1 Comment

I don't understand the logic of going to the trouble of writing a wrapper class and then not using it on both the host and the device. Further, the OP is complaining that the many other answers to this question already on Stack Overflow are apparently too hard to understand because of limited explanation; your answer suffers from the same problem.
