CUDA copy linked lists from device to host

Question

I am trying to populate a number of linked lists on the device and then return those lists back to the hosts.

From my understanding I need to allocate memory for my struct Element, but I don't know how to go about it since I will have many linked lists, each with an unknown number of elements. I've tried a couple of different things but it still didn't work. So I'm back to the starting point. Here is my code:

//NODE CLASS
class Node{
public:
    int x,y;
    Node *parent;
    __device__ __host__ Node(){}
    __device__ __host__ Node(int cX, int cY){x = cX; y = cY;}
    __device__ __host__ int get_row() { return x; }
    __device__ __host__ int get_col() { return y; }
};

//LINKED LIST
class LinkedList{
public:
    __device__ __host__ struct Element{
            Node n1;
            Element *next;
    };

    __device__ __host__ LinkedList(){
        head = NULL;
    }

    __device__ __host__ void addNode(Node n){
        Element *el = new Element();
        el->n1 = n;

        el->next = head;
        head = el;
    }

    __device__ __host__ Node popFirstNode(){
        Element *cur = head;
        Node n;

        if(cur != NULL){
            n = cur -> n1;
            head = head -> next;
        }

        delete cur;
        return n;
    }
    __device__ __host__ bool isEmpty(){
        Element *cur = head;

        if(cur == NULL){
            return true;
        }

        return false;
    }

    Element *head;
};

//LISTS
__global__ void listsKernel(LinkedList* d_Results, int numLists){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    Node n(1,1);

    if(idx < numLists){
        d_Results[idx].addNode(n);
        d_Results[idx].addNode(n);
        d_Results[idx].addNode(n);
        d_Results[idx].addNode(n);
    }
}

int main(){
    int numLists = 10;
    size_t size = numLists * sizeof(LinkedList); 
    LinkedList curList;

    LinkedList* h_Results = (LinkedList*)malloc(size);
    LinkedList* d_Results;
    cudaMalloc((void**)&d_Results, size);

    listsKernel<<<256,256>>>(d_Results, numLists);
    cudaMemcpy(h_Results, d_Results, sizeof(LinkedList)*numLists, cudaMemcpyDeviceToHost);

    for(int i = 0; i < numLists; i++){
        curList = h_Results[i];
        while(curList.isEmpty() == false){
            Node n = curList.popFirstNode();
            std::cout << "x: " << n.get_row() << " y: " << n.get_col();
        }
    }
}

As you can see I'm trying to populate 10 linked lists on the device and then return them back to the host, but the code above results in unhandled exception - Access violation reading location. I am assuming it is not coping the pointers from the device.

Any help would be great.

What is listsKernel? Where does the host access violation occur? — talonmies
– talonmies, Commented Mar 7, 2015 at 19:53
memory allocated with in-kernel new cannot take part in a device->host cudaMemcpy operation, although that is just one of several issues with your approach. This blog may be of interest. — Robert Crovella
– Robert Crovella, Commented Mar 7, 2015 at 22:48
One scheme that makes it trivial to copy lists between the host and the device is to use an array to store the list nodes, and use array indices rather that pointers for linkage. This also allows one to keep all list nodes in one contiguous storage block. This memory block can trivially be copied on the host, or between host and device, or send to and retrieved from mass storage, and all linkage between nodes remains valid. Storage can be extended later by allocating a larger memory block and copying the existing list there. Let me know if you would like me to post an answer along those lines. — njuffa
– njuffa, Commented Mar 8, 2015 at 0:01
I think the best bet you have is to rethink your approach and use Arrays of indices instead of pointers. Be aware of how warps work in Nvidia GPUs however - if your code is light on computations (which yours appears to be), reading memory in random access fashion will reduce your memory bandwidth a lot. So optimise your datastructures as much as possible according to your computations such that you can read out your data in coalesced fashion. — Michel Müller
– Michel Müller, Commented Mar 8, 2015 at 6:24
njuffa I would really like to see your solution. Thank you all for your comments. — Jakub
– Jakub, Commented Mar 8, 2015 at 10:35

user1084944 · Accepted Answer · 2015-03-07 22:34:59Z

1

Just eyeballing the code, it seems you have a fundamental misconception: there is host memory which cannot be accessed from the device, and device memory which cannot be accessed from the host. So when you create linked list nodes in device memory and copy the pointers back to the host, the host cannot dereference those pointers, because they are pointing to device memory.

If you truly want to pass linked lists back and forth between host and device, your best bet is probably to copy the entries into an array, do the memcpy, then copy the array back into a linked list. Other things can be done too, depending on just what your use case is.

(it is possible to allocate a region of memory that is accessible both from the host and from the device, but there is some awkwardness with it and I have no experience using it)

answered Mar 7, 2015 at 22:34

user1084944

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

CUDA copy linked lists from device to host

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related