Using std::vector in CUDA device code

Question

Is there a way to use the vector class in CUDA kernels? When I try, I get the following error:

error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed

So is there a way to use a vector inside a __global__ function? I recently tried the following:

Create a new CUDA project.
Go to the properties of the project.
Open CUDA C/C++.
Go to Device.
Set the value of "Code Generation" to compute_20,sm_20.

After that, I was able to use the printf standard library function in my CUDA kernel.

Is there a way to use the standard library class vector in kernel code, the way printf is supported? This is an example of using printf in kernel code:

// Counts the 3s in an array using CUDA.
// private_count holds each thread's result separately.
#include <cstdio>

__global__ void countKernel(int *a, int length, int* private_count)
{
    printf("%d\n", threadIdx.x);  // prints the thread id, and it works

    // vector<int> y;
    // y.push_back(0);  // is it possible to do this?

    // Each thread scans its own chunk of `length` elements.
    unsigned int offset = threadIdx.x * length;
    for (unsigned int i = offset; i < offset + length; i++)
    {
        if (a[i] == 3)
        {
            private_count[threadIdx.x]++;
            printf("%d ", a[i]);
        }
    }
}

+1, perfectly legitimate question (not sure why it was voted down). Unfortunately, the answer is currently no.
– harrism, Commented Apr 29, 2012 at 22:15
Sorry for necroposting. I just want to know whether there is an answer to this now.
– Aroli Marcellinus, Commented Oct 7, 2021 at 6:47

Edvard Pedersen · Accepted Answer · 2012-04-29 20:47:34Z

27

You can't use STL containers in CUDA device code, but you may be able to use the Thrust library to do what you want. Otherwise, just copy the contents of the vector to the device and operate on it there through a plain pointer.
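As a rough illustration of the "copy the contents to the device" route, here is a minimal sketch that counts the 3s in a host std::vector, in the spirit of the kernel from the question. The names countThreesKernel and countThrees, and the default of 4 threads, are made up for this example, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread scans its own chunk of `a` and stores its tally in private_count.
__global__ void countThreesKernel(const int* a, int n, int chunk, int* private_count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * chunk;
    int count = 0;
    for (int i = begin; i < begin + chunk && i < n; ++i) {
        if (a[i] == 3) {
            ++count;
        }
    }
    private_count[tid] = count;
}

// Hypothetical host helper: copies the vector to the device, runs the kernel,
// and sums the per-thread tallies on the host.
int countThrees(const std::vector<int>& host_data, int threads = 4)
{
    int n = static_cast<int>(host_data.size());
    if (n == 0) return 0;
    int chunk = (n + threads - 1) / threads;  // elements per thread, rounded up

    int* d_data = nullptr;
    int* d_counts = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_counts, threads * sizeof(int));
    cudaMemcpy(d_data, host_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    countThreesKernel<<<1, threads>>>(d_data, n, chunk, d_counts);

    // This blocking copy on the default stream also waits for the kernel.
    std::vector<int> counts(threads);
    cudaMemcpy(counts.data(), d_counts, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_counts);

    int total = 0;
    for (int c : counts) total += c;
    return total;
}
```

The vector itself never reaches the device; only its contiguous storage (host_data.data()) is copied, and the kernel sees a plain pointer plus a length.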

answered Apr 29, 2012 at 20:47

Edvard Pedersen

thatWiseGuy Over a year ago

I don't see how this is supposed to help, because a thrust::device_vector cannot be used inside kernels, either.

paleonix · Accepted Answer · 2023-05-14 19:40:22Z

14

In the CUDA library Thrust, you can use thrust::device_vector<T> to define a vector on the device, and transferring data between a host STL vector and a device_vector is straightforward. You can refer to this useful link to find some useful examples.

Note, however, that device_vector itself cannot be used in device code either. Only its pointers/iterators can be used there.
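A small sketch of that pattern, assuming you want to run your own kernel on the data: the device_vector lives in host code, and the kernel only receives the raw pointer obtained with thrust::raw_pointer_cast. The names doubleKernel and doubleOnDevice are hypothetical.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <vector>

// The kernel only sees a raw pointer and a length, never the device_vector.
__global__ void doubleKernel(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2;
    }
}

// Hypothetical host helper: doubles every element on the GPU.
std::vector<int> doubleOnDevice(const std::vector<int>& host_data)
{
    int n = static_cast<int>(host_data.size());

    // Constructing from a host range copies the data to the device.
    thrust::device_vector<int> d_vec(host_data.begin(), host_data.end());

    if (n > 0) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        doubleKernel<<<blocks, threads>>>(thrust::raw_pointer_cast(d_vec.data()), n);
    }

    // Copy the results back into an ordinary host vector.
    std::vector<int> result(n);
    thrust::copy(d_vec.begin(), d_vec.end(), result.begin());
    return result;
}
```

For simple element-wise work like this, a Thrust algorithm such as thrust::transform would avoid the hand-written kernel altogether; the kernel is kept here only to show how the pointer is handed over.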

edited May 14, 2023 at 19:40

paleonix

answered May 3, 2013 at 16:02

Zhong Hu

Humam Helfawi · Accepted Answer · 2016-08-29 19:10:38Z

7

You can't use std::vector in device code; use an array instead.
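One way to read "use an array" for the push_back case in the question is a fixed-capacity per-thread array plus a counter. The sketch below assumes a small compile-time capacity (kMaxHits, a made-up constant) and silently drops anything beyond it; collectThreesKernel and findThrees are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kMaxHits = 8;  // hypothetical per-thread capacity

// Each thread records the indices of the 3s in its chunk.
__global__ void collectThreesKernel(const int* a, int n, int chunk,
                                    int* hits_out, int* hit_counts)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = tid * chunk;

    int hits[kMaxHits];  // per-thread storage, size fixed at compile time
    int count = 0;
    for (int i = begin; i < begin + chunk && i < n; ++i) {
        if (a[i] == 3 && count < kMaxHits) {
            hits[count++] = i;  // stands in for push_back
        }
    }

    hit_counts[tid] = count;
    for (int k = 0; k < count; ++k) {
        hits_out[tid * kMaxHits + k] = hits[k];
    }
}

// Hypothetical host helper: returns the indices of all 3s, in ascending order.
std::vector<int> findThrees(const std::vector<int>& host_data, int threads = 2)
{
    std::vector<int> result;
    int n = static_cast<int>(host_data.size());
    if (n == 0) return result;
    int chunk = (n + threads - 1) / threads;

    int* d_data = nullptr;
    int* d_hits = nullptr;
    int* d_counts = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_hits, threads * kMaxHits * sizeof(int));
    cudaMalloc(&d_counts, threads * sizeof(int));
    cudaMemcpy(d_data, host_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    collectThreesKernel<<<1, threads>>>(d_data, n, chunk, d_hits, d_counts);

    std::vector<int> hits(threads * kMaxHits);
    std::vector<int> counts(threads);
    cudaMemcpy(hits.data(), d_hits, hits.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(counts.data(), d_counts, counts.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hits);
    cudaFree(d_counts);

    // Gather the valid entries of each thread's slot, in thread order.
    for (int t = 0; t < threads; ++t) {
        for (int k = 0; k < counts[t]; ++k) {
            result.push_back(hits[t * kMaxHits + k]);
        }
    }
    return result;
}
```

The trade-off is that the capacity has to be chosen up front; if a thread can produce an unbounded number of results, this approach does not fit as written.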

edited Aug 29, 2016 at 19:10

Humam Helfawi

answered Apr 30, 2012 at 12:58

yyfn

Robin Lew · Accepted Answer · 2017-08-31 23:45:23Z

I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code. Operators new and delete are also supported. Here is an extremely simple prototype of a device vector in CUDA. It works, but it hasn't been tested sufficiently.

template<typename T>
class LocalVector
{
private:
    T* m_begin;
    T* m_end;

    size_t capacity;
    size_t length;

    // Double the capacity and move the existing elements into the new buffer.
    __device__ void expand() {
        capacity *= 2;
        size_t tempLength = (m_end - m_begin);
        T* tempBegin = new T[capacity];

        // Element-wise copy, so T does not have to be trivially copyable.
        for (size_t i = 0; i < tempLength; ++i) {
            tempBegin[i] = m_begin[i];
        }
        delete[] m_begin;
        m_begin = tempBegin;
        m_end = m_begin + tempLength;
        length = tempLength;
    }
public:
    // Members are initialized in declaration order: capacity, then length.
    __device__ explicit LocalVector() : capacity(16), length(0) {
        m_begin = new T[capacity];
        m_end = m_begin;
    }
    // Copying would free the same buffer twice, so it is disabled.
    LocalVector(const LocalVector&) = delete;
    LocalVector& operator=(const LocalVector&) = delete;

    __device__ T& operator[] (unsigned int index) {
        return *(m_begin + index);
    }
    __device__ T* begin() {
        return m_begin;
    }
    __device__ T* end() {
        return m_end;
    }
    __device__ ~LocalVector()
    {
        delete[] m_begin;
        m_begin = nullptr;
    }

    __device__ void add(T t) {
        if (length >= capacity) {
            expand();
        }

        // new T[] already default-constructed this slot, so assign into it.
        *m_end = t;
        m_end++;
        length++;
    }
    // Precondition: the vector is not empty.
    __device__ T pop() {
        // m_end points one past the last element, so step back before reading.
        m_end--;
        length--;
        return *m_end;
    }

    __device__ size_t getSize() {
        return length;
    }
};

Even if it is correct, it is likely to be slow because of the memory allocation inside the expand() function. Great effort though.

einpoklum · Accepted Answer · 2024-03-12 09:43:46Z

2

You can't use std::vector in device-side code. Why?

It's not marked to allow this

The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including std::vector, is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA did not support C++20 at the time of writing, plus, that constexprness is effectively limited.)
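To make the marking concrete, here is a minimal sketch: a function annotated __host__ __device__ can be called from a kernel, while an unannotated host function such as std::vector::push_back produces the error quoted in the question. The names clampToByte, clampKernel and clampOnDevice are invented for this example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Annotated for both host and device, so a kernel may call it.
__host__ __device__ inline int clampToByte(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

__global__ void clampKernel(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = clampToByte(data[i]);  // fine: clampToByte is __host__ __device__
    }
    // std::vector<int> v; v.push_back(0);  // rejected: push_back is a host-only function
}

// Hypothetical host helper: clamps every value to [0, 255] on the GPU.
std::vector<int> clampOnDevice(std::vector<int> values)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return values;

    int* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemcpy(d_data, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    clampKernel<<<blocks, threads>>>(d_data, n);

    cudaMemcpy(values.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return values;
}
```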

You probably don't really want to

The std::vector class uses allocators to obtain more memory when it needs to grow the storage for the vectors you create or add into. By default (i.e. if you use std::vector<T> for some T) - that allocation is on the heap. While this could be adapted to the GPU - it would be quite slow, and incredibly slow if each "CUDA thread" would dynamically allocate its own memory.

Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-api-wrappers or cuda-kat.
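If you go the pointer-and-size route, a tiny non-owning wrapper is often enough. The DeviceSpan type below is a hypothetical stand-in written for this sketch, not the span from cuda-api-wrappers, cuda-kat or any other library; sumKernel and sumOnDevice are likewise made-up names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical minimal span: a non-owning pointer + length pair.
template <typename T>
struct DeviceSpan {
    T* ptr;
    std::size_t len;

    __host__ __device__ T& operator[](std::size_t i) const { return ptr[i]; }
    __host__ __device__ std::size_t size() const { return len; }
};

// The span is passed to the kernel by value; only the pointer and size travel.
__global__ void sumKernel(DeviceSpan<const int> in, int* out)
{
    std::size_t i = static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < in.size()) {
        atomicAdd(out, in[i]);
    }
}

// Hypothetical host helper: sums the vector's elements on the GPU.
int sumOnDevice(const std::vector<int>& host_data)
{
    std::size_t n = host_data.size();
    if (n == 0) return 0;

    int* d_data = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_data, host_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    int threads = 128;
    int blocks = static_cast<int>((n + threads - 1) / threads);
    sumKernel<<<blocks, threads>>>(DeviceSpan<const int>{d_data, n}, d_out);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return result;
}
```

One atomicAdd per element is not how you would write a fast reduction; it is used here only to keep the kernel short.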

Another option, though a bit "heavier", is to use the NVIDIA Thrust library's device_vector class. Under the hood, it's quite different from the standard library vector though.

edited Mar 12, 2024 at 9:43

answered Feb 18, 2022 at 17:39

einpoklum

paleonix Over a year ago

device_vector can't be used in device code (its iterators can). By now there is an official implementation of span and mdspan in libcu++.

einpoklum Over a year ago

@paleonix : There is, but - I'm against libcu++ as a project. NVIDIA should not try to replace the standard library, nor does it make sense for people to try and use the standard library on CUDA devices. Only some small pieces of it, which should not pretend to be the full thing.
