
C++ uses the global operator new to allocate coroutine frames by default. This potentially leaves a lot of performance on the table compared to Rust, which can stack-allocate them. That's disappointing because for most coroutine uses stack allocation would be fine -- often you co_await a coroutine right when creating it, or immediately co_await a join/combinator of it and several others. You can override operator new and operator delete for a promise type and write a custom allocator that does strict FIFO allocation over some preallocated heap area, but it would still generally be better to reuse the thread's already existing, already hot-in-cache stack.

AFAICT it is impossible to use alloca on the fly for this -- anything it allocates inside operator new would be freed when the operator function returns. You could preallocate a big chunk of space with alloca in some top-level function and then define operator new for the promise type to allocate out of that region, but from a cache-hotness perspective this is effectively the same as having a separate heap-allocated area (all of your coroutines use a special, otherwise cold region instead of being intermingled with your regular calls at the real top of the stack).

Is there any way to make alloca work?

There is a related question here about whether you can call alloca inside a coroutine, but I am asking about using it to back the allocation of the coroutine frame itself (which necessarily happens outside the coroutine) before running it.

There is also this question that asks open-endedly whether stackless C++ coroutines are a problem. Some answers there try to justify the design, but none mentions alloca, and none addresses the fact that the Rust model is an existence proof that stack-allocating coroutines is possible in principle.

  • Use placement new to allocate the coroutine in the space you allocate with alloca()? Commented Jan 16 at 18:44
  • @Barmar: but you don't know how much space you need until operator new is called Commented Jan 16 at 19:00
  • You can work out how much space you need with a combination of sizeof and alignof, no? Commented Jan 16 at 20:03
  • @catnip: Nope, the type of the coroutine frame is compiler-internal and there is no way to refer to it, even with decltype. The task type coroutines return is a wrapper around a coroutine_handle, which is a type-erased pointer to the coroutine frame. The coroutine/co_await machinery only ever delivers you coroutine_handles. Commented Jan 16 at 20:35
  • 1
    @JosephGarvin what I meant is that I see two approaches to obtain what you want: the first would be to move alloca on the calling function, and pass the pointer to the operator new. This would imply knowing the size required before the call to the operator new. The second approach would be to inline the function calling alloca, so the operator new. Usually this is prevented by the compiler, but there may be compiler attributes that allow you that. For example, look for always_inline here: github.com/gcc-mirror/gcc/blob/master/gcc/tree-inline.cc Commented Jan 22 at 16:52

2 Answers


Actually, yes, this is possible, but you must be careful to ensure that the lifetime of the coroutine ends before returning from the function that called alloca. For simple synchronous generator coroutines that you fully use up in the same function, that's easy enough.

The trick is to take advantage of the fact that when you provide a custom operator new on the coroutine's promise type, the compiler tries to pass in all the parameters the coroutine received. You can designate a parameter as a special allocation parameter and use it to communicate with the allocation code in operator new. The compiler provides the required size to both operator new and operator delete, and if operator new returns nullptr then the compiler uses get_return_object_on_allocation_failure to initialize the return value.

Putting this all together, we can make a coroutine promise type that allows invoking the coroutine in a two-step approach, the first time to get the required buffer size, and the second time to actually allocate into the buffer, similar to many C APIs. Here's a basic example which allows the allocation parameter to be nullptr when dynamic allocation is preferred, so you can choose which uses of the coroutine should use alloca or not:

#include <coroutine>
#include <cstddef>
#include <memory>
#include <new>
#include <print>
#include <span>
#include <type_traits>
#include <utility>

#if defined(_WIN32)
#include <malloc.h>
#define alloca _alloca
#else
#include <alloca.h>
#endif

struct AllocationParam
{
    std::span<std::byte> buffer{};
    std::size_t used{};
};

template<typename T>
class Generator
{
public:
    class promise_type;
private:
    using handle_type = std::coroutine_handle<promise_type>;
    handle_type handle{};
public:
    class promise_type
    {
        T* value{};
        friend Generator;
    public:
        constexpr promise_type() noexcept = default;
        promise_type(promise_type const&) = delete;
        promise_type(promise_type&&) = delete;
        promise_type& operator=(promise_type const&) = delete;
        promise_type& operator=(promise_type&&) = delete;

        [[nodiscard]] handle_type get_return_object() noexcept
        {
            return handle_type::from_promise(*this);
        }
        [[nodiscard]] static constexpr handle_type get_return_object_on_allocation_failure() noexcept
        {
            return {};
        }
        [[nodiscard]] static void* operator new(std::size_t const length, auto&&... params) noexcept
        {
            AllocationParam* param{};
            (([&]() noexcept //this lambda is run for each of params to find the one we want
            {
                if constexpr(std::is_same_v<AllocationParam*, std::remove_cvref_t<decltype(params)>>)
                {
                    param = params;
                }
            }()), ...);
            if(param)
            {
                param->used = length + 1;
                if(std::size(param->buffer) >= param->used)
                {
                    param->buffer[length] = std::byte{0}; //un-set this byte to note that we didn't dynamically allocate
                    return std::data(param->buffer);
                }
                return nullptr;
            }
            std::byte* const memory{static_cast<std::byte*>(::operator new(length + 1, std::nothrow))};
            if(memory)
            {
                memory[length] = std::byte{1}; //set this byte to 1 to note that we dynamically allocated
            }
            return memory;
        }
        static void operator delete(void* const ptr, std::size_t const length) noexcept
        {
            std::byte* const memory{static_cast<std::byte*>(ptr)};
            if(memory[length] == std::byte{1})
            { //this memory was dynamically allocated
                return ::operator delete(memory, length + 1);
            }
        }

        [[nodiscard]] constexpr std::suspend_always initial_suspend() const noexcept { return {}; }
        [[nodiscard]] constexpr std::suspend_always final_suspend() const noexcept { return {}; }
        void unhandled_exception() const { throw; }
        [[nodiscard]] constexpr std::suspend_always yield_value(T&& t) noexcept
        {
            value = std::addressof(t);
            return {};
        }
        constexpr void return_void() const noexcept {}
    };

    constexpr Generator() noexcept = default;
    constexpr Generator(handle_type const h) noexcept
    : handle{h}
    {
    }
    Generator(Generator const&) = delete;
    Generator& operator=(Generator const&) = delete;
    constexpr Generator(Generator&& f) noexcept
    {
        using std::swap;
        swap(handle, f.handle);
    }
    constexpr Generator& operator=(Generator&& f) noexcept
    {
        using std::swap;
        swap(handle, f.handle);
        return *this;
    }
    ~Generator() noexcept
    {
        if(handle)
        {
            handle.destroy();
        }
    }

    [[nodiscard]] constexpr operator bool() const noexcept
    {
        return handle && !handle.done();
    }

    [[nodiscard]] T* operator()()
    {
        if(handle && !handle.done())
        {
            handle.resume();
            return std::exchange(handle.promise().value, nullptr);
        }
        return nullptr;
    }
};

[[nodiscard]] Generator<int> example(AllocationParam* const = nullptr)
{
    co_yield 1;
    co_yield 2;
    co_yield 2;
    co_yield 5;
}

int main()
{
    AllocationParam param{};
    Generator<int> g(example(&param));
    if(g)
    {
        std::println("coroutine allocation was elided");
    }
    else
    {
        std::println("coroutine allocation requested {} bytes", param.used);
        void* const memory{alloca(param.used)};
        param.buffer = {static_cast<std::byte*>(memory), param.used};
        param.used = 0;
        g = example(&param);
        if(g && param.used == 0)
        {
            std::println("coroutine allocation was elided after alloca");
        }
        if(!g)
        {
            std::println("coroutine allocation failed, requested {} bytes", param.used);
        }
    }
    while(int* const v{g()})
    {
        std::println("generator yielded: {}", *v);
    }
}

Notice that each time you call the coroutine function (example in this case) the compiler may decide to elide the allocation, or may request a different amount of memory than before. In practice I have not seen this happen, but it's worth being aware of the possibility. In my tests with MSVC, debug builds request 369 bytes and release builds request 81 bytes, so there is noticeable bloat in the coroutine frame size, especially in debug builds, even for a coroutine as simple as this.

Note that alloca is not actually required here; another possibility with this code is to reuse memory across multiple coroutines, as long as their lifetimes do not overlap. You can obtain that memory by any means you like: a static or thread_local buffer, a dynamically allocated buffer that you keep resizing until it no longer receives resize requests, etc.


Coroutine state is allocated with the promise class's operator new if one is defined, and with the global new if not.

Obviously calling alloca inside any operator new implementation would be unhelpful, and there's no other explicit view on the allocation which underlies returning a coroutine handle.

However, you may be interested to read on cppreference that

The call to operator new can be optimized out (even if custom allocator is used) if

  • The lifetime of the coroutine state is strictly nested within the lifetime of the caller, and
  • the size of coroutine frame is known at the call site.

In that case, coroutine state is embedded in the caller's stack frame (if the caller is an ordinary function) or coroutine state (if the caller is a coroutine).

which suggests that a good implementation doesn't need any explicit code or language support to permit the optimization you want.

2 Comments

In practice the requirements are stricter: the compiler has to have access to the entire definition and inline the coroutine. You can see here that even on latest GCC and Clang, adding a noinline attribute is sufficient to defeat HALO. In practice it's rare that your entire stack of calls is all inline. godbolt.org/z/dqxPanv31
Surely having access to the entire definition (and the ability to inline it) is exactly what is required anyway to know the size of the coroutine state. This is stricter only when your visible implementation is barred from inlining.
