For CUDA, I know they are executed asynchronously after issuing the launch commands to the default stream(null stream), so how about that in OpenCL? Sample codes are as follows:
cl_context context;
cl_device_id device_id;
cl_int err;
...
cl_kernel kernel1;
cl_kernel kernel2;
cl_command_queue Q = clCreateCommandQueue(context, device_id, 0, &err);
...
size_t global_w_offset[3] = {0,0,0};
size_t global_w_size[3] = {16,16,1};
size_t local_w_size[3] = {16,16,1};
err = clEnqueueNDRangeKernel(Q, kernel1, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
err = clEnqueueNDRangeKernel(Q, kernel2, 3, global_w_offset, global_w_size,
local_w_size, 0, nullptr, nullptr);
clFinish(Q);
Will kernel1 and kernel2 be executed asynchronously after commands enqueued?(i.e. executions overlap)
Update
According to the OpenCL Reference, It seems set properties as CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue can meet my need. But, Does out_of_order mean asynchronization?