
I am writing a driver for the Intel Arc GPU series (I test on an A750) for my own operating system.
I can already execute compute kernels that use bindless parameters. However, as soon as I execute a kernel that accesses a bound untyped surface, the GPU hangs, probably because of a fault. The surface has surface type BUFFER, surface format RAW, and surface pitch 0 (1 byte); the accessing send instruction is an untyped surface read with x.
I have already checked the following things:

  • When I don't map the surface in the PPGTT, I get a page fault. When I map it, I don't get one, which suggests that the binding table and at least part of the surface state are working.
  • The hang occurs even if I set all offsets in the send instruction's payload to 0 and set the surface size to 0xFFFFFFFF, so I don't think the surface bounds check is failing.
  • All addresses are 64KB aligned (for now), so alignment issues should not be a problem.
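
The alignment claim in the last point can be verified mechanically before each submission. The following is only a minimal sketch, not code from the driver in question: the helper names `is_aligned_64k` and `first_misaligned_64k` are hypothetical, and it assumes every GPU address of interest (surface base, binding table, and so on) is available as a 64-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64 KiB, the alignment the question says is used for all addresses. */
#define GPU_ALIGN_64K UINT64_C(0x10000)

/* Hypothetical helper: true if addr is a multiple of 64 KiB. */
static bool is_aligned_64k(uint64_t addr)
{
    return (addr & (GPU_ALIGN_64K - 1)) == 0;
}

/*
 * Hypothetical helper: returns the index of the first address that is
 * not 64 KiB aligned, or count if every address is aligned.
 */
static size_t first_misaligned_64k(const uint64_t *addrs, size_t count)
{
    assert(addrs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!is_aligned_64k(addrs[i]))
            return i;
    }
    return count;
}
```

Collecting the addresses into one array and calling `first_misaligned_64k` right before the kernel is dispatched turns the assumption into a checked invariant, which helps rule alignment out for good rather than "for now".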

I hope somebody has an idea what could be causing the hang, because I am running out of ideas on what to test.

  • Are you sure you haven't already answered your own question: "the GPU hangs (probably because of a fault)"? Commented Nov 26 at 1:03
  • Yeah, but I don't know which fault it could possibly be, because I've already ruled out the only three faults that can occur according to the PRM. And even if I did know which fault it is, I still wouldn't know why it occurs. Hence my question. Commented Nov 26 at 1:25
  • I haven't written GPU drivers myself, but I'd guess there's a 99% chance there's a bug in some code you're not showing, rather than some big-picture conceptual misunderstanding of how to talk to ARC GPUs. So it's very likely this question isn't answerable without a minimal reproducible example, and the best someone familiar with the hardware could say is that what you're doing should work. Maybe they could guess at how your driver works and point to some common gotchas, but this abstract debugging question doesn't seem like a good fit for SO. A link to a git repo wouldn't be sufficient, but would be better than nil. Commented Nov 26 at 3:00
  • Can you run your OS in a VM that has direct access to that PCIe device, so you can debug it? Or it's just the GPU hanging, not the CPU, so your driver can detect that and reset the GPU, but can't get any other information from it about what it didn't like. Commented Nov 26 at 3:18
  • Please add the relevant code for that part of your driver to the question; even someone who 'knows Intel CPUs' will be hard pressed to give a concrete answer without actually seeing your code. Commented Nov 26 at 17:09
