
I ran into the same problem as this question:
Memcpy from PCIe memory takes more time than memcpy to PCIe memory

Yes, MMIO reads are very slow on x86/x86-64, because each read generates a TLP with only a 32- or 64-bit payload. There seems to be no way to increase the MMIO read payload size; the only option appears to be switching to DMA.

So then I researched PCIe DMA, and ran into the same problem as this question:
DMA transfer to a slave PCI device in Linux

The PCIe spec (MSI/MSI-X) doesn't seem to define a common register for setting the source/destination address (BAR offset) and transfer length (size); each chip has its own register offsets. So question 8273568 didn't give me an answer. PCIe is a very popular interface, so: is there a generic way to perform DMA, analogous to how a PCIe BAR can be accessed via MMIO? Or is there an easy way to find a device's DMA source/destination configuration offsets within its BARs from the PCIe information?

PS: I tried to use the Linux kernel's async_memcpy to do a DMA memcpy, but it doesn't seem to work. Tracing the code, the async_tx_find_channel (dma_find_channel) function just returns a NULL pointer on my CPU(s), and execution falls back to the plain memcpy. Did I miss some hardware/software setup?

  • You can do 128- and IIRC 256-bit IO with vector instructions. Commented Dec 29, 2024 at 9:10
  • But it seems Intel's MMIO tables only show 4/8-byte fills on MMIO reads. Do you mean there is still a way to increase the payload size? Commented Dec 30, 2024 at 3:04
  • Start from here stackoverflow.com/questions/74082159/…? Or here stackoverflow.com/questions/51918804/…? Commented Dec 30, 2024 at 7:26
  • Thank you very much. I used AVX-512 (%%zmm) with a kernel asm (vmovdqa32) macro, which improved MMIO read performance from 20 MB/s to 40 MB/s. But that still doesn't feel like enough; my VGA card is PCIe Gen4 x8, so I keep studying how DMA works. PS: 128-bit (xmm) gives only 10 MB/s; 256-bit (ymm) and memcpy both give 20 MB/s; 512-bit gives 40 MB/s. PS2: if I use VMOVNTDQA, I get over 100 MB/s, but when I change the PCIe generation, the performance doesn't change. Commented Jan 2 at 3:04

1 Answer


It sounds like you want to asynchronously copy between device A's memory and system memory using the DMA capabilities of device B (perhaps the PCIe root complex). I don't know if that is even possible, and address mapping will in general be non-trivial due to the IOMMU (see the second link below). The async_memcpy module supports Linux soft-RAID and copies system memory to system memory.

PCI BARs are located at fixed offsets (0x10-0x24) in PCI configuration space, and there is a standard procedure for finding the size of the memory regions they describe (read the initial value, write all ones, read back, compare). This is what allows the BIOS and/or OS to enumerate all devices without any device-specific knowledge and to map any I/O memory they provide into the common address space. The Linux kernel reads and/or performs this mapping at startup; /proc/iomem lists it in address order and lspci -v prints it per device.

A given PCI device may or may not be capable of issuing DMA requests itself to copy data between system memory and its internal memory; whether it can, and how it is programmed, is device-specific, and drivers take care of it. There is apparently a kernel-wide list of DMA devices (struct dma_device) and per-device DMA channels built by individual devices' drivers, which provides a generic interface to devices' DMA capabilities, so you may want to check whether your device's driver supports this.

You may find this question and answer and this description of Linux DMA API useful.


8 Comments

If a PCI device can be a bus master, it has to support DMA. For example, any device that supports MSI/MSI-X implicitly supports DMA.
Sure, I assume that's what struct dma_device is about. But the way I understood the question is ${first paragraph of my answer} and that's a bit different, isn't it?
No, the mentioned structure is for GPDMA, while here we have PCIe DMA. For the latter, the driver is the (host) bridge one in drivers/pci/controller.
Sorry for not being clear. Let me split the async_memcpy part into two questions, referring to the kernel's Documentation/crypto/async-tx-api.rst. (1) Does this function require Intel XScale hardware, or x86 with an I/OAT processor? It fails on both of my machines: an Intel Xeon Gold 6526Y and an AMD Ryzen 5 7600. The feature obviously doesn't exist on AMD, but the Intel Xeon should have it; I suspect this is an old Intel feature. (2) If (1) is not the case, I should be able to get a DMA channel as the first step, but printk shows *chan is NULL.
@AntonTykhyy, correct. This is different case. The problem here is that the HW may not have a programming interface for that DMA as it’s completely private to the device in question, and the question is not clear about that at all. So I vote to close it.
