I am hitting the same problem as this question:
Memcpy from PCIe memory takes more time than memcpy to PCIe memory
Yes, MMIO reads are very slow on x86/x86-64: each read only creates/sends a TLP with a 32- or 64-bit payload. There seems to be no way to increase the MMIO payload size, so the only option is to switch to DMA.
Then I researched PCIe DMA, and ran into the same problem as this question:
DMA transfer to a slave PCI device in Linux
The PCIe spec (MSI/MSI-X) does not seem to define a common register for setting the DMA source/destination address (BAR offset) and transfer length (size); each chip has its own vendor-specific register layout. So I could not get an answer from question 8273568. PCIe is a very popular interface, so: is there any generic way to do DMA, the way a PCIe BAR can be accessed generically via MMIO? Or is there an easy way to find the DMA source/destination configuration offsets within a BAR from the PCIe configuration info?
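As far as I can tell, the closest Linux comes to a device-independent interface is the kernel's dmaengine framework, but that drives the platform's general-purpose DMA controllers (e.g. Intel I/OAT), not an arbitrary PCIe card's on-chip engine; a card's own DMA engine always needs its vendor-specific BAR registers. A kernel-side sketch of a dmaengine memcpy (untested, kernel-module context only, most error handling trimmed):

```
#include <linux/dmaengine.h>

/* Sketch: copy 'len' bytes between two addresses already DMA-mapped
 * (e.g. via dma_map_single()), using any channel that advertises
 * DMA_MEMCPY. Real code must check every return value and should use
 * a completion callback instead of polling. */
static int demo_dma_memcpy(dma_addr_t dst, dma_addr_t src, size_t len)
{
    dma_cap_mask_t mask;
    struct dma_chan *chan;
    struct dma_async_tx_descriptor *tx;
    dma_cookie_t cookie;

    dma_cap_zero(mask);
    dma_cap_set(DMA_MEMCPY, mask);
    chan = dma_request_channel(mask, NULL, NULL);
    if (!chan)
        return -ENODEV; /* no memcpy-capable engine registered */

    tx = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_PREP_INTERRUPT);
    if (!tx) {
        dma_release_channel(chan);
        return -EIO;
    }

    cookie = dmaengine_submit(tx);
    dma_async_issue_pending(chan);

    /* Busy-poll for completion; fine for a sketch only. */
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
        cpu_relax();

    dma_release_channel(chan);
    return 0;
}
```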
PS: I tried to use the Linux kernel's async_memcpy to do the DMA memcpy, but it doesn't seem to work. Tracing the code, the async_tx_find_channel (dma_find_channel) function just returns a NULL pointer on my CPU(s), so it falls back to the plain memcpy. Did I miss some hardware/software setting?
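For what it's worth, dma_find_channel() returns NULL exactly when no dmaengine driver has registered a memcpy-capable channel, and the async_tx API then silently falls back to plain memcpy. A quick sanity check is whether any channels exist at all (ioatdma is Intel's I/OAT driver, CONFIG_INTEL_IOATDMA; it needs the matching hardware):

```shell
# Registered dmaengine channels appear under /sys/class/dma/.
# Empty or missing means no DMA-engine driver is bound, which is
# exactly the case where dma_find_channel() returns NULL.
ls /sys/class/dma/ 2>/dev/null || echo "no dmaengine channels registered"

# On Intel platforms with I/OAT hardware, try:
#   sudo modprobe ioatdma
```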