I want to quantitatively measure the memory bandwidth utilization and SM utilization of a CUDA program for performance analysis and regression testing.
My approach so far:
- Compute the theoretical memory bandwidth:
  BW_theoretical = mem_clock(Hz) * bus_width(bit) / 8 * 2
- Inside the program, calculate the actual bandwidth as (bytes read + bytes written) / elapsed time.
- Use NVIDIA’s monitoring tool DCGM externally to observe memory bandwidth and utilization during the same program run, then compare the two results.
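To make the first two steps concrete, here is a minimal host-side sketch of both calculations. The function names are my own (hypothetical), the factor of 2 assumes double-data-rate memory as in the formula above, and the byte counts are whatever the program itself believes it read and wrote, which may not equal the traffic that actually reaches DRAM.

```cpp
#include <cassert>

// Theoretical peak DRAM bandwidth in bytes/second.
// transfers_per_clock = 2 assumes double-data-rate memory (same assumption
// as the "* 2" in the formula above).
double theoretical_bandwidth_bytes_per_sec(double mem_clock_hz,
                                           double bus_width_bits,
                                           double transfers_per_clock = 2.0) {
    assert(mem_clock_hz > 0.0 && bus_width_bits > 0.0);
    return mem_clock_hz * (bus_width_bits / 8.0) * transfers_per_clock;
}

// Bandwidth as seen by the program: bytes it requested divided by the
// elapsed time (for example, a time measured with CUDA events and
// converted to seconds).
double achieved_bandwidth_bytes_per_sec(double bytes_read,
                                        double bytes_written,
                                        double elapsed_seconds) {
    assert(elapsed_seconds > 0.0);
    return (bytes_read + bytes_written) / elapsed_seconds;
}
```

As an illustration with made-up inputs, a 1215 MHz memory clock and a 5120-bit bus give 1.5552e12 bytes/s (about 1555 GB/s) from the first function.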
I expect the ratio [bandwidth from program / BW_theoretical] to be close to the DCGM_FI_PROF_DRAM_ACTIVE value reported by DCGM.
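The comparison I have in mind can be sketched as follows. The helper name is hypothetical, and I am assuming DCGM_FI_PROF_DRAM_ACTIVE is read as a fraction between 0 and 1; whether the two numbers are actually comparable is exactly what I am asking below.

```cpp
#include <cassert>

// Fraction of theoretical peak bandwidth that the program achieved.
// Both arguments must be in the same unit (e.g. bytes/second).
double bandwidth_utilization(double achieved_bw, double theoretical_bw) {
    assert(theoretical_bw > 0.0);
    return achieved_bw / theoretical_bw;
}

// Absolute gap between the in-program utilization and the value sampled
// from DCGM (assumed here to be a 0..1 fraction).
double utilization_gap(double program_utilization, double dcgm_dram_active) {
    double diff = program_utilization - dcgm_dram_active;
    return diff < 0.0 ? -diff : diff;
}
```

For example, an achieved 1.2e12 bytes/s against a theoretical 1.5552e12 bytes/s gives a utilization of about 0.77, which is the number I would then compare against the DCGM reading.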
Problem
I am using the DCGM metric DCGM_FI_PROF_DRAM_ACTIVE. But I observe that:
- The bandwidth measured inside the program (bytes/time) differs a lot from the value reported by DCGM.
My questions
Does DCGM_FI_PROF_DRAM_ACTIVE really represent memory bandwidth utilization? Or does it only indicate the percentage of cycles the DRAM is active (not equivalent to throughput)?
If I want to obtain a bytes/sec throughput that can be compared directly with my in-program measurement, which DCGM metrics should I use instead? Or which other tools could I use?