9

Does Kepler have two times or four times the shared memory bandwidth of Fermi?

The Programming Guide states:

Each bank has a bandwidth of 32 bits per two clock cycles

for 2.X, and

Each bank has a bandwidth of 64 bits per clock cycle

for 3.X, so is four times higher bandwidth implied?

2
  • @Tom: I'm quoting from the 5.0 PG. (Indeed, the sentence about 3.X bandwidth has been added relative to the 4.2 PG.) In both cases there are 32 banks. My question is in part due to p81 of this presentation, where they say bandwidth is x2. I do not see any references to different clocks on 2.X and 3.X, and I trust that when "clock cycle" is used, it means the same thing on all compute capabilities (as, e.g., with instruction throughput too). What these clock cycles are in Hz is not relevant to this question. Commented Sep 10, 2012 at 19:27
  • The clock frequency is fundamental since you're talking about bandwidths which are typically measured in bytes/sec, going from bytes/cycle to bytes/sec requires clock frequency. I agree the doc is unclear, and hoping the CUDA 5.0 final release will be improved (the version you have is presumably from the release candidate). Commented Sep 10, 2012 at 21:25

2 Answers

9

On Fermi, each SM has 32 banks delivering 32 bits on every two clock cycles.

On Kepler, each SMX has 32 banks delivering 64 bits on every clock cycle. However since Kepler's SMX was fundamentally redesigned to be energy efficient, and since running fast clocks draws a lot of power, Kepler operates from a much slower core clock. Check out the Inside Kepler talk from GTC, about 8 minutes in, for more information.

So the answer to the question is that Kepler has ~2x, not 4x.

The next version of the documents (CUDA 5.0) should explain this better.


1 Comment

I'm starting to see your point. SPs on 3.X run on the primary GPU clock, while on 2.X they run on the shader clock, which was 2x the primary GPU clock. So on Kepler it is "per primary clock cycle", and on Fermi it was "per two shader clock cycles" (= per one primary clock cycle). Access is therefore equally frequent from the primary GPU clock's perspective, and the 2x bandwidth comes from broader 64-bit words. This is also reflected in "SMX Processing Core Architecture" of the Kepler Whitepaper. Good to learn something about the SP clock rate, then! Thanks!
1

As given in

Programming Guide 4.2:

Shared memory has 16 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles.

Kepler Whitepaper:

The shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

For small load operations, that would make it 4 times higher bandwidth.

7 Comments

@Tom: If the number of banks is the same, and it takes 32b/2cc on Fermi and 64b/1cc on Kepler, it's mathematically 4X. Need more explanation on the logic.
The question is asking to compare Fermi (2.x) and Kepler (3.x). The quote from the Programming guide about 16 banks is actually in the 1.x section. Kepler vs Fermi is 2x.
Don't forget that the Kepler clock is slower to conserve energy (see the video I linked to in my answer).
@Tom: In Programming Guide 4.2, F.5, it hasn't mentioned anything about the bandwidth. Please specify the citation you are quoting from (section number of the guide).
Your confusion arises because the "per two clock cycles" is referring to the "shader clock" in the GeForce paper you referenced. Kepler does not have the 2x clock. For compute (as opposed to graphics) the docs use the 2x clock for 1.x and 2.x devices, but Kepler has eliminated the 2x clock (as described in the video, approx 8 minutes in).
