9

Does Kepler have two times or four times the shared memory bandwidth of Fermi?

The Programming Guide states:

Each bank has a bandwidth of 32 bits per two clock cycles

for 2.X, and

Each bank has a bandwidth of 64 bits per clock cycle

for 3.X, so is four times higher bandwidth implied?

2
  • @Tom: I'm quoting from the 5.0 PG. (Indeed, the sentence about 3.X bandwidth has been added relative to the 4.2 PG.) In both cases there are 32 banks. My question is in part due to p81 of this presentation, where they say bandwidth is x2. I do not see any references to different clocks on 2.X and 3.X, and I trust that when "clock cycle" is used, it means the same thing on all compute capabilities (as, e.g., with instruction throughput too). What these clock cycles are in Hz is not relevant to this question. Commented Sep 10, 2012 at 19:27
  • The clock frequency is fundamental since you're talking about bandwidths which are typically measured in bytes/sec, going from bytes/cycle to bytes/sec requires clock frequency. I agree the doc is unclear, and hoping the CUDA 5.0 final release will be improved (the version you have is presumably from the release candidate). Commented Sep 10, 2012 at 21:25

2 Answers

9

On Fermi, each SM has 32 banks delivering 32 bits on every two clock cycles.

On Kepler, each SMX has 32 banks delivering 64 bits on every clock cycle. However since Kepler's SMX was fundamentally redesigned to be energy efficient, and since running fast clocks draws a lot of power, Kepler operates from a much slower core clock. Check out the Inside Kepler talk from GTC, about 8 minutes in, for more information.

So the answer to the question is that Kepler has ~2x, not 4x.

The next version of the documents (CUDA 5.0) should explain this better.


1 Comment

I'm starting to see your point. SPs on 3.X run on the primary GPU clock, while on 2.X they run on the shader clock, which was 2x the primary GPU clock. So on Kepler it is "per primary clock cycle", and on Fermi it was "per two shader clock cycles" (= per one primary clock cycle). Access is therefore equally frequent from the primary GPU clock's perspective, and the 2x bandwidth comes from broader 64-bit words. This is also reflected in "SMX Processing Core Architecture" of the Kepler Whitepaper. Good to learn something about the SP clock rate, then! Thanks!
1

As given in

Programming Guide 4.2:

Shared memory has 16 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per two clock cycles.

Kepler Whitepaper:

The shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

For small load operations, that would make it 4 times higher bandwidth.

7 Comments

@Tom: If the number of banks is the same, and it takes 32b/2cc on Fermi and 64b/1cc on Kepler, it's mathematically 4X. Need more explanation on the logic.
The question is asking to compare Fermi (2.x) and Kepler (3.x). The quote from the Programming guide about 16 banks is actually in the 1.x section. Kepler vs Fermi is 2x.
Don't forget that the Kepler clock is slower to conserve energy (see the video I linked to in my answer).
@Tom: In Programming Guide 4.2, F.5, it hasn't mentioned anything about the bandwidth. Please specify the citation you are quoting from (section number of the guide).
Your confusion arises because the "per two clock cycles" is referring to the "shader clock" in the GeForce paper you referenced. Kepler does not have the 2x clock. For compute (as opposed to graphics) the docs use the 2x clock for 1.x and 2.x devices, but Kepler has eliminated the 2x clock (as described in the video, approx 8 minutes in).
