Does Kepler have two times or four times the shared memory bandwidth of Fermi?
The Programming Guide states, for compute capability 2.x:
"Each bank has a bandwidth of 32 bits per two clock cycles"
and for compute capability 3.x:
"Each bank has a bandwidth of 64 bits per clock cycle"
Does this imply that the bandwidth is four times higher?
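
To make the arithmetic behind my question explicit, here is a minimal sketch that takes the two quoted figures at face value. It assumes that "clock cycle" means the same kind of clock in both statements, which the quoted text does not say, and the variable names are only illustrative.

```python
from fractions import Fraction

# Figures quoted from the Programming Guide, expressed as bits per clock cycle per bank.
# Assumption: both statements count the same kind of clock cycle.
fermi_bits_per_clock = Fraction(32, 2)   # 2.x: 32 bits per two clock cycles
kepler_bits_per_clock = Fraction(64, 1)  # 3.x: 64 bits per clock cycle

# Per-bank, per-clock ratio implied by the two quotes.
ratio = kepler_bits_per_clock / fermi_bits_per_clock

print(f"2.x: {fermi_bits_per_clock} bits per clock per bank")
print(f"3.x: {kepler_bits_per_clock} bits per clock per bank")
print(f"ratio: {ratio}x")
```

Read this way the ratio comes out as 4 per bank per clock cycle. It would only translate into four times the bandwidth in bytes per second if the clocks being counted run at comparable frequencies, which is part of what I am unsure about.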