Your hypothesis "we can see when running frequently, the code becomes much faster" is not quite correct.
This loop
#just to create a time interval, seems this disturb cpu cache?
for i in range(1000000):
    pass
isn't merely a simple time delay. In Python 2 it creates a list of 1,000,000 integer objects, loops over it, and then destroys it. In Python 3, range() returns a lazy sequence object rather than a list, so it's much less wasteful of memory; the Python 2 equivalent is xrange().
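You can see the difference directly in the interpreter. This is a minimal sketch for Python 2; the exact byte counts vary between builds and platforms:

import sys

r = range(1000000)     # a real list holding 1,000,000 int objects
x = xrange(1000000)    # a small lazy object that produces values on demand
print type(r), sys.getsizeof(r)    # the list's own size (its pointer array), several MB
print type(x), sys.getsizeof(x)    # just a few dozen bytes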
In the standard CPython implementation, a list is an array of pointers to the objects it contains. On a 32-bit system, an integer object occupies 12 bytes, an empty list object occupies 32 bytes, and of course a pointer occupies 4 bytes, so range(1000000) consumes a total of 32 + 1000000 * (4 + 12) = 16000032 bytes.
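That total is just the arithmetic spelled out; here's a quick sketch using the 32-bit CPython 2 sizes quoted above (the numbers differ on other builds):

list_header = 32      # empty list object
pointer = 4           # one slot in the list's pointer array
int_object = 12       # one integer object
print list_header + 1000000 * (pointer + int_object)    # 16000032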
Once the loop finishes, the reference count on the range() list drops to zero, so CPython deallocates it and the million integer objects it contains. Freeing that many objects takes a noticeable amount of time, so it's not surprising that you notice a delay after the loop has finished. Also, creating and deallocating all those objects tends to leave the heap somewhat fragmented, and that can affect the efficiency of subsequent code.
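If you want to see the deallocation cost in isolation, one rough way (Python 2 again; absolute numbers depend on the machine) is to time an explicit del of a similar list:

import time

big = range(1000000)      # build the list and its million int objects
st = time.time()
del big                   # refcount drops to zero, so CPython frees the list
                          # and its contents right here
print (time.time() - st) * 1e6, "microseconds to deallocate"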
FWIW, on my 2GHz Pentium 4 Linux system, your 1st code block prints values in the range 38 - 41µs, but if I change the loop to use xrange() then the time drops to 31 - 34µs.
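If you want to reproduce that comparison on your own machine, a sketch along these lines works (measure is just a name I've made up here, and the absolute numbers depend entirely on the hardware):

import time

def measure(loop_factory):
    a = range(500)
    sum(a)
    for i in loop_factory(1000000):    # pass in range or xrange
        pass
    st = time.time()
    sum(a)
    return (time.time() - st) * 1e6

print measure(range)     # the loop builds and then frees a million ints
print measure(xrange)    # the loop allocates almost nothing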
If the code is modified to prevent the range() object from being deallocated until after sum(a) is calculated, then it makes little difference in the timing whether we use range() or xrange():
import time

a = range(500)
sum(a)

# Try both range() and xrange() here; with the reference kept alive
# below, the timing is much the same either way
bigrange = xrange(1000000)

for i in bigrange:
    pass

st = time.time()
sum(a)
print (time.time() - st) * 1e6

# Maintain a reference to bigrange to prevent it from
# being garbage collected earlier
print sum(bigrange)