Here's one way to do it.
Step #1 Get all combinations corresponding to all indices being calculated with np.array(it.multi_index) in the code. On this, one can leverage product from itertools.
Step #2 Perform the L2 norm calculations across all combinations in a vectorized manner.
Step #3 Finally do C(x)=(exp(-|x|^2) in an elementwise manner.
# Get combinations using itertools.product
combs = np.array(list(product(range(N), repeat=4)))
# Perform L2 norm and elementwise exponential calculations to get final o/p
out = np.exp(-np.sqrt((combs**2).sum(1))**2).reshape(N,N,N,N)
Runtime tests and verify output -
In [42]: def vectorized_app(N):
...: combs = np.array(list(product(range(N), repeat=4)))
...: return np.exp(-np.sqrt((combs**2).sum(1))**2).reshape(N,N,N,N)
...:
...: def original_app(N):
...: C=np.zeros([N,N,N,N])
...: it=np.nditer(C, flags=['multi_index'], op_flags=['readwrite'])
...: while not it.finished:
...: diff_n=np.linalg.norm(np.array(it.multi_index))
...: it[0]=np.exp(-diff_n**2)
...: it.iternext()
...: return C
...:
In [43]: N = 10
In [44]: %timeit original_app(N)
1 loops, best of 3: 288 ms per loop
In [45]: %timeit vectorized_app(N)
100 loops, best of 3: 8.63 ms per loop
In [46]: np.allclose(vectorized_app(N),original_app(N))
Out[46]: True
nditertutorial page. At the end it has an example of usingnditerin Cython code. That is fast. But in Python code it isn't faster than other iteration methods. It's better to avoid iteration entirely.