
I bet I am doing something very simple wrong. I want to start with an empty 2D numpy array and append arrays to it (with dimensions 1 row by 4 columns).

open_cost_mat_train = np.matrix([])

for i in range(10):
    open_cost_mat = np.array([i,0,0,0])
    open_cost_mat_train = np.vstack([open_cost_mat_train,open_cost_mat])

My error trace is:

  File "/Users/me/anaconda/lib/python2.7/site-packages/numpy/core/shape_base.py", line 230, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
ValueError: all the input array dimensions except for the concatenation axis must match exactly

What am I doing wrong? I have tried append, concatenate, defining the empty 2D array as [[]], as [], array([]) and many others.

It is better to construct a list of arrays and apply vstack just once. Repeated concatenation is slow. (Commented Jul 26, 2016 at 21:02)

2 Answers


You need to reshape your original matrix so that its number of columns matches that of the appended arrays:

open_cost_mat_train = np.matrix([]).reshape((0,4))

With that change, the loop runs without error and the result is:

open_cost_mat_train

# matrix([[ 0.,  0.,  0.,  0.],
#         [ 1.,  0.,  0.,  0.],
#         [ 2.,  0.,  0.,  0.],
#         [ 3.,  0.,  0.,  0.],
#         [ 4.,  0.,  0.,  0.],
#         [ 5.,  0.,  0.,  0.],
#         [ 6.,  0.,  0.,  0.],
#         [ 7.,  0.,  0.,  0.],
#         [ 8.,  0.,  0.,  0.],
#         [ 9.,  0.,  0.,  0.]])
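
To see why the reshape is needed, it may help to inspect the shapes involved. The following is only a sketch; the names empty_mat, reshaped, row and stacked are made up for illustration and do not appear in the question.

```python
import numpy as np

# A matrix is always 2D, so an "empty" one has one row and zero columns.
empty_mat = np.matrix([])
print(empty_mat.shape)   # (1, 0)

# vstack promotes each 1D input to 2D, so a 4-element array becomes 1 x 4.
row = np.atleast_2d(np.array([1, 0, 0, 0]))
print(row.shape)         # (1, 4)

# 0 columns vs. 4 columns cannot be stacked; 0 rows x 4 columns can.
reshaped = empty_mat.reshape((0, 4))
print(reshaped.shape)    # (0, 4)

stacked = np.vstack([reshaped, row])
print(stacked.shape)     # (1, 4)
```

The column counts are what vstack compares, which is why a (1, 0) starting matrix fails while a (0, 4) one works.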


If open_cost_mat_train is large, I would encourage you to replace the for loop with a vectorized algorithm. I will use the following functions to show how vectorizing loops improves efficiency:

def fvstack():
    import numpy as np
    np.random.seed(100)
    ocmt = np.matrix([]).reshape((0, 4))
    for i in range(10):
        x = np.random.random()
        ocm = np.array([x, x + 1, 10*x, x/10])
        ocmt = np.vstack([ocmt, ocm])
    return ocmt

def fshape():
    import numpy as np
    from numpy.matlib import empty
    np.random.seed(100)
    ocmt = empty((10, 4))
    for i in range(ocmt.shape[0]):
        ocmt[i, 0] = np.random.random()
    ocmt[:, 1] = ocmt[:, 0] + 1
    ocmt[:, 2] = 10*ocmt[:, 0]
    ocmt[:, 3] = ocmt[:, 0]/10
    return ocmt

I've assumed that the values in the first column of ocmt (shorthand for open_cost_mat_train) are obtained from a for loop, and that the remaining columns are functions of the first column, as stated in your comments on my original answer. Since real cost data are not available, the values in the first column of the following example are random numbers, and the second, third and fourth columns are the functions x + 1, 10*x and x/10, respectively, where x is the corresponding value in the first column.

In [594]: fvstack()
Out[594]: 
matrix([[  5.43404942e-01,   1.54340494e+00,   5.43404942e+00,   5.43404942e-02],
        [  2.78369385e-01,   1.27836939e+00,   2.78369385e+00,   2.78369385e-02],
        [  4.24517591e-01,   1.42451759e+00,   4.24517591e+00,   4.24517591e-02],
        [  8.44776132e-01,   1.84477613e+00,   8.44776132e+00,   8.44776132e-02],
        [  4.71885619e-03,   1.00471886e+00,   4.71885619e-02,   4.71885619e-04],
        [  1.21569121e-01,   1.12156912e+00,   1.21569121e+00,   1.21569121e-02],
        [  6.70749085e-01,   1.67074908e+00,   6.70749085e+00,   6.70749085e-02],
        [  8.25852755e-01,   1.82585276e+00,   8.25852755e+00,   8.25852755e-02],
        [  1.36706590e-01,   1.13670659e+00,   1.36706590e+00,   1.36706590e-02],
        [  5.75093329e-01,   1.57509333e+00,   5.75093329e+00,   5.75093329e-02]])

In [595]: np.allclose(fvstack(), fshape())
Out[595]: True

For the calls to fvstack() and fshape() to produce the same results, the random number generator is initialized in both functions through np.random.seed(100). Notice that the equality test has been performed using numpy.allclose instead of fvstack() == fshape() to avoid the round-off errors associated with floating point arithmetic.
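
As a small illustration of why allclose is preferred over == for floats, here is a sketch with two values that are mathematically equal but differ in their last bits (the names a and b are just for this example):

```python
import numpy as np

a = np.array([0.1 + 0.2])   # stored as 0.30000000000000004
b = np.array([0.3])

# Exact comparison is tripped up by the round-off in the last digit.
print(a == b)               # [False]

# allclose compares within a tolerance (rtol=1e-05, atol=1e-08 by default).
print(np.allclose(a, b))    # True
```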

As for efficiency, the following interactive session shows that initializing ocmt with its final shape is significantly faster than repeatedly stacking rows:

In [596]: import timeit

In [597]: timeit.timeit('fvstack()', setup="from __main__ import fvstack", number=10000)
Out[597]: 1.4884241055042366

In [598]: timeit.timeit('fshape()', setup="from __main__ import fshape", number=10000)
Out[598]: 0.8819408006311278
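
Two further variants may be worth timing against the functions above: collecting the rows in a list and calling np.vstack only once, and dropping the Python loop altogether. This is a minimal sketch, assuming the first-column values can be produced as a single array; stack_once and fully_vectorized are hypothetical names, and no timings are claimed for them here.

```python
import numpy as np

def stack_once(n=10, seed=100):
    """Collect rows in a Python list, then call vstack a single time."""
    rng = np.random.RandomState(seed)
    rows = []
    for _ in range(n):
        x = rng.random_sample()
        rows.append(np.array([x, x + 1, 10 * x, x / 10]))
    return np.vstack(rows)

def fully_vectorized(n=10, seed=100):
    """Build the first column at once and derive the others from it."""
    rng = np.random.RandomState(seed)
    x = rng.random_sample(n)   # assumption: first column available as an array
    return np.column_stack([x, x + 1, 10 * x, x / 10])

# Both variants should agree up to floating point tolerance.
print(np.allclose(stack_once(), fully_vectorized()))
```

Both functions return a plain ndarray rather than a matrix; wrap the result in np.asmatrix if matrix semantics are required.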

3 Comments

I gave the arange(n) as an example, but in reality the matrix will be obtaining values from a for loop which obtains data of real "costs" in a cost-sensitive classifier.
What happens if the zero columns are some function of the first column? Will this method still speed things up?
Yes, it will. I have edited my answer again to show how vectorized code improves speed in your application.
