Main Problem:

NumPy arrays of the same dtype and size are not being stacked column-wise by np.hstack, np.column_stack, or np.concatenate(axis=1).

Explanation:

I don't understand which property of a NumPy array could make numpy.hstack, numpy.column_stack, and numpy.concatenate(axis=1) misbehave. My real program will not stack by column; it only appends rows. Is there some array property that would cause this? No error is raised; the result is simply not the expected column stack.

I have tried a simple case which works as I would expect it to:

input:
a = np.array([['1', '2'], ['3', '4']], dtype=object)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
np.hstack((a, b))
output: 
np.array([['1', '2', '5', '6'], ['3', '4', '7', '8']], dtype=object)

That's perfectly fine by me, and what I want.

However, what I get from my program is this:

First array:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
 ..., ['908.791', '-0.015765'] ['908.073', '-0.0154842'] []]

Second array (to be added on in columns):
[['29.8989', '26.8556'] ['29.8659', '26.7969'] ['29.902', '29.0183'] ...,
 ['908.791', '943.621'] ['908.073', '940.529'] []]

What should be the two arrays side by side or in columns:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
 ..., ['908.791', '943.621'] ['908.073', '940.529'] []]

Clearly, this isn't the right answer.

The module creating this problem is rather long (I will give it at the bottom), but here is a simplification of it which still works (performs the right column stacking) like the first example:

import numpy as np

def contiguous_regions(condition):
    d = np.diff(condition)
    idx, = d.nonzero() 
    idx += 1
    if condition[0]:
        idx = np.r_[0, idx]
    if condition[-1]:
        idx = np.r_[idx, condition.size]
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings','here'], ['5', '6'], ['7', '8']], dtype=object)
where_number = np.array(map(is_number, total_array))
contig_ixs = contiguous_regions(where_number)
print contig_ixs
t = tuple(total_array[s[0]:s[1]] for s in contig_ixs)
print t
print np.hstack(t)

It basically looks through an array of lists and finds the longest runs of contiguous numeric rows. I would like to column stack those runs if they are the same length.
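For readers on Python 3, the idea described above can be sketched roughly as follows. This is only an illustration, not the module from the question: the helper names `is_numeric_row` and `contiguous_true_runs` are hypothetical, empty rows are assumed to count as non-numeric, and each block is converted to float before stacking.

```python
import numpy as np

def is_numeric_row(row):
    """Return True if the row is non-empty and every item parses as a float."""
    if not row:
        return False  # assumption: empty rows are treated as non-numeric
    try:
        for item in row:
            float(item)
        return True
    except ValueError:
        return False

def contiguous_true_runs(mask):
    """Return an (n, 2) array of [start, stop) indices for each run of True."""
    mask = np.asarray(mask, dtype=bool)
    # Pad with False on both sides so every run has a start and a stop edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return edges.reshape(-1, 2)

rows = [['1', '2'], ['3', '4'], ['strings', 'here'], ['5', '6'], ['7', '8']]
mask = [is_numeric_row(r) for r in rows]
runs = contiguous_true_runs(mask)

# Keep every run that ties for the maximum length.
lengths = runs[:, 1] - runs[:, 0]
longest = runs[lengths == lengths.max()]

# Convert each block to a 2-D float array, then stack the blocks side by side.
blocks = [np.array(rows[start:stop], dtype=float) for start, stop in longest]
stacked = np.hstack(blocks)
print(stacked)
```

Slicing the plain Python list and converting each slice separately means every block is a genuine 2-D array before `np.hstack` sees it.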

Here is the real module providing the problem:

import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = np.array(map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data))

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, column_stacked_data_chain))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, file_data))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths (number of rows) for each set of data in the file
        data_lengths = contig[:,1] - contig[:,0]
        # Get the maximum length of data (max number of contiguous rows) in the file
        maxs = np.amax(data_lengths)
        # Find the indices for where this long list of data is (index within the indices array of the file)
        # If there are two equally long lists of data, get both indices 
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])

###############################################################################################
###############################################################################################
# PROBLEM ORIGINATES HERE
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    # The file data with this longest contiguous chain of numbers
    # If there are multiple sets of data of the same length, they are added in columns
    longest_data_chains = tuple([file_data[i[0]:i[1]] for i in ss])
    print "First array:"
    print longest_data_chains[0]
    print 
    print "Second array (to be added on in columns):"
    print longest_data_chains[1]
    column_stacked_data_chain = np.concatenate(longest_data_chains, axis=1)

    print
    print "What should be the two arrays side by side or in columns:"
    print column_stacked_data_chain

###############################################################################################
###############################################################################################

    xy = np.array(zip(*xy_array), dtype=float)
    return xy

#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

UPDATE: I got it to work with the help of @hpaulj. Having the data look like np.array([['1','2'],['3','4']]) in both cases was not enough: in the real case the array had dtype=object and some of the lists contained strings, so NumPy saw a 1-D array of list objects instead of the 2-D array that column stacking requires.

The fix was to apply map(float, data) to every list produced from the readlines output.
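A small sketch of why the conversion helps, written for Python 3 and using made-up sample data (the `lines` list is an assumption standing in for the parsed file). `dtype=object` is passed explicitly because recent NumPy releases refuse to build ragged arrays implicitly.

```python
import numpy as np

# Hypothetical parsed file: a header row, two data blocks, and an empty line.
lines = [['x', 'y'], ['1', '2'], ['3', '4'], [], ['5', '6'], ['7', '8']]

# Rows of unequal length give a 1-D object array whose items are Python lists.
file_data = np.array(lines, dtype=object)

chunk_a = file_data[1:3]   # still 1-D: shape (2,)
chunk_b = file_data[4:6]   # still 1-D: shape (2,)
appended = np.hstack((chunk_a, chunk_b))   # 1-D inputs are joined end to end

# Casting every value to float builds real 2-D arrays that stack by column.
fixed_a = np.array([[float(v) for v in row] for row in chunk_a])
fixed_b = np.array([[float(v) for v in row] for row in chunk_b])
side_by_side = np.hstack((fixed_a, fixed_b))

print(appended.shape)       # (4,)
print(side_by_side.shape)   # (2, 4)
```

The slices look two-dimensional when printed, but only the converted arrays report `ndim == 2`.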

Here is what I ended up with:

import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()

    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = map(lambda line: line.rstrip('\n').replace(',',' ').split(), file_data)

    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))

    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, xy_array))

    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths
        data_lengths = contig[:,1] - contig[:,0]
        # All maximums in contiguous data
        maxs = np.amax(data_lengths)
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]

    print ss
    # The file data with this longest contiguous chain of numbers
    # Each value in the contiguous data is cast to float, and the result is made into a numpy array
    longest_data_chains = np.array([[map(float, n) for n in xy_array[i[0]:i[1]]] for i in ss])

    # If there are multiple sets of data of the same length, they are added in columns
    column_stacked_data_chain = np.hstack(longest_data_chains)

    xy = np.array(zip(*column_stacked_data_chain), dtype=float)
    return xy

#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""

    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero() 

    # We need to start things after the change in "condition". Therefore, 
    # we'll shift the index by 1 to the right.
    idx += 1

    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]

    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size] # Edit

    # Reshape the result into two columns
    idx.shape = (-1,2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

This function now takes a file and returns the longest contiguous block of numeric data found in it. If several blocks of the same length are found, it column stacks them.

  • What's the shape of each of the items in tuple t? Commented Feb 21, 2014 at 20:54
  • @hpaulj It's going to depend on the file which is input into it of course, but the input file I'm using at the moment gives (1922,) for both np.arrays within the tuple. The array lengths should be exactly the same, since they are created from a np.amax call on the length of the datasets, and np.amax will only return two objects if they are of the exact same length. Commented Feb 21, 2014 at 23:04
  • After the stack (of 2 of them) what do you want? An array with a (1922,2) shape, (2,1922) or (3844,)? In the short example a is (2,2). In the long case, should the individual arrays be 1d or 2d? Commented Feb 21, 2014 at 23:33
  • I need vstack((a,a,a)).T to produce a (m,3) array. Note that vstack does concatenate([atleast_2d(_m) for _m in tup], 0). Commented Feb 22, 2014 at 2:44
  • In your small example, the 2 arrays in t are each (2,2). I assume that in the large case you want each of the arrays in the tuple to be (1922,2). If that is the case, hstack should work fine. But with (1922,) it won't because that is 1d. What is that [] doing at the end of 'First Array'? Commented Feb 23, 2014 at 3:23
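The shapes discussed in these comments can be checked with a short Python 3 sketch. The array `a` here is a made-up 1-D numeric array, not the data from the question.

```python
import numpy as np

a = np.arange(4.0)   # a 1-D array of shape (4,)

# hstack on 1-D inputs joins them end to end.
flat = np.hstack((a, a))

# vstack promotes each 1-D input to a row; transposing gives one column each.
columns_via_vstack = np.vstack((a, a, a)).T

# column_stack treats each 1-D input as a column directly.
columns_direct = np.column_stack((a, a, a))

print(flat.shape)                 # (8,)
print(columns_via_vstack.shape)   # (4, 3)
print(columns_direct.shape)       # (4, 3)
```

Note that with a 1-D object array of lists, these calls would stack the list objects themselves rather than their contents, so the rows still need converting to a 2-D array first.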

1 Answer

It's the empty list at the end of your arrays that's causing your problem:

>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[1, 2], [3, 4], []])
>>> a.shape
(2L, 2L)
>>> a.dtype
dtype('int32')
>>> b.shape
(3L,)
>>> b.dtype
dtype('O')

Because of that empty list at the end, NumPy creates a 1D object array whose items are Python lists (each two items long, apart from the empty one) instead of a 2D array.
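A quick way to confirm this diagnosis is to inspect `ndim`, `shape`, and the type of one element. The sketch below targets Python 3 and passes `dtype=object` explicitly, on the assumption that a recent NumPy is installed; newer releases raise an error for ragged input unless that dtype is requested.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 2], [3, 4], []], dtype=object)

print(a.ndim, a.shape)   # 2 (2, 2)
print(b.ndim, b.shape)   # 1 (3,)
print(type(b[0]))        # each item of b is a plain Python list

# Because b is 1-D, hstack appends its items instead of adding columns.
joined = np.hstack((b, b))
print(joined.shape)      # (6,)
```

Checking `ndim` on each array before stacking makes this kind of silent 1-D fallback easy to spot.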

2 Comments

Thanks for the tip. I checked out the shapes and both are still (1922,). Also, I placed the xy_array = np.array(filter(None, file_data)) before the contiguous data is searched for and it still gives me the same shape for both arrays, and still returns the 1-dimensional array.
Because I placed the np.array(filter(None, file_data)) before the contiguous data check, the two arrays are ensured to be exactly the same size. If they weren't the same size, there would be only one array - as the np.hstack only works on the data which is "largest" (in length). (The tuple would contain only one element if only one "largest" set of data was found.)
