Main Problem:
numpy arrays of the same type and same size are not being column stacked together using np.hstack, np.column_stack, or np.concatenate(axis=1).
Explanation:
I don't understand what property of a numpy array could change so that numpy.hstack, numpy.column_stack and numpy.concatenate(axis=1) stop working properly. In my real program the stacking only appends extra rows instead of adding columns. Is there some property of a numpy array that would cause this? No error is thrown; the result simply isn't the "right" or "normal" behavior.
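For reference, these stacking functions only add columns when their inputs are 2d; for 1d inputs they simply append end to end. A minimal plain-numpy sketch (not from my program) of that difference:

import numpy as np

# 2d inputs: hstack adds columns
a2 = np.array([[1, 2], [3, 4]])
print np.hstack((a2, a2)).shape   # (2, 4)

# 1d inputs: hstack just appends them end to end
a1 = np.array([1, 2, 3, 4])
print np.hstack((a1, a1)).shape   # (8,)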
I have tried a simple case which works as I would expect it to:
input:
a = np.array([['1', '2'], ['3', '4']], dtype=object)
b = np.array([['5', '6'], ['7', '8']], dtype=object)
np.hstack((a, b))
output:
np.array([['1', '2', '5', '6'], ['3', '4', '7', '8']], dtype=object)
That's perfectly fine by me, and what I want.
However, what I get from my program is this:
First array:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '-0.015765'] ['908.073', '-0.0154842'] []]
Second array (to be added on in columns):
[['29.8989', '26.8556'] ['29.8659', '26.7969'] ['29.902', '29.0183'] ...,
['908.791', '943.621'] ['908.073', '940.529'] []]
What should be the two arrays side by side or in columns:
[['29.8989', '0'] ['29.8659', '-8.54805e-005'] ['29.902', '-0.00015875']
..., ['908.791', '943.621'] ['908.073', '940.529'] []]
Clearly, this isn't the right answer.
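One diagnostic worth adding (a sketch using the longest_data_chains tuple from the module given at the bottom) is to print the shape and dtype of each piece right before stacking; if they come out as (N,) with dtype object rather than (N, 2), the inputs are effectively 1d:

# Diagnostic sketch: inspect each piece right before stacking
for chain in longest_data_chains:
    print chain.shape, chain.dtype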
The module causing the problem is rather long (I give it at the bottom), but here is a simplified version of it which still works, i.e. performs the correct column stacking, just like the first example:
import numpy as np

def contiguous_regions(condition):
    d = np.diff(condition)
    idx, = d.nonzero()
    idx += 1
    if condition[0]:
        idx = np.r_[0, idx]
    if condition[-1]:
        idx = np.r_[idx, condition.size]
    idx.shape = (-1, 2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False

total_array = np.array([['1', '2'], ['3', '4'], ['strings', 'here'], ['5', '6'], ['7', '8']], dtype=object)
where_number = np.array(map(is_number, total_array))
contig_ixs = contiguous_regions(where_number)
print contig_ixs
t = tuple(total_array[s[0]:s[1]] for s in contig_ixs)
print t
print np.hstack(t)
It basically looks through an array of lists and finds the longest contiguous sets of numeric rows. I would like to column stack those sets of data if they are of the same length.
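Running the simplified script prints roughly the following; note that each slice in the tuple t is a proper (2, 2) array, which is why hstack adds columns here:

[[0 2]
 [3 5]]
(array([['1', '2'],
       ['3', '4']], dtype=object), array([['5', '6'],
       ['7', '8']], dtype=object))
[['1' '2' '5' '6']
 ['3' '4' '7' '8']]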
Here is the real module providing the problem:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()
    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = np.array(map(lambda line: line.rstrip('\n').replace(',', ' ').split(), file_data))
    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))
    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, file_data))
    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths (number of rows) for each set of data in the file
        data_lengths = contig[:,1] - contig[:,0]
        # Get the maximum length of data (max number of contiguous rows) in the file
        maxs = np.amax(data_lengths)
        # Find the indices for where this long list of data is (index within the indices array of the file)
        # If there are two equally long lists of data, get both indices
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])
    ###############################################################################################
    ###############################################################################################
    # PROBLEM ORIGINATES HERE
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    # The file data with this longest contiguous chain of numbers
    # If there are multiple sets of data of the same length, they are added in columns
    longest_data_chains = tuple([file_data[i[0]:i[1]] for i in ss])
    print "First array:"
    print longest_data_chains[0]
    print
    print "Second array (to be added on in columns):"
    print longest_data_chains[1]
    column_stacked_data_chain = np.concatenate(longest_data_chains, axis=1)
    print
    print "What should be the two arrays side by side or in columns:"
    print column_stacked_data_chain
    ###############################################################################################
    ###############################################################################################
    xy = np.array(zip(*xy_array), dtype=float)
    return xy
#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
UPDATE:
I got it to work with the help of @hpaulj. Apparently the fact that the data looked like np.array([['1','2'],['3','4']]) in both cases was not sufficient: in the real case the dtype was object and some of the lists contained strings, so numpy was building a 1d array (of lists) instead of the required 2d array.
The fix was to call map(float, ...) on every row (list) produced from the lines returned by readlines, which turns the data into a proper 2d float array.
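A minimal sketch of what was going on (the ragged trailing [] is my guess, based on the printed arrays above, for what triggered it): an empty row makes the rows ragged, numpy then builds a 1d object array of lists, and stacking 1d arrays only appends them; casting each row with map(float, ...) yields a true 2d array, and hstack adds columns as expected:

import numpy as np

# Ragged rows (note the trailing []) force a 1d object array of lists,
# so hstack appends instead of adding columns.
bad = np.array([['1', '2'], ['3', '4'], []], dtype=object)
print bad.shape                     # (3,)
print np.hstack((bad, bad)).shape   # (6,)

# map(float, ...) on every row gives a proper 2d float array,
# and hstack then adds columns.
good = np.array([map(float, row) for row in [['1', '2'], ['3', '4']]])
print good.shape                      # (2, 2)
print np.hstack((good, good)).shape   # (2, 4)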
Here is what I ended up with:
import numpy as np

def retrieve_XY(file_path):
    # XY data is read in from a file in text format
    file_data = open(file_path).readlines()
    # The list of strings (lines in the file) is made into a list of lists while splitting by whitespace and removing commas
    file_data = map(lambda line: line.rstrip('\n').replace(',', ' ').split(), file_data)
    # Remove empty lists, make into numpy array
    xy_array = np.array(filter(None, file_data))
    # Each line is searched to make sure that all items in the line are a number
    where_num = np.array(map(is_number, xy_array))
    # The data is searched for the longest contiguous chain of numbers
    contig = contiguous_regions(where_num)
    try:
        # Data lengths
        data_lengths = contig[:,1] - contig[:,0]
        # All maximums in contiguous data
        maxs = np.amax(data_lengths)
        longest_contig_idx = np.where(data_lengths == maxs)
    except ValueError:
        print 'Problem finding contiguous data'
        return np.array([])
    # Starting and stopping indices of the contiguous data are stored
    ss = contig[longest_contig_idx]
    print ss
    # The file data with this longest contiguous chain of numbers
    # float must be applied to each value in the rows of the contiguous data, and the result cast to a numpy array
    longest_data_chains = np.array([[map(float, n) for n in xy_array[i[0]:i[1]]] for i in ss])
    # If there are multiple sets of data of the same length, they are added in columns
    column_stacked_data_chain = np.hstack(longest_data_chains)
    xy = np.array(zip(*column_stacked_data_chain), dtype=float)
    return xy
#http://stackoverflow.com/questions/4494404/find-large-number-of-consecutive-values-fulfilling-condition-in-a-numpy-array
def contiguous_regions(condition):
    """Finds contiguous True regions of the boolean array "condition". Returns
    a 2D array where the first column is the start index of the region and the
    second column is the end index."""
    # Find the indices of changes in "condition"
    d = np.diff(condition)
    idx, = d.nonzero()
    # We need to start things after the change in "condition". Therefore,
    # we'll shift the index by 1 to the right.
    idx += 1
    if condition[0]:
        # If the start of condition is True prepend a 0
        idx = np.r_[0, idx]
    if condition[-1]:
        # If the end of condition is True, append the length of the array
        idx = np.r_[idx, condition.size]
    # Reshape the result into two columns
    idx.shape = (-1, 2)
    return idx

def is_number(s):
    try:
        np.float64(s)
        return True
    except ValueError:
        return False
This function now takes in a file and outputs the longest contiguous chain of numeric data found within it. If multiple data sets of the same length are found, they are column stacked.
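For example (the file name here is made up; any whitespace- or comma-separated text file of XY columns should do):

xy = retrieve_XY('example_data.txt')   # hypothetical file path
print xy.shape   # (2*k, N), where k is the number of equal-length data sets found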
(From the comment exchange with @hpaulj that led to the fix: in the failing case the two np.arrays in the tuple each had shape (1922,), i.e. 1d object arrays, whereas hstack needs each of them to be (1922, 2); in the short example the pieces of t are (2, 2), which is why it works there. vstack((a, a, a)).T can produce an (m, 3) array, and vstack effectively does concatenate([atleast_2d(_m) for _m in tup], 0). With (1922,) arrays hstack cannot add columns because the inputs are 1d, and the stray [] at the end of 'First array' is what made numpy build a 1d object array in the first place.)