I'm working on a coreference resolution model and am trying to feed a large matrix of data into my CNN's input layer. For illustration purposes, I have truncated my data to work with more manageable numbers.
Format Data Function
import gc
from typing import DefaultDict, List, Tuple

import numpy as np

# (project-specific imports such as PreCoParser omitted)

EMBEDDING_DIM = 400
...

@staticmethod
def get_train_data(data: DefaultDict[ClusteredDictKey, PreCoCoreferenceDatapoint], embedding_model) -> Tuple[List[Tensor], List[Tensor]]:
    """
    (n_samples, n_words, n_attributes (word embedding, pos, etc.))
    [ [ [ word_embedding, pos ] ] ]

    xtrain[sentence_sample][word_position][attribute]
    xtrain[0][0]     -> first word's attributes in first sentence
    xtrain[37][5]    -> sixth word's attributes in 38th sentence
    xtrain[0][0][0]  -> word embedding
    xtrain[0][0][1]  -> POS one-hot encoding
    """
    xtrain = []
    ytrain = []
    pos_onehot = PreCoParser.get_pos_onehot()  # dictionary mapping POS tags to one-hot encodings

    for key, value in data.items():
        training_data = []
        sentence_embeddings = PreCoParser.get_embedding_for_sent(key.sentence, embedding_model)  # ndarray of shape (tokens_in_sent, EMBEDDING_DIM)
        pos = PreCoParser.get_pos_onehot_for_sent(key.sentence, pos_onehot)  # ndarray of shape (tokens_in_sent, 45)

        assert sentence_embeddings.shape == (len(key.sentence), EMBEDDING_DIM)
        assert pos.shape == (len(key.sentence), 45)

        for i, embedding in enumerate(sentence_embeddings):
            training_data.append(np.asarray([embedding, np.asarray(pos[i])]))

        cluster_indices = list(sum([cluster.indices for cluster in value], ()))
        # Delete every third element to remove the sentence index
        del cluster_indices[0::3]

        if len(training_data) > 0:
            xtrain.append(np.asarray(training_data))
            ytrain.append(np.asarray(cluster_indices) / len(key.sentence))  # normalize output data

    gc.collect()
    return (np.asarray(xtrain), np.asarray(ytrain))
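To make the structure easier to see, here is a minimal, self-contained sketch of the same packing step with random stand-ins for the embedding and POS lookups (the helper name, sentence lengths, and values below are placeholders I made up, not the real parser output). It ends up with the same nested shapes I show further down:

    import numpy as np

    EMBEDDING_DIM = 400
    N_POS_TAGS = 45

    def pack_sentence(n_words):
        """Stand-in for the per-sentence packing loop above."""
        rows = []
        for _ in range(n_words):
            embedding = np.random.rand(EMBEDDING_DIM).astype(np.float32)  # stand-in for the word embedding
            pos = np.zeros(N_POS_TAGS, dtype=np.uint8)                    # stand-in for the POS one-hot
            # Pairing a (400,) array with a (45,) array can only be stored as dtype=object.
            # Older NumPy does this implicitly (with a warning); newer NumPy requires
            # dtype=object to be spelled out.
            rows.append(np.asarray([embedding, pos], dtype=object))
        return np.asarray(rows)  # shape (n_words, 2), dtype=object

    # Sentences of different lengths, like the real data
    xtrain = np.asarray([pack_sentence(n) for n in (11, 9, 14)], dtype=object)
    print(xtrain.shape)           # (3,)     -- 1-D object array, one entry per sentence
    print(xtrain[0].shape)        # (11, 2)
    print(xtrain[0][0][0].shape)  # (400,)
    print(xtrain[0][0][1].shape)  # (45,)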
Abbreviated Issue
In short, I have a NumPy array that I am able to successfully run the following assert on:
assert self.xtrain[0][0][0].shape == (EMBEDDING_DIM,)
implying, to me at least, that the array has 4 dimensions, with the innermost vector containing EMBEDDING_DIM elements (400 in my case).
However, running the following code yields a weird result:
>>> self.xtrain.shape
(500,)
>>> self.xtrain[0].shape # on sentence with 11 words
(11,2)
>>> self.xtrain[0][0].shape # two attributes
(2,)
>>> self.xtrain[0][0][0].shape
(400,)
>>> self.xtrain[0][0][1].shape
(45,)
where 500 refers to my truncated number of samples (and all of the outputs are consistent with what I expected). Additionally, when feeding this data through a simple Keras Conv2D input layer, I am greeted with the following error:
self.model.fit(self.xtrain, self.ytrain, epochs=1)
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training.py", line 1154, in fit
    batch_size=batch_size)
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training.py", line 579, in _standardize_user_data
    exception_prefix='input')
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training_utils.py", line 135, in standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected conv2d_1_input to have 4 dimensions, but got array with shape (499, 1)
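For what it's worth, the layer itself seems fine with a plain, homogeneous 4-D float array: a dummy-data sketch like the one below (all sizes are made up by me) should fit without this error, which makes me suspect the ragged structure of xtrain rather than the model definition.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense

    # 445 = 400-dim embedding + 45-dim one-hot flattened into one feature axis
    # (my guess at a workable layout, not what my code currently produces)
    n_samples, n_words, n_attrs, n_channels = 500, 11, 445, 1
    dummy_x = np.zeros((n_samples, n_words, n_attrs, n_channels), dtype=np.float32)
    dummy_y = np.zeros((n_samples, 1), dtype=np.float32)

    model = Sequential([
        Conv2D(8, kernel_size=(3, 3), activation='relu',
               input_shape=(n_words, n_attrs, n_channels)),
        Flatten(),
        Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')
    model.fit(dummy_x, dummy_y, epochs=1)  # accepts the 4-D float32 input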
I'll happily post more code if need be, but any help at this point would be greatly appreciated!
Comment: Check the dtype at each level as well. It appears that xtrain is a 1-D array of object. One, maybe all, of its elements are (11, 2) arrays, object as well. It's the mix of shapes at the lowest level, 400 and 45, that's preventing you from getting an n-d array all the way down.

Reply: The dtypes are object, object, float32, uint8. I assume my next step is to normalize the dtype into float32?
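If I'm reading the comment right, something like the following untested sketch is what I'd try next: inspect the dtype at each level, then concatenate each word's (400,) embedding and (45,) one-hot into a single (445,) float32 vector and pad sentences to a common length, so everything stacks into one 4-D float32 array. The padding length and trailing channel axis are guesses on my part.

    import numpy as np

    def inspect_dtypes(xtrain):
        # dtype at each level, as suggested in the comment
        print(xtrain.dtype)           # object
        print(xtrain[0].dtype)        # object
        print(xtrain[0][0][0].dtype)  # float32 (embedding)
        print(xtrain[0][0][1].dtype)  # uint8   (POS one-hot)

    def to_dense(xtrain, max_len):
        n_features = EMBEDDING_DIM + 45  # 400 + 45 = 445
        out = np.zeros((len(xtrain), max_len, n_features, 1), dtype=np.float32)
        for i, sent in enumerate(xtrain):                # sent: (n_words, 2) object array
            for j, (embedding, pos) in enumerate(sent):  # embedding: (400,), pos: (45,)
                out[i, j, :, 0] = np.concatenate([embedding, pos.astype(np.float32)])
        return out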