
I'm working on a coreference resolution model and am trying to feed a large matrix of data into my CNN's input layer. For illustration purposes, I have truncated my data to work with more manageable numbers.

Format Data Function

EMBEDDING_DIM = 400

...

@staticmethod
def get_train_data(data: DefaultDict[ClusteredDictKey, PreCoCoreferenceDatapoint], embedding_model) -> Tuple[List[Tensor], List[Tensor]]:
    """
    (n_samples, n_words, n_attributes (word embedding, pos, etc))
    [ [ [ word_embedding, pos ] ] ]

    xtrain[sentence_sample][word_position][attribute]
    xtrain[0][0] -> first word's attributes in first sentence
    xtrain[37][5] -> sixth word's attributes in 38th sentence
    xtrain[0][0][0] -> word_embedding
    xtrain[0][0][1] -> pos one-hot encoding
    """
    xtrain = []
    ytrain = []
    pos_onehot = PreCoParser.get_pos_onehot() # dictionary mapping POS to one-hot encoding

    for key, value in data.items():
        training_data = []

        sentence_embeddings = PreCoParser.get_embedding_for_sent(key.sentence, embedding_model) # Returns tensor (ndarray) of shape: (tokens_in_sent, EMBEDDING_DIM)
    pos = PreCoParser.get_pos_onehot_for_sent(key.sentence, pos_onehot) # Returns tensor (ndarray) of shape: (tokens_in_sent, 45)

        assert sentence_embeddings.shape == (len(key.sentence), EMBEDDING_DIM)
        assert pos.shape == (len(key.sentence), 45)

        for i, embedding in enumerate(sentence_embeddings):
            training_data.append(np.asarray([embedding, np.asarray(pos[i])]))

        cluster_indices = list(sum([cluster.indices for cluster in value], ()))
        # Delete every third element to remove sentence index
        del cluster_indices[0::3]

        if len(training_data) > 0:
            xtrain.append(np.asarray(training_data))
            ytrain.append(np.asarray(cluster_indices) / len(key.sentence)) # normalize output data

    gc.collect()
    return (np.asarray(xtrain), np.asarray(ytrain))

Abbreviated Issue

In short, I have a NumPy array that I am able to successfully run the following assert on:

assert self.xtrain[0][0][0].shape == (EMBEDDING_DIM,)

implying, to me at least, that the array has 4 dimensions, with the final axis containing EMBEDDING_DIM elements (400 in my case).

However, running the following code yields a weird result:

>>> self.xtrain.shape
(500,) 
>>> self.xtrain[0].shape # on sentence with 11 words
(11, 2)
>>> self.xtrain[0][0].shape # two attributes
(2,)
>>> self.xtrain[0][0][0].shape
(400,)
>>> self.xtrain[0][0][1].shape
(45,)

where 500 refers to my truncated number of samples (and all outputs match what I expected). Additionally, when feeding this data through a simple Keras Conv2D input layer, I am greeted with the following error:

    self.model.fit(self.xtrain, self.ytrain, epochs=1)
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training.py", line 1154, in fit
    batch_size=batch_size)
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training.py", line 579, in _standardize_user_data
    exception_prefix='input')
  File "/usr/local/lib/python3.7/site-packages/keras/engine/training_utils.py", line 135, in standardize_input_data
    'with shape ' + str(data_shape))
ValueError: Error when checking input: expected conv2d_1_input to have 4 dimensions, but got array with shape (499, 1)

I'll happily post more code if need be, but any help at this point would be greatly appreciated!

Comments:

  • Check the dtype at each level as well. It appears that xtrain is a 1d array of object. One, maybe all, elements are (11, 2) arrays, object as well. It's the mix of shapes at the lowest level, 400 and 45, that's preventing you from getting an n-d array all the way down. Commented Mar 10, 2020 at 1:09
  • You're totally right, my dtypes are all intermingled. From outer to inner, they are: object, object, float32, uint8. I assume my next step is to normalize the dtype to float32? Commented Mar 10, 2020 at 1:29
  • You get object dtype when dimensions don't match. Commented Mar 10, 2020 at 2:18
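The behavior described in these comments can be reproduced with a minimal sketch (shapes taken from the question; values are dummies). Because the two per-word attributes have different lengths, 400 vs 45, NumPy cannot build one regular n-d array and the outer levels end up as dtype=object containers — exactly what makes Keras see a 1-d input:

```python
import numpy as np

# One "word": a 400-dim embedding paired with a 45-dim POS one-hot.
embedding = np.zeros(400, dtype='float32')
pos = np.zeros(45, dtype='uint8')

# Ragged inner shapes force an object array. (Older NumPy did this
# silently for np.asarray([embedding, pos]); newer versions require
# an explicit dtype=object.)
word = np.asarray([embedding, pos], dtype=object)  # shape (2,), dtype object
sentence = np.stack([word] * 11)                   # shape (11, 2), dtype object

print(sentence.shape, sentence.dtype)  # (11, 2) object
print(sentence[0][0].shape)            # (400,)
print(sentence[0][1].shape)            # (45,)
```

The indexing still "works" all the way down, which is why the asserts in the question pass even though no true 4-d array exists.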

2 Answers


I would have left this as a comment, but since my reputation isn't >50 it wouldn't let me :(

My only guess about the error is that in model.fit() you've swapped ytrain and xtrain, since for ytrain I could imagine an input of shape (499, 1). I'm afraid I'd need to see more of the code where the model is fed the input data and labels.


To anyone else who may find this in the future: the issue was partly what hpaulj pointed out in the comments. The other problem was in my data parsing. I ultimately converted the input data to a NumPy array of shape (n_training_samples, INPUT_MAXLEN, 2, EMBEDDING_DIM). Once I normalized the structure of the matrix and removed arrays with dtype='object', everything worked perfectly.

I referred to this website to effectively initialize an empty NumPy array: http://akuederle.com/create-numpy-array-with-for-loop
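The pre-allocation pattern from that page boils down to creating the full, homogeneous array up front and writing into slices, rather than appending ragged pieces to a Python list. A minimal sketch with made-up sizes (the random values stand in for real embeddings):

```python
import numpy as np

N_SAMPLES, INPUT_MAXLEN, EMBEDDING_DIM = 5, 10, 400

# Allocate the final float array once (zeros so untouched rows are defined)...
xtrain = np.zeros((N_SAMPLES, INPUT_MAXLEN, 2, EMBEDDING_DIM), dtype='float32')

# ...then fill slices inside the loop; every write must match the slot's shape,
# so ragged data can't silently sneak in as dtype=object.
for i in range(N_SAMPLES):
    for j in range(INPUT_MAXLEN):
        xtrain[i, j, 0] = np.random.rand(EMBEDDING_DIM)  # word embedding
        xtrain[i, j, 1] = np.random.rand(EMBEDDING_DIM)  # padded POS one-hot

print(xtrain.shape, xtrain.dtype)  # (5, 10, 2, 400) float32
```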

Final code:

EMBEDDING_DIM = 400

...
@staticmethod
def get_train_data(data: DefaultDict[ClusteredDictKey, PreCoCoreferenceDatapoint], inputmaxlen: int, embedding_model) -> Tuple[List[Tensor], List[Tensor]]:
    """
    (n_samples, n_words, n_attributes (word embedding, pos, etc))
    [ [ [ word_embedding, pos ] ] ]

    xtrain[sentence_sample][word_position][attribute]
    xtrain[37][5] -> sixth word's attributes in 38th sentence (np.ndarray containing two np.ndarrays)
    xtrain[0][0][0] -> word_embedding (np.ndarray)
    xtrain[0][0][1] -> pos one-hot encoding (np.ndarray)
    """
    xtrain = np.empty((len(data), inputmaxlen, 2, EMBEDDING_DIM))
    ytrain = []
    pos_onehot = PreCoParser.get_pos_onehot()

    for i, (key, value) in enumerate(data.items()):
        training_data = []

        sentence_embeddings = PreCoParser.get_embedding_for_sent(key.sentence, embedding_model)
        pos = PreCoParser.get_pos_onehot_for_sent(key.sentence, pos_onehot)
        assert sentence_embeddings.shape == (len(key.sentence), EMBEDDING_DIM)
        assert pos.shape == (len(key.sentence), 45)

        for j, word_embeddings in enumerate(sentence_embeddings):
            pos_embeddings = sequence.pad_sequences([pos[j]], maxlen=EMBEDDING_DIM, dtype='float32', padding='post')[0]
            xtrain[i][j][0] = word_embeddings
            xtrain[i][j][1] = pos_embeddings

        cluster_indices = list(sum([cluster.indices for cluster in value], ()))
        # Delete every third element to remove sentence index
        del cluster_indices[0::3]

        ytrain.append(np.asarray(cluster_indices) / len(key.sentence))

    gc.collect()
    return (np.asarray(xtrain, dtype='float32'), np.asarray(ytrain, dtype='float32'))
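The sequence.pad_sequences call above right-pads the 45-element POS one-hot out to EMBEDDING_DIM so both attribute rows have the same length. The same effect can be sketched with plain NumPy (np.pad shown as an equivalent, not the code the answer actually uses):

```python
import numpy as np

EMBEDDING_DIM = 400

pos_vec = np.ones(45, dtype='float32')  # stand-in for a 45-dim POS one-hot
# Pad with zeros on the right so the vector matches the embedding length.
padded = np.pad(pos_vec, (0, EMBEDDING_DIM - len(pos_vec)))

print(padded.shape)        # (400,)
print(padded[:45].sum())   # 45.0 -- original values preserved
print(padded[45:].sum())   # 0.0  -- padding is zeros
```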

