I am writing an image similarity algorithm and use cv2.calcHist to extract image features. After the features are created, I save them to a JSON file as a list of numpy.float64 values:
list(numpy.float64(features))
Each list is a multidimensional vector embedding.
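For reference, the saving and loading steps look roughly like the sketch below. The random array is only a stand-in for the cv2.calcHist output, and the 16x16x16 bin layout, the label "example", and the file name data_sketch.json are assumptions chosen for illustration. The keys mirror the ones read in the loading code.

```python
import json
import os
import tempfile

import numpy as np

# Stand-in for a cv2.calcHist result (assumed 16x16x16 bins, float32).
rng = np.random.default_rng(0)
hist = rng.random((16, 16, 16)).astype(np.float32)

# Flatten to a 1-D vector and convert to plain Python floats for JSON.
features = hist.flatten().astype(np.float64).tolist()

# Hypothetical record layout with the same keys as in the loading code.
record = {"images": [{"histogram": features, "classification": "example"}]}

path = os.path.join(tempfile.gettempdir(), "data_sketch.json")
with open(path, "w") as f:
    json.dump(record, f)

# Read the file back and rebuild the feature matrix.
with open(path) as f:
    loaded = json.load(f)

X = np.array([image["histogram"] for image in loaded["images"]])
print(X.shape)  # one row per image, one column per histogram bin
```

With a single stored image this gives a 2D array with one row, so the JSON round trip by itself should not turn the data into a 1D array.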
In a second step, I load the data from my JSON file and prepare it for sklearn's KNeighborsClassifier:
import numpy as np
import json
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
with open('data.json') as f:
    jsonData = json.load(f)
X = []
y = []
for image in jsonData['images']:
    embeddingData = image['histogram']
    X.append(embeddingData)
    y.append(image['classification'])
X = np.array(X)
y = np.array(y)
#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
print('Shape of X_train:')
print(X_train.shape)
print('Shape of X_test:')
print(X_test.shape)
print('Shape of y_train:')
print(y_train.shape)
# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors = 1, metric=cosine_similarity)
# Fit the classifier to the data
knn.fit(X_train, y_train)
#show predictions on the test data
y_pred = knn.predict(X_test)
When I run this code, I get the following error on the line
y_pred = knn.predict(X_test)
ValueError: Expected 2D array, got 1D array instead:
array=[1.13707140e-01 9.81128156e-01 2.89475545e-02 ... 0.00000000e+00
5.02811105e-04 1.15502894e-01].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The output of the shape part is:
Shape of X_train:
(36, 4096)
Shape of X_test:
(9, 4096)
Shape of y_train:
(36,)
I tried the reshape suggestion, which helped other people with the same problem (as in this post):
y_pred = knn.predict(X_test.reshape(-1, 1))
but that gave me
ValueError: X has 1 features, but KNeighborsClassifier is expecting 4096 features as input.
4096 is the dimensionality of my histogram features.
I also tried reshaping X_train so that it matches X_test again:
knn.fit(X_train.reshape(-1, 1), y_train)
but this leads to
ValueError: Found input variables with inconsistent numbers of samples: [147456, 36]
At first, I tried a slightly different approach based on a KNN example that trains the model on the iris dataset, but there knn.fit rejected the training data with the same 2D/1D ValueError. Then I found this example from pyimagesearch, which is pretty much what I want to do, except that I have the intermediate step with the JSON file. The JSON file is necessary in my case because I want to add other embeddings later and do not want to recalculate everything.
What I do not understand is why knn.fit accepts the data from X_train, but knn.predict does not accept the data from X_test, which were produced in the same way. Why is the error fixed for one case, but not the other?
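One check that might help narrow this down is sketched below. It rests on an assumption I have not confirmed against the sklearn source: that a callable metric is invoked on pairs of single 1-D rows, and only once neighbors are actually searched for, which would explain why fit passes and predict fails. The random arrays are stand-ins with the same shapes as my data, and the three dummy class labels are made up. The sketch calls cosine_similarity directly on two rows, then tries the built-in string metric "cosine" instead of the function.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import KNeighborsClassifier

# Random stand-ins with the same shapes as the real data.
rng = np.random.default_rng(0)
X_train = rng.random((36, 4096))
y_train = np.arange(36) % 3  # three dummy classes
X_test = rng.random((9, 4096))

# cosine_similarity expects 2-D inputs; check what happens with two 1-D rows.
try:
    cosine_similarity(X_train[0], X_test[0])
    raised = False
except ValueError as err:
    raised = True
    print("cosine_similarity on 1-D rows failed:", err)

# Same classifier, but with the built-in cosine distance selected by name.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(y_pred.shape)
```

If the direct call fails with the same 2D/1D message, the shapes of X_train and X_test are probably not the issue. Note also that cosine_similarity returns a similarity (larger means closer), whereas the metric parameter expects a distance, so the string "cosine" may be the more appropriate choice either way.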
I already tried the suggested solutions from this, this and this post, but the reshape solution does not work in my case, as mentioned above. When I add extra brackets like this:
y_pred = knn.predict([X_test])
I get the following error:
ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.
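The shapes produced by the attempts above can be reproduced with NumPy alone, which shows where the numbers in the error messages come from. This is a sketch using zero-filled arrays with the same shapes as my data, not the real histograms.

```python
import numpy as np

# Zero-filled stand-ins with the shapes printed above.
X_train = np.zeros((36, 4096))
X_test = np.zeros((9, 4096))

# reshape(-1, 1) turns every single value into its own one-feature sample.
flat_train = X_train.reshape(-1, 1)
flat_test = X_test.reshape(-1, 1)
print(flat_train.shape)  # (147456, 1): 36 * 4096 "samples", 1 feature
print(flat_test.shape)   # (36864, 1): 9 * 4096 "samples", 1 feature

# Wrapping the 2-D array in a list adds a third, leading dimension.
wrapped = np.array([X_test])
print(wrapped.shape)     # (1, 9, 4096)
```

So reshape(-1, 1) explains both the "1 features" message and the 147456 samples, and the extra brackets explain the "dim 3" message. In both cases the original arrays were already 2D before I changed them.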
I also tried to find other examples, but found very few using similar data structures, and the ones I found did not help either.
I also found this question with the same problem, but the accepted answer is not a solution to the problem.
Here's the json file I read from.