I am writing an image similarity algorithm and use cv2.calcHist to extract image features. After the features are created, I save them to a JSON file as a list of numpy.float64 values: list(numpy.float64(features)). Each list is a multidimensional vector embedding.

In a second step I load the data from my json and prepare it for sklearn KNeighborsClassifier.

import numpy as np
import json
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity


with open('data.json') as f:
    jsonData = json.load(f)

X = []
y = []

for image in jsonData['images']:
    embeddingData = image['histogram']
    X.append(embeddingData)
    y.append(image['classification'])

X = np.array(X)
y = np.array(y)

#split dataset into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print('Shape of X_train:')
print(X_train.shape)
print('Shape of X_test:')
print(X_test.shape)
print('Shape of y_train:')
print(y_train.shape)

# Create KNN classifier
knn = KNeighborsClassifier(n_neighbors = 1, metric=cosine_similarity)
# Fit the classifier to the data
knn.fit(X_train, y_train)

#show predictions on the test data
y_pred = knn.predict(X_test)

When I run this code, I get the following error on the line

y_pred = knn.predict(X_test)
ValueError: Expected 2D array, got 1D array instead:
array=[1.13707140e-01 9.81128156e-01 2.89475545e-02 ... 0.00000000e+00
 5.02811105e-04 1.15502894e-01].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The output of the shape part is:

Shape of X_train:
(36, 4096)
Shape of X_test:
(9, 4096)
Shape of y_train:
(36,)

I tried to use the reshape suggestion

y_pred = knn.predict(X_test.reshape(-1, 1))

, which helped other people with the same problem like in this post, but which got me

ValueError: X has 1 features, but KNeighborsClassifier is expecting 4096 features as input.

4096 being the dimensions of my histogram features.

I tried reshaping X_train as well for it to match with X_test again:

knn.fit(X_train.reshape(-1, 1), y_train)

, but this leads to

ValueError: Found input variables with inconsistent numbers of samples: [147456, 36]

At first, I tried a slightly different approach based on a knn example where they trained their model on the iris dataset, but there knn.fit would not accept the training data with the same 2D/1D value error. Then I found this example from pyimagesearch which is pretty much what I want to do, except I have the one intermediate step with the json file. The json however is necessary in my case because I want to add other embeddings later and do not want to recalculate everything.

What I do not understand is why knn.fit accepts the data from X_train, but knn.predict does not accept the data from X_test, which were produced in the same way. Why is the error fixed for one case, but not the other?

I already tried the suggested solutions from this, this and this post, but the solution with reshape does not work in my case, as mentioned above. When I try adding extra brackets like this:

y_pred = knn.predict([X_test])

, I get the following error:

ValueError: Found array with dim 3. KNeighborsClassifier expected <= 2.

I also tried to find other examples, but found very few using similar data structures, and the ones I found did not help either.

I also found this question with the same problem, but the accepted answer is not a solution to the problem.

Here's the json file I read from.

  • What are the dimensions/shape of X_train, X_test, Y_train? Is the error only on predict and not on fit? Commented Dec 10, 2024 at 22:35
  • "does not seem to work", details required. minimal reproducible example is required, but incomplete. you read from a file but don't provide it. Commented Dec 11, 2024 at 10:12
  • @rehaqds Yes, the error is only on predict and not on fit. Shape of X_train: (36, 4096) Shape of X_test: (9, 4096) Shape of y_train: (36,) Commented Dec 11, 2024 at 19:43
  • In your question, there is still a mix of different experiments, which makes things quite confusing. I looked at the numbers in the main error "Expected 2D array, got 1D array instead": the first 3 values come from a histogram part and the last 3 from the embeddingImage just after it, so it seems that you tried to concatenate both classes of features, unlike what we see in the code. Moreover, the error message seems to say that at that moment it received only one item and not the 9 expected for the test set. Commented Dec 13, 2024 at 23:02
  • OK, that was tricky, but the culprit is the metric! Use metric='cosine' and it should work. In the doc (scikit-learn.org/stable/modules/generated/…) you will find the possible strings, and it also says that you can use a function as you tried to do, but the function should take two 1D arrays and return a scalar, whereas cosine_similarity() takes two 2D arrays and returns a 2D array. So the error message was not directly related to what was given to predict but to what was given to the metric that was called by predict. Commented Dec 14, 2024 at 13:50

1 Answer

Since the error message "Expected 2D array, got 1D array instead" is raised on the call knn.predict(X_test), it is logical to think that X_test does not have the correct dimensions. But as you said, X_test does have the correct dimensions, so at first sight the error does not seem to make sense.

Indeed, the error message is somewhat misleading in this particular case, as the problem is hidden in the definition of knn two lines above, specifically in its metric:

knn = KNeighborsClassifier(n_neighbors = 1, metric=cosine_similarity)

If you change the metric to 'cosine', it will work.
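As a minimal sketch of that change, the snippet below uses random arrays with the same shapes as in the question, (36, 4096) and (9, 4096), instead of the real histograms. The random data and the three made-up class labels are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Assumption: random stand-in data with the shapes reported in the question.
rng = np.random.default_rng(0)
X_train = rng.random((36, 4096))
y_train = np.arange(36) % 3          # three made-up class labels
X_test = rng.random((9, 4096))

# Pass the metric as a string instead of the cosine_similarity function.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X_train, y_train)

# predict() now accepts the 2D test array without any reshaping.
y_pred = knn.predict(X_test)
print(y_pred.shape)
```

No reshape of X_train or X_test is needed; both are already 2D with one row per image.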

This is not very intuitive, but in the doc you will find the possible strings for metric. It also says that you can pass a function, as you tried to do, but the function must take two 1D arrays as inputs and return a scalar:

metric: str or callable, default=’minkowski’ Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2. See the documentation of scipy.spatial.distance and the metrics listed in distance_metrics for valid metric values. [...] If metric is a callable function, it takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors [...]
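If you do want to pass a function, here is a rough sketch of a callable that matches this contract. The helper name cosine_distance is hypothetical, the data is random, and algorithm="brute" is set explicitly only to keep the example simple. Note that it returns a distance (1 minus the similarity), since a nearest-neighbor search looks for the smallest value.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cosine_distance(u, v):
    """Hypothetical helper: takes two 1D vectors and returns one scalar distance."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Assumption: small random stand-in data, not the real histograms.
rng = np.random.default_rng(0)
X_train = rng.random((36, 64))
y_train = np.arange(36) % 3
X_test = rng.random((9, 64))

# The callable is invoked on pairs of rows, i.e. on 1D arrays.
knn_callable = KNeighborsClassifier(
    n_neighbors=1, metric=cosine_distance, algorithm="brute"
)
knn_callable.fit(X_train, y_train)
pred_callable = knn_callable.predict(X_test)

# Same model with the built-in string metric, for comparison.
knn_builtin = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn_builtin.fit(X_train, y_train)
pred_builtin = knn_builtin.predict(X_test)

print(np.array_equal(pred_callable, pred_builtin))
```

A Python callable is evaluated pair by pair, so it will likely be much slower than the string 'cosine' on larger datasets.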

But if you look at the definition of cosine_similarity(), it says that this function takes two 2D arrays and returns one 2D array.

That's why you got the error message "expected 2D, got 1D". The error was not caused by what was passed to predict() but by what was passed to the metric function that predict() called!
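You can reproduce the mismatch outside of the classifier. The sketch below uses two small made-up vectors and calls cosine_similarity() once with 2D inputs and once with 1D inputs, the latter being what a callable metric receives.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two small made-up vectors for illustration.
u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 1.0, 1.0])

# 2D inputs (one row per sample) give back a 2D similarity matrix.
sim = cosine_similarity(u.reshape(1, -1), v.reshape(1, -1))
print(sim.shape)  # (1, 1)

# 1D inputs are rejected with the same ValueError as in the question.
try:
    cosine_similarity(u, v)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```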
