I am trying to compute the Euclidean distance between pairs of points. My DataFrame has around 13,000 rows, and I need the Euclidean distance from each row to all 13,000 rows, then the similarity scores from those distances. The code takes far too long to run (more than 24 hours).

Below is my code:

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' refers all id's in the dataframe
# Length of df_distance is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, we set "index" to "i" on each pass so that the
    # comparison starts from index "i" itself.
    if i < len(df_distance):
        index = i

    # This loop is used to iterate one id with all 13000 id's. This is time consuming as we have to iterate each id against all 13000 id's 
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the data into dataframe
        dict = {
            'id': df_distance['id'][i], 
            'id_match': df_distance['id'][index], 
            'similarity_distance':value
        }


        df_50 = df_50.append(dict,ignore_index=True)

        # If "index" has reached the end of the array, reset it to 0
        # so that the comparison of "b" with "a" continues from the start.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1

    # Append the content of "df_50" into "df_similar" once for the iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I suspect most of the time is spent in the for loops.

This is the Euclidean distance function I am using in my code:

from sklearn.metrics.pairwise import euclidean_distances
def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value

Sample df_distance data. Note: in the image, the values from the locality column through the last column are scaled, and only these columns are used to calculate the distance.

[image: sample df_distance data]

The output should be in the format below.

[image: required output format]

  • I think this may answer your question. stackoverflow.com/questions/32946241/… Commented Apr 24, 2022 at 14:25
  • what is tqdm? Commented Apr 24, 2022 at 14:32
  • can you show some example data in df_distance and the euclidean_dist function? Commented Apr 24, 2022 at 14:36
  • @Stuart tqdm is used to display progress bar in jupyter notebook. Commented Apr 24, 2022 at 16:56
  • @Stuart I have added the snapshot in the question and euclidean_dist function is also added there now. Thanks! Commented Apr 24, 2022 at 17:01

2 Answers

3

Try using NumPy instead; something like this:

import pandas as pd
import numpy as np 

def numpy_euclidian_distance(point_1, point_2):
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance 
    
    
# Initialise the data as a dict of lists.
data = {'num1':[1, 2, 3, 4], 'num2':[20, 21, 19, 18]}

# Create the DataFrame
df = pd.DataFrame(data)

# Calculate the distance over the whole columns at once using numpy
distance = numpy_euclidian_distance(df.iloc[:,0],df.iloc[:,1])
print(distance)
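
The same NumPy idea can be pushed further so that every row is compared with every other row in one vectorised step, with no Python loops. The following is only a sketch under assumptions taken from the question: the feature columns start at position 2 and there is an `id` column. The function name `pairwise_long` and its parameters are hypothetical, not part of any library.

```python
import numpy as np
import pandas as pd

def pairwise_long(df, id_col="id", first_feature_col=2):
    """Return all-pairs Euclidean distances in long format (hypothetical helper)."""
    # Assumption: feature columns start at position `first_feature_col`.
    X = df.iloc[:, first_feature_col:].to_numpy(dtype=float)
    ids = df[id_col].to_numpy()

    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for every pair at once.
    sq = np.einsum("ij,ij->i", X, X)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)   # clip tiny negatives caused by rounding
    np.sqrt(d2, out=d2)           # d2 now holds the distances

    n = len(ids)
    return pd.DataFrame({
        "id": np.repeat(ids, n),        # each id repeated n times
        "id_match": np.tile(ids, n),    # all ids cycled n times
        "similarity_distance": d2.ravel(),
    })
```

With 13000 rows the full distance matrix alone holds 13000 × 13000 float64 values, roughly 1.35 GB, and the long-format frame has about 169 million rows, so this only fits if enough memory is available. The row order also differs from the loop in the question (each `id` is matched against the ids in their original order rather than starting from itself).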

5 Comments

This will still require looping through.
The problem is that OP has 13000 rows. Your code calculates the distance between row 1 and row 2, not between every row and every other row as OP requested.
The answer provided below might solve the issue.
@onyambu yes, you are right; the problem for me is calculating it for 13000 rows.
@Mahsaga your code can be optimized to calculate less than half of what it does now. The distance between two points a and b is the same as the distance between b and a, so you only need to calculate one of them.
1

OK, so from the comments I take it you want the top 50 distances (the 50 closest points for each row), which is faster as a single step using KDTree. As a warning, KDTree will only be faster than brute force for columns**2 < rows, so if you have more than about 114 columns (the square root of 13000), there may be faster ways to implement this, but this will still likely be the simplest:

from scipy.spatial import KDTree
# Use only the feature columns (columns 2 onward, as in the question), not the id
X = df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k = 50)  # shape of each is (13k, 50)

k_i[i] will then be a list of indices of the 50 closest points to the point at index i with 0 <= i < 13000, and k_d[i] will be the corresponding distances.

EDIT: this should get the dataframe you want, using multi-index:

df_d = {
        idx: {
              df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
              } for i, idx in enumerate(df_distance['id'])
        }
out = pd.DataFrame(df_d).T
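
For the long format the question asks for (`id`, `id_match`, `similarity_distance`), the KDTree result can also be flattened directly. This is a sketch rather than a tested solution for the real data: it assumes the feature columns start at position 2 and that the frame has an `id` column, and the helper name `top_k_long` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

def top_k_long(df, k=50, id_col="id", first_feature_col=2):
    """Return the k nearest rows for every row in long format (hypothetical helper)."""
    # Assumption: feature columns start at position `first_feature_col`.
    X = df.iloc[:, first_feature_col:].to_numpy(dtype=float)
    ids = df[id_col].to_numpy()
    n = len(df)
    k = min(k, n)  # cannot ask for more neighbours than there are rows

    dists, idx = KDTree(X).query(X, k=k)
    # query() returns 1-D arrays when k == 1, so force shape (n, k).
    dists = np.asarray(dists).reshape(n, k)
    idx = np.asarray(idx).reshape(n, k)

    return pd.DataFrame({
        "id": np.repeat(ids, k),         # each id repeated k times
        "id_match": ids[idx.ravel()],    # positional indices mapped back to ids
        "similarity_distance": dists.ravel(),
    })
```

Because the tree is queried with its own points, each row's nearest match is normally the row itself at distance 0; ask for `k=51` and drop those rows if self-matches are not wanted.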

2 Comments

Thanks for your reply, but I am not sure how to get the output in the required format mentioned in the question above. To get output in that format I am using nested for loops, which is time consuming.
I think I have generated a dataframe, but haven't the time to test now.
