How to optimize my code to calculate Euclidean distance

Question

I am trying to compute Euclidean distances between points. My DataFrame has around 13,000 rows, and I need the Euclidean distance from each row to all 13,000 rows so that I can derive similarity scores from them. The code takes far too long to run (more than 24 hours).

Below is my code:

# Empty the existing database
df_similar = pd.DataFrame()
print(df_similar)

# 'i' refers all id's in the dataframe
# Length of df_distance is 13000

for i in tqdm(range(len(df_distance))):
    df_50 = pd.DataFrame(columns=['id', 'id_match', 'similarity_distance'])

    # To avoid duplicates, start "index" at "i" on each pass so that the
    # comparison begins from row "i" itself.
    if i < len(df_distance):
        index = i

    # This loop is used to iterate one id with all 13000 id's. This is time consuming as we have to iterate each id against all 13000 id's 
    for j in (range(len(df_distance))):

        # "a" is the id we are comparing with
        a = df_distance.iloc[i,2:]        

        # "b" is the id we are selecting to compare with
        b = df_distance.iloc[index,2:]

        value = euclidean_dist(a,b)

        # Create a temp dictionary to load the data into dataframe
        dict = {
            'id': df_distance['id'][i], 
            'id_match': df_distance['id'][index], 
            'similarity_distance':value
        }


        df_50 = df_50.append(dict,ignore_index=True)

        # When "index" reaches the end of the dataframe, reset it to 0
        # so that the comparison of "b" with "a" wraps around.
        if index == len(df_distance)-1:
            index = 0
        else:
            index +=1

    # Append the content of "df_50" into "df_similar" once for the iteration of "i"
    df_similar = df_similar.append(df_50,ignore_index=True)

I suspect most of the time is spent in the nested for loops.

This is the Euclidean distance function I am using in my code:

from sklearn.metrics.pairwise import euclidean_distances

def euclidean_dist(a, b):
    euclidean_val = euclidean_distances([a, b])
    value = euclidean_val[0][1]
    return value

Sample df_distance data (image not shown). Note: the values are scaled from the locality column through the last column, and only these values are used to calculate the distance.

The output needs to be in the format below (image not shown).

I think this may answer your question. stackoverflow.com/questions/32946241/…
– rayryeng, Commented Apr 24, 2022 at 14:25
can you show some example data in df_distance and the euclidean_dist function?
– Stuart, Commented Apr 24, 2022 at 14:36
@Stuart tqdm is used to display progress bar in jupyter notebook.
– Mahsaga, Commented Apr 24, 2022 at 16:56
@Stuart I have added the snapshot in the question and euclidean_dist function is also added there now. Thanks!
– Mahsaga, Commented Apr 24, 2022 at 17:01

bara-elba · Accepted Answer · 2022-04-24 14:28:40Z

3

Try using NumPy instead. Something like this:

import pandas as pd
import numpy as np


def numpy_euclidian_distance(point_1, point_2):
    # Convert the inputs to arrays so the arithmetic is vectorised.
    array_1, array_2 = np.array(point_1), np.array(point_2)
    squared_distance = np.sum(np.square(array_1 - array_2))
    distance = np.sqrt(squared_distance)
    return distance


# Initialise the data as a dict of lists.
data = {'num1': [1, 2, 3, 4], 'num2': [20, 21, 19, 18]}

# Create the DataFrame.
df = pd.DataFrame(data)

# Calculate the distance between the two whole columns at once using NumPy.
distance = numpy_euclidian_distance(df.iloc[:, 0], df.iloc[:, 1])
print(distance)
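
The function above handles one pair of vectors at a time. Here is a minimal sketch of applying the same NumPy idea to every pair of rows in one step, assuming the feature columns have been extracted as a float array (for example with df_distance.iloc[:, 2:].to_numpy()). The name pairwise_euclidean and the toy array X are illustrative and not from the question. The sketch uses the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, so only the final (n, n) matrix is allocated.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    # Squared norm of every row, shape (n,).
    sq_norms = np.einsum("ij,ij->i", X, X)
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, evaluated for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Round-off can leave tiny negative values; clip them before the sqrt.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.sqrt(sq_dists)


# Toy data standing in for the scaled feature columns (an assumption).
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)
print(D)
```

For 13,000 rows the full float64 matrix takes roughly 1.35 GB, so it may be worth processing the rows in blocks if memory is tight.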

answered Apr 24, 2022 at 14:28


5 Comments

Onyambu Over a year ago

This will still require looping through.

Onyambu Over a year ago

The problem is that OP has 13000 rows. Your code calculates the distance between row 1 and row 2, not between every row and every other row as OP requested.

Onyambu Over a year ago

the Answer provided below might solve the issue

Mahsaga Over a year ago

@onyambu yes, you are right, the problem for me is in calculating it for 13000 rows

bara-elba Over a year ago

@Mahsaga your code can be optimized to do less than half of the calculations it does now. The distance between two points a and b is the same as the distance between b and a, so you only need to calculate one of them.
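
One way to use that symmetry is sketched below, on the assumption that SciPy is available: scipy.spatial.distance.pdist computes each unordered pair exactly once, and np.triu_indices recovers which two rows each distance belongs to. The ids and feature values are made-up toy data; the column names follow the question's code.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Toy stand-ins for df_distance['id'] and the scaled feature columns (assumptions).
ids = np.array([101, 102, 103])
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed distances: one entry per unordered pair (i < j), n*(n-1)/2 in total.
condensed = pdist(X, metric="euclidean")

# Row indices of each pair, in the same order pdist uses.
i, j = np.triu_indices(len(X), k=1)

pairs = pd.DataFrame({
    "id": ids[i],
    "id_match": ids[j],
    "similarity_distance": condensed,
})
print(pairs)
```

For 13,000 rows this gives about 84.5 million pairs instead of 169 million, which is still large, so keeping only the nearest neighbours per row may be more practical than storing every pair.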

Daniel F · Accepted Answer · 2022-04-26 09:07:53Z

1

OK, so from the comments I take it you want the top 50 distances, which is faster to get in a single step using KDTree. As a warning, KDTree will only be faster than brute force for 2**columns < rows, so if you have more than 13 columns there may be faster ways to implement this, but this will still likely be the simplest:

from scipy.spatial import KDTree

# Use only the feature columns (the question's code slices them with iloc[:, 2:]).
X = df_distance.iloc[:, 2:].values
X_tree = KDTree(X)
k_d, k_i = X_tree.query(X, k=50)  # shape of each is (13k, 50)

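k_i[i] will then be a list of indices of the 50 closest points to the point at index i with 0 <= i < 13000, and k_d[i] will be the corresponding distances. Because X is queried against itself, the nearest neighbour of each point is normally the point itself, at distance 0.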

EDIT: this should get the dataframe you want, using multi-index:

# Map each id to {neighbour id: distance} for its 50 nearest points.
df_d = {
    idx: {
        df_distance['id'][k_i[i, j]]: d for j, d in enumerate(k_d[i])
    } for i, idx in enumerate(df_distance['id'])
}
out = pd.DataFrame(df_d).T
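
If a long table with one row per (id, neighbour) pair is preferred, the KDTree output can also be flattened directly. This is a minimal sketch, not a tested solution for the real data: it assumes the column names id, id_match and similarity_distance from the question's code, and it uses a small made-up df_distance with two leading non-feature columns (the second one, name, is hypothetical) and k = 2 instead of 50.

```python
import numpy as np
import pandas as pd
from scipy.spatial import KDTree

# Toy stand-in for df_distance: two leading non-feature columns, then features.
df_distance = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["a", "b", "c", "d"],   # hypothetical second column
    "locality": [0.0, 0.0, 5.0, 5.0],
    "feature_2": [0.0, 1.0, 5.0, 7.0],
})

k = 2  # the answer uses k = 50 on the full data
X = df_distance.iloc[:, 2:].to_numpy()
k_d, k_i = KDTree(X).query(X, k=k)  # each has shape (n_rows, k)

ids = df_distance["id"].to_numpy()
df_similar = pd.DataFrame({
    "id": np.repeat(ids, k),            # each id repeated once per neighbour
    "id_match": ids[k_i.ravel()],       # neighbour indices mapped back to ids
    "similarity_distance": k_d.ravel(),
})
print(df_similar)
```

Each id's first match is itself at distance 0; those rows can be dropped with df_similar[df_similar["id"] != df_similar["id_match"]] if they are not wanted.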

edited Apr 26, 2022 at 9:07

answered Apr 25, 2022 at 10:58


2 Comments

Mahsaga Over a year ago

Thanks for your reply, but I am not sure how to get the output in the required format I have mentioned in the question above. To get output in that format I am using nested for loops, which is time consuming.

Daniel F Over a year ago

I think I have generated a dataframe, but haven't the time to test now.
