I have two tables in SQL:
tModel, which has 9 columns (ID, Date, N1, N2, N3, N4, N5, N6, Flag) and 300 million rows
tPairAP, which has 3 columns (ID, N1, N2) and 750 rows
The task I need to perform is to find, for each row of tModel, whether both N1 and N2 of some tPairAP row appear among that row's columns N1, N2, N3, N4, N5, N6.
Despite the size of the table, I first tried running a query directly in SQL:
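To make the matching rule concrete, here is a tiny illustration with made-up values (the numbers are hypothetical, not from the real tables): a tModel row should be flagged 'Y' when both values of some tPairAP pair appear anywhere among its six N columns.

```python
# Hypothetical example of the matching rule
row = [1, 3, 2, 4, 5, 6]   # one tModel row (N1..N6)
pair = (3, 4)              # one tPairAP pair (N1, N2)

# The row matches when both values of the pair appear somewhere in it
match = pair[0] in row and pair[1] in row
print(match)  # True: 3 and 4 both appear in the row
```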
WITH Subquery AS (
    SELECT t1.ID
    FROM tModel t1
    RIGHT JOIN tPairAP t2 ON (
        (t1.N1 = t2.N1 OR t1.N2 = t2.N1 OR t1.N3 = t2.N1 OR t1.N4 = t2.N1 OR t1.N5 = t2.N1 OR t1.N6 = t2.N1)
        AND (t1.N1 = t2.N2 OR t1.N2 = t2.N2 OR t1.N3 = t2.N2 OR t1.N4 = t2.N2 OR t1.N5 = t2.N2 OR t1.N6 = t2.N2)
    )
    WHERE t1.N1 IS NOT NULL
)
UPDATE tModel
SET Flag = CASE WHEN Subquery.ID IS NOT NULL THEN 'Y' ELSE 'N' END
FROM tModel
LEFT JOIN Subquery ON tModel.ID = Subquery.ID;
Due to the size of the table, the query took 20 hours and 30 minutes to run.
Given that run time, I thought performing the check in Python would be faster, so to compare run times I considered two ways of doing it.
The first route I tried was NumPy. To do so I created text files that could be loaded as NumPy arrays. I named the two files after the tables; they contain only N1, N2, N3, N4, N5, N6 for tModel.txt and N1, N2 for tPairAP.txt.
I then coded two different ways for NumPy to handle this task.
NUMPY OPTION 1
import numpy as np

# Load the text files as NumPy arrays
tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')

# Check if any rows of tModel contain all elements from any row of tPairAP
contains_elements = np.array([np.all(np.isin(tModel, arr)) for arr in tPairAP])

# Create an array of flags based on the contains_elements array
flags = np.where(contains_elements, 'Y', 'N')

# Write the rows of tModel and their flags to a new file
with open('C:\\py\\SQL ON\\tModelNoPair.txt', 'w') as file:
    for i in range(len(tModel)):
        file.write(str(tModel[i]) + ' ' + flags[i] + '\n')
To my great surprise, I found that the Python code takes ages, longer than the time taken by SQL. I had the above code running for 33 hours and it wasn't even close to completing the run.
NUMPY OPTION 2
import numpy as np

# Load the text files as NumPy arrays
tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')

# Create a list to store the flags
flags = []

# Iterate over each row in tModel
for l1 in tModel:
    flag = 'N'
    # Iterate over each row in tPairAP
    for l2 in tPairAP:
        if np.isin(l1, l2).sum() == 2:
            flag = 'Y'
            break
    flags.append(flag)

# Write the results to a file
with open('C:\\py\\SQL ON\\tModelNoPair.txt', 'w') as file:
    for i, flag in enumerate(flags):
        file.write(f'{tModel[i]} {flag}\n')
I did not even try to run this second option, as I think of it as a more generic version I can use in case tPairAP has, for example, not only N1, N2 but also N3, N4, ..., Nn. However, given the run time of NUMPY OPTION 1, I do not expect it to run faster than the SQL query.
The third option I thought about was a script using plain Python lists. I wrote the following code:
LIST OPTION
import numpy as np

def check_elements(row, pairs):
    # Return 'Y' if any pair is fully contained in the given row
    for pair in pairs:
        if all(elem in row for elem in pair):
            return 'Y'
    return 'N'

# Load the data
tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')

flags = []
for sublist1 in tModel:
    flags.append(check_elements(sublist1, tPairAP))

with open('output.txt', 'w') as file:
    for sublist, flag in zip(tModel, flags):
        file.write(str(sublist) + ' ' + flag + '\n')
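Before letting any version loose on 300 million rows, the containment test itself can be sanity-checked on a tiny hand-made sample (the values below are made up for illustration):

```python
def row_matches(row, pairs):
    # 'Y' if any pair is fully contained in the row, 'N' otherwise
    for pair in pairs:
        if all(elem in row for elem in pair):
            return 'Y'
    return 'N'

sample_model = [[1, 3, 2, 4, 5, 6], [7, 8, 9, 10, 11, 12], [1, 3, 5, 7, 9, 11]]
sample_pairs = [(3, 4), (9, 10), (1, 10)]
flags = [row_matches(row, sample_pairs) for row in sample_model]
print(flags)  # ['Y', 'Y', 'N']
```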
Now, needless to say, before even kicking off LIST OPTION I did some research online to understand whether lists are faster than NumPy, and it seems NumPy is faster than lists for checking through and within arrays. Because of this I have not run the code, as I do not expect it to be any faster than NumPy.
So my question is: how is it possible that NumPy is slower than SQL? To my knowledge, and despite not being a Python expert (I use NumPy for other things, like very large variance-covariance matrices), NumPy is supposed to be faster, but it looks like it is not.
Can anyone tell me whether my Python code is badly written, and if so how to correct it to make it fast, faster than SQL? Or is there a better library to perform this task, ideally with an example of how to use it?
Thanks
+++UPDATE+++ As suggested by @Grismar, my Python code (NUMPY OPTION 1 and 2) was not optimal. In fact it was not even correct, as it did not produce the right output. I was too overconfident that my code was correct and never tested it on a small sample.
I have therefore corrected NUMPY OPTION 1, using two small sample arrays to make sure the output is the one I was looking for:
import numpy as np

# Small sample arrays in place of the real data
tModel = np.array([[1, 3, 2, 4, 5, 6], [7, 8, 9, 10, 11, 12], [1, 3, 5, 7, 9, 11]])
tPairAP = np.array([[3, 4], [9, 10], [1, 10]])
# tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
# tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')

# Perform the right join operation: collect the matching rows
matches = []
for row_t1 in tModel:
    for row_t2 in tPairAP:
        if ((row_t1[0] == row_t2[0] or row_t1[1] == row_t2[0] or row_t1[2] == row_t2[0] or row_t1[3] == row_t2[0] or row_t1[4] == row_t2[0] or row_t1[5] == row_t2[0]) and
                (row_t1[0] == row_t2[1] or row_t1[1] == row_t2[1] or row_t1[2] == row_t2[1] or row_t1[3] == row_t2[1] or row_t1[4] == row_t2[1] or row_t1[5] == row_t2[1])):
            matches.append(row_t1)
matches = np.array(matches)

# Flag each row of tModel
result = []
for row_t1 in tModel:
    flag = 'N'
    for row_t2 in tPairAP:
        if all(elem in row_t1 for elem in row_t2):
            flag = 'Y'
            break
    result.append(np.append(row_t1, flag))
result = np.array(result)

for row in result:
    print(row)
The output is now the following:
['1' '3' '2' '4' '5' '6' 'Y']
['7' '8' '9' '10' '11' '12' 'Y']
['1' '3' '5' '7' '9' '11' 'N']
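Since the slow part is the Python-level loop over 300 million rows, one sketch of a more vectorised variant (shown here on the same made-up sample arrays) is to loop only over the 750 pairs and let NumPy scan all rows at once for each pair:

```python
import numpy as np

# Same made-up sample data as above
tModel = np.array([[1, 3, 2, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12],
                   [1, 3, 5, 7, 9, 11]])
tPairAP = np.array([[3, 4], [9, 10], [1, 10]])

# One boolean per tModel row; for each pair, test every row in a single
# vectorised pass and OR the results together
match = np.zeros(len(tModel), dtype=bool)
for n1, n2 in tPairAP:
    match |= (tModel == n1).any(axis=1) & (tModel == n2).any(axis=1)

flags = np.where(match, 'Y', 'N')
print(flags)  # ['Y' 'Y' 'N']
```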
Now even this version takes longer than SQL, and according to @Grismar that is to be expected.
I have done some research and I understand there is a library called itertools that could prove to be much faster than SQL and NumPy.
Does anyone know how to achieve what I have achieved with SQL and NumPy by using itertools?