
I have two tables in SQL

tModel, which has 9 columns (ID, Date, N1, N2, N3, N4, N5, N6, Flag) and 300 million rows

tPairAP, which has 3 columns (ID, N1, N2) and 750 rows

The task I need to perform is to find whether any row of tModel contains, across its columns N1-N6, both the N1 and N2 values of some row of tPairAP.

Despite the size of the table, I first tried to run a query directly in SQL using the following code

WITH Subquery AS (
        SELECT t1.ID
        FROM tModel t1
        RIGHT JOIN tPairAP t2 ON (
                (t1.N1 = t2.N1 OR t1.N2 = t2.N1 OR t1.N3 = t2.N1 OR t1.N4 = t2.N1 OR t1.N5 = t2.N1 OR t1.N6 = t2.N1)
                AND (t1.N1 = t2.N2 OR t1.N2 = t2.N2 OR t1.N3 = t2.N2 OR t1.N4 = t2.N2 OR t1.N5 = t2.N2 OR t1.N6 = t2.N2)
            )
        WHERE t1.N1 IS NOT NULL
    )

    UPDATE tModel
    SET Flag= CASE WHEN Subquery.ID IS NOT NULL THEN 'Y' ELSE 'N' END
    FROM tModel
    LEFT JOIN Subquery ON tModel.ID = Subquery.ID;

Because of the size of the table, the query took 20 hr 30 min to run.

Given how long that took, I thought running the check in Python might be faster, so in order to compare run times I considered two ways of doing it.

The first route I tried was NumPy. I exported the tables to text files so they could be loaded as NumPy arrays. I named the two files after the tables; they contain only N1, N2, N3, N4, N5, N6 for tModel.txt and N1, N2 for tPairAP.txt.

I then also thought of two different ways for NumPy to handle this task.

NUMPY OPTION 1

import numpy as np

# Load the text file as numpy arrays
tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')


# Check if any arrays of Array1 contain all elements from any arrays of Array2
contains_elements = np.array([np.all(np.isin(tModel, arr)) for arr in tPairAP])

# Create a list of flags based on the contains_elements array
flags = np.where(contains_elements, 'Y', 'N')

# Write the list of lists of List1 and flags to a new file
with open('C:\\py\\SQL ON\\tModelNoPair.txt', 'w') as file:
    for i in range(len(tModel)):
        file.write(str(tModel[i]) + ' ' + flags[i] + '\n')

To my great surprise, I found that the Python code takes ages, longer even than SQL: I had the above code running for 33 hours and it was nowhere near finished.
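As a quick sanity check (on small toy arrays assumed here, not the real files), the core expression of NUMPY OPTION 1 can be run on a sample; it produces one boolean per tPairAP pair rather than one per tModel row, and np.all asks whether every value of tModel belongs to the pair, so flags would have length 750 rather than 300 million:

```python
import numpy as np

# Toy stand-ins for the real files (values are made up for illustration)
tModel = np.array([[1, 3, 2, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12],
                   [1, 3, 5, 7, 9, 11]])
tPairAP = np.array([[3, 4], [9, 10], [1, 10]])

# Same expression as in NUMPY OPTION 1
contains_elements = np.array([np.all(np.isin(tModel, arr)) for arr in tPairAP])

print(contains_elements.shape)  # (3,) -- one boolean per *pair*, not per tModel row
print(contains_elements)        # [False False False]: asks if EVERY tModel value is in the pair
```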

NUMPY OPTION 2

import numpy as np

# Convert List1 and List2 to numpy arrays
tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')

# Create a list to store the flags
flags = []

# Iterate over each list in List1
for l1 in tModel:
    flag = 'N'
    # Iterate over each list in List2
    for l2 in tPairAP:
        if np.isin(l1, l2).sum() == 2:
            flag = 'Y'
            break
    flags.append(flag)

# Write the results to a file
with open('C:\\py\\SQL ON\\tModelNoPair.txt', 'w') as file:
    for i, flag in enumerate(flags):
        file.write(f'{tModel[i]} {flag}\n')

I did not even try to run this second option, as I see it more as a generic version I could reuse if tPairAP ever had not only N1, N2 but also N3, N4, ..., Nn. However, despite my assumptions about its speed, given how long NUMPY OPTION 1 ran, I do not think it will be faster than the SQL query.
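For what it's worth, the per-row idea behind NUMPY OPTION 2 can be vectorised: instead of looping over 300 million rows in Python, loop over the 750 pairs and test each pair against all rows at once. A minimal sketch on toy arrays (made-up values; an exact-equality test also avoids the isin(...).sum() == 2 pitfall, which can miscount when a row contains duplicate values):

```python
import numpy as np

# Toy stand-ins for the real tables (hypothetical values)
tModel = np.array([[1, 3, 2, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12],
                   [1, 3, 5, 7, 9, 11]])
tPairAP = np.array([[3, 4], [9, 10], [1, 10]])

# One boolean per tModel row; OR-accumulate across the pairs,
# so the Python loop runs 750 times instead of 300 million.
has_pair = np.zeros(len(tModel), dtype=bool)
for a, b in tPairAP:
    has_pair |= (tModel == a).any(axis=1) & (tModel == b).any(axis=1)

flags = np.where(has_pair, 'Y', 'N')
print(flags)  # ['Y' 'Y' 'N']
```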

The third option I thought about was a plain-Python-list approach. I wrote the following code:

LIST OPTION

import numpy as np
def check_elements(row, tPairAP):
    # 'Y' if this single tModel row contains both values of any pair
    for pair in tPairAP:
        if all(elem in row for elem in pair):
            return 'Y'
    return 'N'

# lists
tModel = np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=',')
tPairAP = np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=',')

flags = []
for row in tModel:
    flags.append(check_elements(row, tPairAP))

with open('output.txt', 'w') as file:
    for row, flag in zip(tModel, flags):
        file.write(str(row) + ' ' + flag + '\n')

Now, needless to say, before even kicking off LIST OPTION I did some research online to understand whether lists are faster than NumPy, and it seems NumPy is faster at scanning through arrays than plain lists. Because of this I have not run the code, as I would not expect it to be any faster than NumPy.

So my question is: how is it possible that NumPy is slower than SQL? To my knowledge (I am not a Python expert, though I use NumPy for other things such as very large variance-covariance matrices), NumPy is supposed to be faster, but here it appears not to be.

Can anyone tell me whether my Python code is badly written and, if so, how to correct it to make it fast, faster than SQL? Or is there a better library for this task, ideally with an example of how to use it?

Thanks

+++UPDATE+++ As suggested by @Grismar, my Python code (NUMPY OPTION 1 and 2) was not optimal. In fact it was not even correct: it did not produce the right output. I was too overconfident that my code was correct without testing it on a small sample.

I have therefore corrected NUMPY OPTION 1, using two small sample arrays to make sure the output is the correct one (the one I was looking for).

import numpy as np

# Load the text file as NumPy arrays
tModel = np.array([[1, 3, 2, 4, 5, 6], [7, 8, 9, 10, 11, 12],[1,3,5,7,9,11]])
tPairAP = np.array([[3, 4], [9, 10],[1,10]])
#tModel = np.array(np.loadtxt('C:\\py\\SQL ON\\py_files\\tModel.txt', delimiter=','))
#tPairAP = np.array(np.loadtxt('C:\\py\\SQL ON\\py_files\\tPairAP.txt', delimiter=','))

# First pass: mimic the SQL join, collecting the matching tModel rows
matched = []
for row_t1 in tModel:
    for row_t2 in tPairAP:
        if ((row_t1[0] == row_t2[0] or row_t1[1] == row_t2[0] or row_t1[2] == row_t2[0] or row_t1[3] == row_t2[0] or row_t1[4] == row_t2[0] or row_t1[5] == row_t2[0]) and
            (row_t1[0] == row_t2[1] or row_t1[1] == row_t2[1] or row_t1[2] == row_t2[1] or row_t1[3] == row_t2[1] or row_t1[4] == row_t2[1] or row_t1[5] == row_t2[1])):
            matched.append(row_t1)
            break  # avoid appending the same row twice if several pairs match

matched = np.array(matched)

# Second pass: flag every tModel row with 'Y'/'N'
result = []
for row_t1 in tModel:
    flag = 'N'
    for row_t2 in tPairAP:
        if all(elem in row_t1 for elem in row_t2):
            flag = 'Y'
            break
    result.append(np.append(row_t1, flag))

result = np.array(result)
for row in result:
    print(row)

The output is now the following

['1' '3' '2' '4' '5' '6' 'Y']
['7' '8' '9' '10' '11' '12' 'Y']
['1' '3' '5' '7' '9' '11' 'N']

Now, even this version takes longer than SQL, and according to @Grismar this is to be expected.

I have done some research and I understand there is a library called itertools that might prove much faster than SQL and NumPy.

Does anyone know how to achieve what I have achieved with SQL and NumPy by using itertools?
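For reference, a sketch of how this check could look with itertools.combinations (toy arrays assumed; each row's C(6,2) = 15 unordered value pairs are tested against a set of frozensets, so each lookup is O(1) — whether this beats NumPy or SQL at 300 million rows is untested):

```python
from itertools import combinations

import numpy as np

# Toy stand-ins for the real tables (hypothetical values)
tModel = np.array([[1, 3, 2, 4, 5, 6],
                   [7, 8, 9, 10, 11, 12],
                   [1, 3, 5, 7, 9, 11]])
tPairAP = np.array([[3, 4], [9, 10], [1, 10]])

# Order-insensitive set of target pairs, for O(1) membership tests
pairs = {frozenset(p) for p in tPairAP.tolist()}

# For each row, test its 15 unordered value pairs against the set
flags = ['Y' if any(frozenset(c) in pairs for c in combinations(row, 2)) else 'N'
         for row in tModel.tolist()]
print(flags)  # ['Y', 'Y', 'N']
```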

  • "find if any rows" -> do you want to find if any rows exist, or find all the rows that match your criteria? Commented Apr 17, 2024 at 10:41
  • Hi @snakecharmerb, thanks for the comment. To clarify what I am looking for: as with the SQL, I want all the rows that match my criteria. For rows in tModel that match, it should return the tModel row with the flag 'Y' appended; for rows that do not match, it should return the tModel row with the flag 'N' appended. Commented Apr 17, 2024 at 11:02

2 Answers


Algorithm: this may be a bit quicker, as it left-joins the large table onto the smaller table over several steps. It assumes that tModel ID is unique for each row. Given the number of rows, it may be worth running in stages (separated by ;). First pivot the small table tPairAP so that N1 and N2 are stacked vertically into a column Ny, and save it as tPairAPx. Then we can test whether tModel t1.N1 is in tPairAPx column Ny, bringing back the other five columns (N2, N3, N4, N5, N6) as (A1, A2, A3, A4, A5) along with Nx. Repeat the test for tModel t1.N2, bringing back (N1, N3, N4, N5, N6) as (A1, A2, A3, A4, A5) along with Nx, and likewise for N3, N4, N5 and N6. Union these six test tables and select where Nx is not null to give a smaller table, Step1.

Then test whether each of A1, A2, A3, A4, A5 is in tPairAPx column Ny, keeping the rows where T2_ID is not null to give a smaller table, Step2. Finally, left join tModel to Step2 with a CASE WHEN statement to produce the result.

create table tPairAPx
select t1.ID, "N1" as Nx, t1.N1 as Ny from tPairAP t1
union 
select t2.ID, "N2" as Nx, t2.N2 as Ny from tPairAP t2;

create table Step1
select Id, A1, A2, A3, A4, A5 from
(
      select t1.Id, t1.N2 as A1, t1.N3 as A2, t1.N4 as A3, t1.N5 as A4, t1.N6 as A5, t2.Nx
FROM tModel t1 LEFT JOIN tPairAPx t2 on t1.N1 = t2.Ny
union select t1.Id, t1.N1 as A1, t1.N3 as A2, t1.N4 as A3, t1.N5 as A4, t1.N6 as A5, t2.Nx
FROM tModel t1 LEFT JOIN tPairAPx t2 on t1.N2 = t2.Ny
union select t1.Id, t1.N1 as A1, t1.N2 as A2, t1.N4 as A3, t1.N5 as A4, t1.N6 as A5, t2.Nx
FROM tModel t1 LEFT JOIN tPairAPx t2 on t1.N3 = t2.Ny
union select t1.Id, t1.N1 as A1, t1.N2 as A2, t1.N3 as A3, t1.N5 as A4, t1.N6 as A5, t2.Nx
FROM tModel t1 LEFT JOIN tPairAPx t2 on t1.N4 = t2.Ny
union select t1.Id, t1.N1 as A1, t1.N2 as A2, t1.N3 as A3, t1.N4 as A4, t1.N6 as A5, t2.Nx
FROM tModel t1 LEFT JOIN tPairAPx t2 on t1.N5 = t2.Ny
union select t1.Id, t1.N1 as A1, t1.N2 as A2, t1.N3 as A3, t1.N4 as A4, t1.N5 as A5, t2.Nx
FROM tModel t1 LEFT JOIN tPairAPx t2 on t1.N6 = t2.Ny
)
where Nx is not null;

create table Step2
select t1.ID from (
select t1.Id, t2.ID as T2_ID from Step1 t1 LEFT JOIN tPairAPx t2 on t1.A1 = t2.Ny
union
select t1.Id, t2.ID as T2_ID from Step1 t1 LEFT JOIN tPairAPx t2 on t1.A2 = t2.Ny
union
select t1.Id, t2.ID as T2_ID from Step1 t1 LEFT JOIN tPairAPx t2 on t1.A3 = t2.Ny
union
select t1.Id, t2.ID as T2_ID from Step1 t1 LEFT JOIN tPairAPx t2 on t1.A4 = t2.Ny
union
select t1.Id, t2.ID as T2_ID from Step1 t1 LEFT JOIN tPairAPx t2 on t1.A5 = t2.Ny
)
where T2_ID is not null;

select t1.ID, t1.Date, t1.N1, t1.N2, t1.N3, t1.N4, t1.N5, t1.N6,
case when t1.ID in (select ID from Step2) then "Y"
else "N" end as Flag
from tModel t1 left join Step2 t2 on t1.ID = t2.ID

1 Comment

Thanks! I will definitely try your code to see if I can reduce the running time of the query.

There are several possible reasons why SQL is faster than numpy in this case. You didn't mention which RDBMS or engine you're running this SQL on, but even assuming SQLite, all it needs to do is load the tables and then perform the same operation you'd try to build with numpy, except that it's already optimised for exactly that type of task.

However, a second reason is that your numpy solution isn't optimal. An improvement might be:

import numpy as np

# only load the relevant columns, avoiding datatype issues
tModel = np.genfromtxt(r'C:\py\SQL ON\py_files\tModel.txt', delimiter=',',
                       encoding='utf-8', dtype=None, usecols=[0, 2, 3])
tPairAP = np.genfromtxt(r'C:\py\SQL ON\py_files\tPairAP.txt', delimiter=',', 
                        encoding='utf-8')

# reshape for broadcasting
tModel_ext = np.expand_dims(tModel, axis=1)
matches = np.any(np.all(tModel_ext == tPairAP, axis=2), axis=1)

np.savetxt(r'C:\py\SQL ON\py_files\tModelNoPair.txt', 
           tModel[matches], fmt='%s', delimiter=',')

That solution only writes the matching records.

If you prefer to write all the data with 'Y'/'N' flags:

np.savetxt(r'C:\py\SQL ON\py_files\tModelNoPair_flags.txt', 
           np.c_[tModel, np.where(matches, 'Y', 'N')], fmt='%s', delimiter=',')

4 Comments

Thanks for this. I am using SQL Server, fully licensed. Just to understand: are you saying there will be no speed gain from running this task in Python? If so I'll be happy to use SQL.
You might be able to get it to work as fast or perhaps even a bit faster if you find an optimal solution using numpy or perhaps something like xarray, but it won't be a lot faster. So unless you need to run this script very often, it's probably not worth the effort. If performance is key, you would do well to look at more optimised solutions in a compiled language. (Mojo, Rust, C++, etc.)
As for SQL Server, that's a very capable RDBMS, but a bit overkill, with a bunch of overhead, if you just have two .txt data files that need processing. You could see if using SQLite locally would get you better results, provided you have enough free memory to run the whole thing with your dataset.
Hi, and thanks again for your comment. I have just updated my original question; I coded something different from yours to mimic the SQL, but it is still not an optimal solution. I tried to run your code but I get an error at matches = np.any(np.all(tModel_ext == tPairAP, axis=2), axis=1): numpy.AxisError: axis 2 is out of bounds for array of dimension 0 (raised from numpy\core\fromnumeric.py).
