0

I need an algorithm or method of computing a checksum of a sql column that is easy to replicate in python given a csv.

I want to verify that the csv column and the sql column match


I have a scheme summing the binary_checksums of each row in the column on the sql side and python side into two overall column sums but I'm worried about collisions and I want to know if there's a faster or better way.


I need a function where c is a full sql column such that


python_function(c) == pyquery("EXEC sql_function("some_table",c))

where python_function(c) and sql_column(c) return something like a hash or checksum

one function doesn't need to encompass all possible types of c either. although it's a plus if it does. you could give a scheme specific to varchars or ints or bytes, etc.

The csv's will be large almost 50 million rows 66 columns (varchar, ints, bits,smallints, decimal, numeric).

The csv comes from an external source and I need to verify that it matches the data in my database.

doesn't need to be 100% accurate missing less than 100,000 rows of difference is fine

As an example here's a high level implementation of my solution in python pseudocode.

def is_likely_equal(csv_filename, column_name):
    column_data = get_column_data(csv_filename,column)
    # I know it won't fit in memory this is an example

    python_b_sum = get_b_sum(column_data) 
    sql_b_sum = some_db.execute("SELECT SUM(BINARY_CHECKSUM(column_name)) FROM table")

    if python_b_sum == sql_b_sum:
         return True
    else:
         return False
   
def get_b_sum(column_data):
    b_sum = 0
    for entry in column_data:
        b_sum += b_checksum(entry)
    return b_sum
 

11
  • 1
    What is the actual purpose of having a checksum? Commented Jun 30, 2020 at 22:15
  • So I can compare the checksum of what I generate with python to the checksum of what I generate with SQL. And If they match I know both columns are the same. I'll update the question Commented Jun 30, 2020 at 22:19
  • All checksums have the possibility of collision. Some more than others, but it's inescapable. Commented Jun 30, 2020 at 22:21
  • Then why are you concerned with collisions? Commented Jun 30, 2020 at 22:22
  • @AlwaysLearning I know. this is for data qualtiy purposes. reducing the chances of an undetected fault to the chances of a checksum collision is fine for me. Commented Jun 30, 2020 at 22:24

0

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.