
I have two fixed-width files like the ones below (the only difference is the Date value starting at position 14).

sample_hash1.txt

GOKULKRISHNA 04/17/2018
ABCDEFGHIJKL 04/17/2018
111111111111 04/17/2018

sample_hash2.txt

GOKULKRISHNA 04/16/2018
ABCDEFGHIJKL 04/16/2018
111111111111 04/16/2018

Using pandas read_fwf, I am reading each file and creating a DataFrame that excludes the Date value by loading only the first 13 characters. My DataFrames look like this:

import pandas as pd
df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0,13)])
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0,13)])

df1

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111
...

df2

   GOKULKRISHNA
0  ABCDEFGHIJKL
1  111111111111
...

Now I am trying to generate a hash value for each DataFrame, but the hashes for df1 and df2 are different. I'm not sure what's wrong here; could someone shed some light on this, please? I have to identify whether there is any change in the data between the files (excluding the Date column).

print(hash(df1.values.tostring()))
-3571422965125408226

print(hash(df2.values.tostring()))
5039867957859242153

I am loading these files into a table (each full file is around 2 GB). Each time, we receive full files from the source, and sometimes there is no change in the data (excluding the last column, Date). My idea is to reject such files: if I can generate a hash for the file and store it somewhere (in a table), next time I can compare the new file's hash against the stored one. I thought this was the right approach, but I got stuck on the hash generation.
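To make the plan concrete, here is a minimal sketch of the intended workflow. The stored_hashes dict is a hypothetical stand-in for the table the hash would be stored in, and compute_hash is the piece this question is asking about:

import pandas as pd

# Hypothetical stand-in for the table where the hash would be stored.
stored_hashes = {}

def compute_hash(df):
    # The piece I am stuck on: it must return the same value whenever
    # the first 13 characters of every line are unchanged.
    raise NotImplementedError

new_df = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)])
new_hash = compute_hash(new_df)

if stored_hashes.get("sample_hash1.txt") == new_hash:
    print("no change (excluding Date) - reject the file")
else:
    stored_hashes["sample_hash1.txt"] = new_hash
    # ... proceed with loading the file into the table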

I checked the post Most efficient property to hash for numpy array, but that is not what I am looking for.

  • The hash will be different for different objects; the two dataframes are not the same. Try df1.values.tostring() == df2.values.tostring() - it should be False. If you want the same hash, you need to remove the date from the values before taking the hash. Commented Apr 17, 2018 at 16:37
  • Yes, it is False. Is there any other way I can generate a unique code based on the data in the file (excluding some part of the data)? Commented Apr 17, 2018 at 16:41
  • You can try hash(df1[:-1].values.tostring()) to remove the last column. Commented Apr 17, 2018 at 16:54
  • Possible duplicate of Most efficient property to hash for numpy array. Commented Apr 17, 2018 at 17:08
  • @TwistedSim The last column is not in the dataframe anyway; I am loading only the first 13 characters. Commented Apr 17, 2018 at 17:14

4 Answers


You can now use pd.util.hash_pandas_object:

hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest() 

For a dataframe with 50 million rows, this method took me 10 seconds versus over a minute for the to_json() method.
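Applied to the two files from the question, a sketch like this should print True (header=None is an assumption added so the first row is read as data rather than consumed as a header):

import hashlib
import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)], header=None)
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)], header=None)

# hash_pandas_object returns one uint64 per row; hashing its raw bytes
# collapses the Series into a single stable digest.
h1 = hashlib.sha1(pd.util.hash_pandas_object(df1).values).hexdigest()
h2 = hashlib.sha1(pd.util.hash_pandas_object(df2).values).hexdigest()

print(h1 == h2)  # True - the files match once the Date column is excluded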


3 Comments

  • How would you return a single hash for the entire dataframe, though?
  • This answer worked well for me. pd.util.hash_pandas_object(df) on its own returns a Series (one hash per row of the dataframe), but the full expression, hashlib.sha1(pd.util.hash_pandas_object(df).values).hexdigest(), produced a single hash for the dataframe.
  • But this works only for the "content" of the dataframe and not its metadata, such as column and row names.

Use the string representation of the dataframe.

import hashlib

print(hashlib.sha256(df1.to_json().encode()).hexdigest())
print(hashlib.sha256(df2.to_json().encode()).hexdigest())

or

print(hashlib.sha256(df1.to_csv().encode()).hexdigest())
print(hashlib.sha256(df2.to_csv().encode()).hexdigest())
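Note that unlike the builtin hash(), which is randomized per process for strings, a hashlib digest of the serialized frame is stable across runs and machines. A small sketch contrasting the two:

import hashlib
import pandas as pd

df = pd.DataFrame({"A": ["x", "y"]})

# Varies between runs: string hashing is salted per process
# unless PYTHONHASHSEED is fixed.
print(hash(df.to_json()))

# Identical on every run for the same serialized content.
print(hashlib.sha256(df.to_json().encode()).hexdigest())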

3 Comments

  • Awesome, this is working. But I think hash generation will be slow on big files?
  • Do you know why running the example in different runs gives different hashes? Do you know how to get the same hash for the same dataframe across different runs of the code?
  • This method is slow.

The other answers here forget the column names (column index) of a dataframe. pd.util.hash_pandas_object() creates a series of hash values, one for each row of the dataframe, including its index (the row names). But the column names don't matter, as you can see here:

>>> from pandas import *
>>> from pandas import util
>>> util.hash_pandas_object(DataFrame({'A': [1,2,3], 'B': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64
>>> util.hash_pandas_object(DataFrame({'Foo': [1,2,3], 'Bar': [4,5,6]}))
0     580038878669277522
1    2529894495349307502
2    4389717532997776129
dtype: uint64

My solution

import hashlib
import pandas

df = pandas.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# per-row hashes (these already include the row index)
hashes = pandas.util.hash_pandas_object(df)

# append the hashes of the column names
hashes = pandas.concat(
    [hashes, pandas.util.hash_pandas_object(df.columns)]
)

# hash the series of hashes into a single digest
digest = hashlib.sha1(hashes.values).hexdigest()

print(digest)
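A quick check (a sketch) that renaming the columns now changes the digest:

df2 = pandas.DataFrame({'Foo': [1, 2, 3], 'Bar': [4, 5, 6]})

hashes2 = pandas.concat(
    [pandas.util.hash_pandas_object(df2), pandas.util.hash_pandas_object(df2.columns)]
)
digest2 = hashlib.sha1(hashes2.values).hexdigest()

print(digest == digest2)  # False - the column names are now part of the digest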



In addition to the other answers - a simple and fast checksum:

checksum = pandas.util.hash_pandas_object(df).sum()
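A minimal sketch applying it to the two files from the question (header=None is an assumption so the first row is kept as data); since the per-row hashes are identical, the summed checksums should match:

import pandas as pd

df1 = pd.read_fwf("sample_hash1.txt", colspecs=[(0, 13)], header=None)
df2 = pd.read_fwf("sample_hash2.txt", colspecs=[(0, 13)], header=None)

# The uint64 row hashes are summed into a single number
# (the sum wraps on overflow, which is fine for change detection).
print(pd.util.hash_pandas_object(df1).sum() == pd.util.hash_pandas_object(df2).sum())  # True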
