How to compute hash of all the columns in Pandas Dataframe?

Question

df.apply can apply a function to all the columns in a DataFrame, or only to selected columns. However, my aim is to compute the hash of a single string per row, where that string is the concatenation of the row's values across all columns. My current code returns NaN.

The current code is:

df["row_hash"] = df["row_hash"].apply(self.hash_string)

The function self.hash_string is:

from hashlib import sha1

def hash_string(self, value):
    # SHA-1 hex digest of the value's string representation
    return sha1(str(value).encode('utf-8')).hexdigest()

Yes, it would be easier to merge all the columns of the DataFrame first, but the existing answer on that did not help me either.

The first rows of the file that I am reading are:

16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026

The column names are: col_test_1, col_test_2, ..., col_test_11

Just to clarify, the non-hashed value of the first row should be something like 160121601316014...?
– Andrew F, Commented Feb 4, 2019 at 15:17
@AndrewF Yes! It is the concatenation of the values of all columns in the same order as in the file and then the hash of that concatenated string.
– Aviral Srivastava, Commented Feb 4, 2019 at 15:19

vital_dml · Accepted Answer · 2019-02-04 15:20:24Z

4

You can create a new column that is the concatenation of all the others:

df['new'] = df.astype(str).values.sum(axis=1)

And then apply your hash function to it:

df["row_hash"] = df["new"].apply(self.hash_string)

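or this one-liner should work (the NumPy array returned by .values has no .apply method, so it is wrapped in a Series first):

df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(hash_string)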

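However, I am not sure you need a separate function here, so (again wrapping the NumPy result in a Series so that .apply is available):

df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())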
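A self-contained sketch of the concatenate-then-hash idea, for anyone who wants to run it end to end. It assumes the first two sample rows and the column names from the question, defines hash_string as a plain function rather than a method, and swaps in `agg("".join, axis=1)` for the concatenation step because it returns a Series directly. The variable names `rows`, `columns` and `concatenated` are illustrative only.

```python
from hashlib import sha1

import pandas as pd


def hash_string(value):
    """Return the SHA-1 hex digest of the string form of value."""
    return sha1(str(value).encode("utf-8")).hexdigest()


# First two rows of the sample file shown in the question.
rows = [
    list(range(16012, 16023)),
    list(range(16013, 16024)),
]
columns = [f"col_test_{i}" for i in range(1, 12)]
df = pd.DataFrame(rows, columns=columns)

# Select the data columns explicitly so that a pre-existing helper or
# hash column is never folded into the concatenated string.
concatenated = df[columns].astype(str).agg("".join, axis=1)

# Hash each concatenated row string.
df["row_hash"] = concatenated.apply(hash_string)
```

Selecting `df[columns]` explicitly is a design choice: if the frame already contains a `row_hash` or `new` column, concatenating the whole frame would include it in the hashed string.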


Tarifazo · Accepted Answer · 2019-02-04 15:21:41Z

3

You can use apply twice, first on the row elements then on the result:

df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)

Sidenote: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. If that causes problems, you can just pass the hashing logic as a function:

df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
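If the hash function has to stay on a class, one possible arrangement is a static method, which drops the unused self argument while still allowing the call through self. The sketch below is only an illustration: the class name `RowHasher`, the method name `add_row_hash` and the three-column sample frame are assumptions, not part of the question.

```python
from hashlib import sha1

import pandas as pd


class RowHasher:  # hypothetical class name, not from the question
    @staticmethod
    def hash_string(value):
        """Return the SHA-1 hex digest of the string form of value."""
        return sha1(str(value).encode("utf-8")).hexdigest()

    def add_row_hash(self, df):
        """Return a copy of df with a row_hash column appended."""
        # First apply: join each row's values into a single string.
        joined = df.apply(lambda row: "".join(row.astype(str)), axis=1)
        # Second apply: hash each concatenated string.
        return df.assign(row_hash=joined.apply(self.hash_string))


# Shortened sample data (three columns instead of eleven).
df = pd.DataFrame(
    [[16012, 16013, 16014], [16013, 16014, 16015]],
    columns=["col_test_1", "col_test_2", "col_test_3"],
)
hashed = RowHasher().add_row_hash(df)
```

Using `df.assign` returns a new frame and leaves the input untouched, so the hash is always computed from the original data columns.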


Aviral Srivastava Over a year ago

Keeping a single source of truth is my priority so that if one day I change the hashing algorithm to say, md5, it will be a change in one place and not in multiple operations. :)
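One way to keep that single source of truth, sketched under the assumption that a module-level constant is acceptable: name the algorithm once and let hashlib.new look it up. The constant name `HASH_ALGORITHM` and the optional `algorithm` parameter are illustrative, not from the thread.

```python
import hashlib

# Single place to change the algorithm; "sha1" matches the question.
HASH_ALGORITHM = "sha1"


def hash_string(value, algorithm=HASH_ALGORITHM):
    """Return the hex digest of str(value) under the named algorithm."""
    hasher = hashlib.new(algorithm)
    hasher.update(str(value).encode("utf-8"))
    return hasher.hexdigest()
```

Switching to md5 would then mean editing the constant to "md5", with every call such as `df["new"].apply(hash_string)` left unchanged.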
