2

df.apply is a method that applies a function to all the columns in a dataframe, or to selected columns. However, my aim is to compute the hash of a string that is the concatenation of all the values in a row, across all columns. My current code returns NaN.

The current code is:

df["row_hash"] = df["row_hash"].apply(self.hash_string)

The function self.hash_string is:

# requires: from hashlib import sha1
def hash_string(self, value):
    return sha1(str(value).encode('utf-8')).hexdigest()

Yes, it would be easier to merge all columns of the Pandas dataframe first, but the current answer to that question couldn't help me either.

The file that I am reading looks like this (the first few rows):

16012,16013,16014,16015,16016,16017,16018,16019,16020,16021,16022
16013,16014,16015,16016,16017,16018,16019,16020,16021,16022,16023
16014,16015,16016,16017,16018,16019,16020,16021,16022,16023,16024
16015,16016,16017,16018,16019,16020,16021,16022,16023,16024,16025
16016,16017,16018,16019,16020,16021,16022,16023,16024,16025,16026

The column names are: col_test_1, col_test_2, ..., col_test_11

3
  • Could you add some sample input? Commented Feb 4, 2019 at 15:11
  • Just to clarify, the non-hashed value of the first row should be something like 160121601316014...? Commented Feb 4, 2019 at 15:17
  • @AndrewF Yes! It is the concatenation of the values of all columns in the same order as in the file and then the hash of that concatenated string. Commented Feb 4, 2019 at 15:19

2 Answers 2

4

You can create a new column that is the concatenation of all the others:

df['new'] = df.astype(str).values.sum(axis=1)

Then apply your hash function to it:

df["row_hash"] = df["new"].apply(self.hash_string)

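Or this one-liner should work (the sum returns a NumPy array, which has no apply method, so it is wrapped in a Series first):

df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(self.hash_string)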

However, I'm not sure you need a separate function here, so:

df["row_hash"] = pd.Series(df.astype(str).values.sum(axis=1), index=df.index).apply(lambda x: sha1(str(x).encode('utf-8')).hexdigest())
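For reference, here is a minimal, self-contained sketch of the concatenate-then-hash idea, using the five sample rows from the question. It assumes `hash_string` is a plain function rather than a method, and it uses `to_numpy(dtype=object)` in place of `.values` so that the row-wise sum concatenates Python strings; both choices are assumptions for illustration, not part of the original answer. Since the row-wise sum is a NumPy array, which has no `.apply`, the sketch wraps it in a Series first.

```python
from hashlib import sha1

import pandas as pd

# Sample rows from the question: each row holds 11 consecutive integers.
rows = [list(range(start, start + 11)) for start in range(16012, 16017)]
columns = [f"col_test_{i}" for i in range(1, 12)]
df = pd.DataFrame(rows, columns=columns)


def hash_string(value):
    """Return the SHA-1 hex digest of the string form of value."""
    return sha1(str(value).encode("utf-8")).hexdigest()


# Summing an object array of strings along axis=1 concatenates them row by row.
# The result is a NumPy array, so wrap it in a Series aligned to df's index.
as_strings = df[columns].astype(str).to_numpy(dtype=object)
concatenated = pd.Series(as_strings.sum(axis=1), index=df.index)

df["row_hash"] = concatenated.apply(hash_string)
```

Selecting `df[columns]` explicitly keeps helper columns such as `new` or `row_hash` out of the concatenated string if the code is run more than once.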

3

You can use apply twice, first on the row elements then on the result:

df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(self.hash_string)

Sidenote: I don't understand why you are defining hash_string as an instance method (instead of a plain function), since it doesn't use the self argument. If that causes problems, you can just pass the hashing logic as a function:

df.apply(lambda x: ''.join(x.astype(str)),axis=1).apply(lambda value: sha1(str(value).encode('utf-8')).hexdigest())
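The double apply can be sketched end to end as follows. This is only an illustration: the two-column frame is a cut-down stand-in for the question's data, and `hash_string` is written as a module-level function, as suggested in the sidenote.

```python
from hashlib import sha1

import pandas as pd


def hash_string(value):
    """Return the SHA-1 hex digest of the string form of value."""
    return sha1(str(value).encode("utf-8")).hexdigest()


# Cut-down stand-in for the question's file (two columns instead of eleven).
df = pd.DataFrame({"col_test_1": [16012, 16013], "col_test_2": [16013, 16014]})

# First apply: join each row's values into a single string.
row_strings = df.apply(lambda row: "".join(row.astype(str)), axis=1)

# Second apply: hash each joined string.
df["row_hash"] = row_strings.apply(hash_string)
```

Keeping `row_strings` as an intermediate variable makes it easy to check the unhashed value before trusting the digests.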

1 Comment

Keeping a single source of truth is my priority so that if one day I change the hashing algorithm to say, md5, it will be a change in one place and not in multiple operations. :)
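One way to keep that single source of truth without an instance method is sketched below. The names `HASH_ALGORITHM` and `add_row_hash` are hypothetical, introduced only for this example; `hashlib.new` is the standard-library constructor that takes the algorithm name as a string.

```python
import hashlib

import pandas as pd

# Hypothetical module-level setting: edit this one line to switch to, say, "md5".
HASH_ALGORITHM = "sha1"


def hash_string(value, algorithm=HASH_ALGORITHM):
    """Return the hex digest of the string form of value."""
    return hashlib.new(algorithm, str(value).encode("utf-8")).hexdigest()


def add_row_hash(df, column="row_hash"):
    """Return a copy of df with a hash of each row's concatenated values."""
    joined = df.astype(str).agg("".join, axis=1)
    return df.assign(**{column: joined.map(hash_string)})
```

Every caller goes through `hash_string`, so changing the algorithm stays a one-line edit.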
