I need to calculate an MD5 hash over multiple DataFrame columns at once.

Function

import java.security.MessageDigest
import org.apache.spark.sql.functions.udf
def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString  // hex-encode the digest
val md5 = udf((s: String) => toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))))

Example with one column

var test_df = load_df.as('a).select($"a.attr1", md5($"a.attr2").as("hash_key"))

+-------------+--------------------+
|        attr1|            hash_key|
+-------------+--------------------+
|9/1/2015 0:23|7a2f516dad8f13ae1...|
|9/1/2015 0:31|339c72b1870c3a6be...|
|9/1/2015 0:19|7065847af7abc6bce...|
|9/1/2015 1:32|38c7276958809893b...|
+-------------+--------------------+

Generating the hash from one column (a.attr2) works well, but I can't find a good way to pass or concatenate multiple columns into the md5() function.

3 Answers

6

You should use concat_ws as follows:

md5(concat_ws(",",$"a.attr2",$"a.attr3",$"a.attr4"))

Here is an example:

Seq(("a","b","c")).toDF("x","y","z").withColumn("foo", md5(concat_ws(",",$"x",$"y",$"z"))).show(false)
// +---+---+---+--------------------------------+
// |x  |y  |z  |foo                             |
// +---+---+---+--------------------------------+
// |a  |b  |c  |a44c56c8177e32d3613988f4dba7962e|
// +---+---+---+--------------------------------+
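One caveat worth checking with this approach: concat_ws skips null inputs, so two rows that differ only in which column is null can end up with the same concatenated string and therefore the same hash. The sketch below is only an illustration under stated assumptions: the sample rows are made up, it uses Spark's built-in md5 from org.apache.spark.sql.functions rather than the UDF from the question, and the "<null>" placeholder is assumed never to occur as real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit, md5}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[1]").appName("md5-nulls").getOrCreate()
import spark.implicits._

// Hypothetical sample rows that differ only in which column is null.
val df = Seq(
  ("a", Option.empty[String], Option("c")),
  ("a", Option("c"), Option.empty[String])
).toDF("x", "y", "z")

val cols = Seq("x", "y", "z")

// Plain concat_ws drops nulls, so both rows become "a,c" and share one hash.
val naive = df.withColumn("hash_key", md5(concat_ws(",", cols.map(col): _*)))

// Replacing nulls with a placeholder (assumption: "<null>" is never real data)
// keeps the column positions distinguishable.
val safe = df.withColumn(
  "hash_key",
  md5(concat_ws(",", cols.map(c => coalesce(col(c), lit("<null>"))): _*))
)

naive.show(false)
safe.show(false)
```

Whether this matters depends on the data; if the hashed columns are never null, the plain concat_ws version above is enough.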

1

Personally, I would do the concatenation inside the UDF; this gives you more flexibility.

For example, passing an array of strings:

val md5 = udf((arrs:Seq[String]) => {
  val s = arrs.mkString(",")
  // do something with s
  s
 })    

df.withColumn("md5",md5(array($"x",$"y",$"z")))

Or even passing the entire row, which would also work if you have columns of mixed type:

val md5 = udf((r:Row) => {
  val s = r.mkString(",")
  // do something with s
  s
 })

df.withColumn("md5",md5(struct($"x",$"y",$"z")))
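To make the placeholder concrete, here is a minimal sketch of the Row variant with the hash actually computed. It reuses the toHex helper from the question; the name rowMd5 and the sample data are hypothetical, and the UDF is deliberately not called md5 so that it does not clash with Spark's built-in function of that name.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{struct, udf}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[1]").appName("md5-row-udf").getOrCreate()
import spark.implicits._

// Hex-encode the digest bytes, as toHex does in the question.
def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString

// rowMd5 is a hypothetical name for the completed UDF.
val rowMd5 = udf((r: Row) => {
  val s = r.mkString(",")  // a null field is rendered as the text "null"
  toHex(MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8)))
})

// Hypothetical sample data with mixed column types.
val df = Seq(("a", 1, 2.5)).toDF("x", "y", "z")
val out = df.withColumn("md5", rowMd5(struct($"x", $"y", $"z")))
out.show(false)
```

The array variant can be completed the same way by replacing the placeholder with the same toHex(...) expression; it just requires all columns to be strings.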

1

If you want to concatenate all columns using a custom delimiter, use this (PySpark):

df.withColumn('row_hash', md5(concat_ws('||', *df.columns)))

Useful for calculating row hash.
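Since the question is in Scala, a rough Scala equivalent of the line above might look like the following. This is a sketch rather than a definitive version: the sample DataFrame is made up, it uses Spark's built-in md5, and the explicit cast to string is an addition so that non-string columns are handled predictably.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, md5}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder().master("local[1]").appName("md5-row-hash").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val df = Seq(("a", 1, "c")).toDF("x", "y", "z")

// Cast every column to string, join with "||", and hash the result.
val allCols = df.columns.map(c => col(c).cast("string"))
val withHash = df.withColumn("row_hash", md5(concat_ws("||", allCols: _*)))

withHash.show(false)
```

Note that df.columns is evaluated when the expression is built, so columns added later are not part of the hash.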
