I need to calculate an MD5 hash over multiple DataFrame columns at once.
Function
import java.security.MessageDigest
import org.apache.spark.sql.functions.udf
def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString("")
val md5 = udf((s: String) => toHex(MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8"))))
Example with one column
val test_df = load_df.as('a).select($"a.attr1", md5($"a.attr2").as("hash_key"))
+-------------+--------------------+
|        attr1|            hash_key|
+-------------+--------------------+
|9/1/2015 0:23|7a2f516dad8f13ae1...|
|9/1/2015 0:31|339c72b1870c3a6be...|
|9/1/2015 0:19|7065847af7abc6bce...|
|9/1/2015 1:32|38c7276958809893b...|
+-------------+--------------------+
Generating the hash from one column (a.attr2) works well, but I can't find a good way to pass or concatenate multiple columns into the md5() function.
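To make the goal concrete, here is a minimal sketch in plain Scala (no Spark) of the intended result: several field values joined into one string and hashed once. The object name `Md5MultiColumn`, the helper `md5Of`, and the `|` separator are assumptions made for illustration; they are not part of the original code.

```scala
import java.security.MessageDigest

object Md5MultiColumn {
  // Same hex helper as in the question.
  def toHex(bytes: Array[Byte]): String = bytes.map("%02x".format(_)).mkString("")

  // Hypothetical helper: join the field values with a separator, then hash the
  // combined string. The separator (default "|") is an assumption; it keeps
  // ("ab", "c") and ("a", "bc") from producing the same input string.
  def md5Of(fields: Seq[String], sep: String = "|"): String =
    toHex(MessageDigest.getInstance("MD5").digest(fields.mkString(sep).getBytes("UTF-8")))

  def main(args: Array[String]): Unit = {
    // Prints a 32-character hex digest for the two joined values.
    println(md5Of(Seq("9/1/2015 0:23", "some value")))
  }
}
```

In Spark, the same joining step could probably be done with `concat_ws` from `org.apache.spark.sql.functions` before calling the UDF, for example `md5(concat_ws("|", $"a.attr1", $"a.attr2"))`. The choice of columns here is only an example, and `concat_ws` skips null values, which may or may not be acceptable for a hash key.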