
I have a DataFrame with an array of bytes in Spark (Python):

DF.select(DF.myfield).show(1, False)
+----------------+                                                              
|myfield         |
+----------------+
|[00 8F 2B 9C 80]|
+----------------+

I'm trying to convert this array to a string:

'008F2B9C80'

and then to a numeric value:

int('008F2B9C80',16)/1000000
> 2402.0

I have found some UDF samples, so I can already extract a part of the array like this:

import pyspark.sql.functions as f

u = f.udf(lambda a: format(a[1], 'x'))
DF.select(u(DF['myfield'])).show()
+------------------+                                                            
|<lambda>(myfield) |
+------------------+
|                8f|
+------------------+

Now, how do I iterate over the whole array? Is it possible to do all the operations I need inside the UDF?

Maybe there is a better way to do the cast?
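For reference, the end-to-end conversion a UDF would have to perform can be sketched in plain Python first (a minimal sketch; it assumes the field reaches the UDF as a Python `bytes`/`bytearray`, which is how a `BinaryType` column is passed in):

```python
def byte_array_to_double(b):
    # interpret the bytes as one big-endian unsigned integer, then scale
    return int.from_bytes(b, byteorder='big', signed=False) / 1000000

print(byte_array_to_double(bytes([0x00, 0x8F, 0x2B, 0x9C, 0x80])))  # 2402.0
```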

Thanks for your help

3 Answers


I have found a Python solution too:

# registered for use from spark.sql; note 1e6 == 1000000
spark.udf.register('ByteArrayToDouble', lambda x: int.from_bytes(x, byteorder='big', signed=False) / 1e6)
spark.sql('select myfield, ByteArrayToDouble(myfield) myfield_python, convert_binary(hex(myfield))/1000000 myfield_scala from my_table').show(1, False)
+-------------+-----------------+----------------+
|myfield      |myfield_python   |myfield_scala   |
+-------------+-----------------+----------------+
|[52 F4 92 80]|1391.76          |1391.76         |
+-------------+-----------------+----------------+
only showing top 1 row

I'm now able to benchmark the two solutions.

Thank you for your valuable help.


I came across this question while answering your newest one.

Suppose you have the df as

+--------------------+
|             myfield|
+--------------------+
|[00, 8F, 2B, 9C, 80]|
|    [52, F4, 92, 80]|
+--------------------+

Now you can use the following UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def func(val):
    # join the hex strings and parse them as one base-16 integer, then scale
    return int("".join(val), 16) / 1000000

func_udf = udf(func, FloatType())

And to create the output, use

df = df.withColumn("myfield1", func_udf("myfield"))

This yields:

+--------------------+--------+
|             myfield|myfield1|
+--------------------+--------+
|[00, 8F, 2B, 9C, 80]|  2402.0|
|    [52, F4, 92, 80]| 1391.76|
+--------------------+--------+


Here is the Scala DataFrame solution. You need to import scala.math.BigInt.

scala> val df = Seq((Array("00","8F","2B","9C","80"))).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: array<string>]

scala> df.withColumn("idstr",concat_ws("",'id)).show
+--------------------+----------+
|                  id|     idstr|
+--------------------+----------+
|[00, 8F, 2B, 9C, 80]|008F2B9C80|
+--------------------+----------+


scala> import scala.math.BigInt
import scala.math.BigInt

scala> def convertBig(x:String):String = BigInt(x.sliding(2,2).map( x=> Integer.parseInt(x,16)).map(_.toByte).toArray).toString
convertBig: (x: String)String

scala> val udf_convertBig =  udf( convertBig(_:String):String )
udf_convertBig: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.withColumn("idstr",concat_ws("",'id)).withColumn("idBig",udf_convertBig('idstr)).show(false)
+--------------------+----------+----------+
|id                  |idstr     |idBig     |
+--------------------+----------+----------+
|[00, 8F, 2B, 9C, 80]|008F2B9C80|2402000000|
+--------------------+----------+----------+



There is no Spark SQL type equivalent to Scala's BigInt, so I'm converting the udf() result to a string.

7 Comments

It sounds very interesting, I'll try to call a Scala UDF from my PySpark project now (medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9).
Thank you, I've created a UDF and it compiles successfully: `package com.mycompany.spark.udf; import org.apache.spark.sql.api.java.UDF1; import scala.math.BigInt; import scala.util.Try; class ConvertBinaryDecimal extends UDF1[String, String] { override def call(TableauBinary: String): String = BigInt(TableauBinary.sliding(2,2).map(TableauBinary => Integer.parseInt(TableauBinary, 16)).map(_.toByte).toArray).toString }` My last problem is calling it directly with the DataFrame binary field. Is it possible to convert the binary to a string inside the function?
That's what I did in my UDF: idstr is a string in my answer.
I missed the `convertBig(_:String):String` part, sorry!
In fact, I wondered whether it is possible to call the UDF directly with the binary field as the parameter, rather than the withColumn string result; something like `df.withColumn("idBig", udf_convertBig('id)).show(false)`, sending the binary array 'id' to the ConvertBinaryDecimal function.