I am new to Spark Streaming and Pandas UDF. I am working on pyspark consumer from kafka, payload is of xml format and trying to parse the incoming xml by applying pandas udf
@pandas_udf("col1 string, col2 string",PandasUDFType.GROUPED_MAP)
def test_udf(df):
import xmltodict
from collections import MutableMapping
xml_str=df.iloc[0,0]
df_col=['col1', 'col2']
doc=xmltodict.parse(xml_str,dict_constructor=dict)
extract_needed_fields = { k:doc[k] for k in df_col }
return pd.DataFrame( [{'col1': 'abc', 'col2': 'def'}] , index=[0] , dtype="string" )
data=df.selectExpr("CAST(value AS STRING) AS value")
data.groupby("value").apply(test_udf).writeStream.format("console").start()
I get the below error
File "pyarrow/array.pxi", line 859, in pyarrow.lib.Array.from_pandas
File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Is this the right approach ? What am i doing wrong