5

I have the following code which basically is doing feature engineering pipeline:

token_q1=Tokenizer(inputCol='question1',outputCol='question1_tokens') 
token_q2=Tokenizer(inputCol='question2',outputCol='question2_tokens')  

remover_q1=StopWordsRemover(inputCol='question1_tokens',outputCol='question1_tokens_filtered')
remover_q2=StopWordsRemover(inputCol='question2_tokens',outputCol='question2_tokens_filtered')

q1w2model = Word2Vec(inputCol='question1_tokens_filtered',outputCol='q1_vectors')
q1w2model.setSeed(1)

q2w2model = Word2Vec(inputCol='question2_tokens_filtered',outputCol='q2_vectors')
q2w2model.setSeed(1)

pipeline=Pipeline(stages[token_q1,token_q2,remover_q1,remover_q2,q1w2model,q2w2model])
model=pipeline.fit(train)
result=model.transform(train)
result.show()

I want to add the following UDF to this above pipeline:

charcount_q1 = F.udf(lambda row : sum([len(char) for char in row]),IntegerType())

When I do that, I get Java error. Can someone point me to the right direction?

However, I was adding this column using the following code which basically works:

charCountq1=train.withColumn("charcountq1", charcount_q1("question1"))

But I want to add it into a pipleline rather than doing this way

0

1 Answer 1

9

If you want to use udf in Pipeline you'll need one of the following:

The first one is quite verbose for such a simple use case so I recommend the second option:

from pyspark.sql.functions import udf
from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

charcount_q1 = spark.udf.register(
    "charcount_q1",
    lambda row : sum(len(char) for char in row),
    "integer"
)

df = spark.createDataFrame(
    [(1, ["spark", "java", "python"])],
    ("id", "question1"))

pipeline = Pipeline(stages = [SQLTransformer(
    statement = "SELECT *, charcount_q1(question1) charcountq1 FROM __THIS__"
)])

pipeline.fit(df).transform(df).show()
# +---+--------------------+-----------+
# | id|           question1|charcountq1|
# +---+--------------------+-----------+
# |  1|[spark, java, pyt...|         15|
# +---+--------------------+-----------+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.