I'm using the UCI bank marketing data to template out a project. I was following the PySpark tutorial on their documentation site (sorry, I can't find the link anymore). I keep getting an error when running through the pipeline: I've loaded the data, converted the feature types, and set up the pipeline stages for the categorical and numerical features. I'd love feedback on any part of the code, but especially on the error so I can continue with this build-out. Thank you in advance!
Sample Data
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
| id|age| job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|deposit|
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
| 1| 59| admin.|married|secondary| no| 2343| yes| no|unknown| 5| may| 1042| 1| -1| 0| unknown| yes|
| 2| 56| admin.|married|secondary| no| 45| no| no|unknown| 5| may| 1467| 1| -1| 0| unknown| yes|
| 3| 41|technician|married|secondary| no| 1270| yes| no|unknown| 5| may| 1389| 1| -1| 0| unknown| yes|
| 4| 55| services|married|secondary| no| 2476| yes| no|unknown| 5| may| 579| 1| -1| 0| unknown| yes|
| 5| 54| admin.|married| tertiary| no| 184| no| no|unknown| 5| may| 673| 2| -1| 0| unknown| yes|
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
only showing top 5 rows
# Convert Feature Types
df.createOrReplaceTempView("df")
df2 = spark.sql("""
    select
        cast(id as int) as id,
        cast(age as int) as age,
        cast(job as string) as job,
        cast(marital as string) as marital,
        cast(education as string) as education,
        cast(default as string) as default,
        cast(balance as int) as balance,
        cast(housing as string) as housing,
        cast(loan as string) as loan,
        cast(contact as string) as contact,
        cast(day as int) as day,
        cast(month as string) as month,
        cast(duration as int) as duration,
        cast(campaign as int) as campaign,
        cast(pdays as int) as pdays,
        cast(previous as int) as previous,
        cast(poutcome as string) as poutcome,
        cast(deposit as string) as deposit
    from df
""")
# Data Types
df2.dtypes
[('id', 'int'),
('age', 'int'),
('job', 'string'),
('marital', 'string'),
('education', 'string'),
('default', 'string'),
('balance', 'int'),
('housing', 'string'),
('loan', 'string'),
('contact', 'string'),
('day', 'int'),
('month', 'string'),
('duration', 'int'),
('campaign', 'int'),
('pdays', 'int'),
('previous', 'int'),
('poutcome', 'string'),
('deposit', 'string')]
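As a side note on the casting step (not part of the error): since every column gets the same `cast(x as T) as x` treatment, the expressions can be generated instead of written by hand. This is just a sketch; `int_cols` and `str_cols` are illustrative names, not from the tutorial.

```python
# Sketch: build the cast expressions programmatically instead of one long SQL string.
int_cols = ["id", "age", "balance", "day", "duration", "campaign", "pdays", "previous"]
str_cols = ["job", "marital", "education", "default", "housing", "loan",
            "contact", "month", "poutcome", "deposit"]

cast_exprs = ([f"cast({c} as int) as {c}" for c in int_cols] +
              [f"cast({c} as string) as {c}" for c in str_cols])

# Equivalent to the spark.sql(...) call above:
# df2 = df.selectExpr(*cast_exprs)
```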
# Build Pipeline (Error is Here)
categorical_cols = ["job","marital","education","default","housing","loan","contact","month","poutcome"]
numeric_cols = ["age", "balance", "day", "duration", "campaign", "pdays","previous"]
stages = []
stringIndexer = StringIndexer(inputCol=[cols for cols in categorical_cols],
outputCol=[cols + "_index" for cols in categorical_cols])
encoder = OneHotEncoderEstimator(inputCols=[cols + "_index" for cols in categorical_cols],
outputCols=[cols + "_classVec" for cols in categorical_cols])
stages += [stringIndexer, encoder]
label_string_id = StringIndexer(inputCol="deposit", outputCol="label")
stages += [label_string_id]
assembler_inputs = [cols + "_classVec" for cols in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
stages += [assembler]
# Run Data Through Pipeline
pipeline = Pipeline().setStages(stages)
pipeline_model = pipeline.fit(df2)
prepped_df = pipeline_model.transform(df2)
Error
"TypeError: Invalid param value given for param "inputCols". Could not convert job_index to list of strings"