
Getting started with pyspark.ml and the Pipelines API, I find myself writing custom Transformers for typical preprocessing tasks so that I can use them in a pipeline. Examples:

from pyspark.ml import Pipeline, Transformer


class CustomTransformer(Transformer):
    # lazy workaround - a Transformer needs to have these attributes
    _defaultParamMap = dict()
    _paramMap = dict()
    _params = dict()


class ColumnSelector(CustomTransformer):
    """Transformer that selects a subset of columns - to be used as a pipeline stage."""

    def __init__(self, columns):
        self.columns = columns

    def _transform(self, data):
        return data.select(self.columns)


class ColumnRenamer(CustomTransformer):
    """Transformer that renames one column."""

    def __init__(self, rename):
        self.rename = rename

    def _transform(self, data):
        colNameBefore, colNameAfter = self.rename
        return data.withColumnRenamed(colNameBefore, colNameAfter)


class NaDropper(CustomTransformer):
    """Transformer that drops rows containing at least one null value."""

    def __init__(self, cols=None):
        self.cols = cols

    def _transform(self, data):
        return data.dropna(subset=self.cols)


class ColumnCaster(CustomTransformer):
    """Transformer that casts one column to a given type."""

    def __init__(self, col, toType):
        self.col = col
        self.toType = toType

    def _transform(self, data):
        return data.withColumn(self.col, data[self.col].cast(self.toType))

They work, but I am wondering whether this is a pattern or an antipattern: are such transformers a good way to work with the Pipeline API? Did I need to implement them myself, or is equivalent functionality provided elsewhere?
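For reference, this is roughly how I chain them (a minimal sketch; the column names and types are just placeholders, and it assumes a DataFrame `df` already exists):

from pyspark.ml import Pipeline

# hypothetical DataFrame with columns: id, price, label
pipeline = Pipeline(stages=[
    ColumnSelector(["id", "price", "label"]),
    NaDropper(cols=["price", "label"]),
    ColumnCaster("price", "double"),
    ColumnRenamer(("label", "target")),
])

# with only Transformers as stages, fit() just collects them
# and transform() applies them in order
result = pipeline.fit(df).transform(df)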

  • How do you call the custom transformers? Commented May 16, 2019 at 16:05

1 Answer


I'd say it is primarily opinion-based, although this looks unnecessarily verbose, and Python Transformers don't integrate well with the rest of the Pipeline API.

It is also worth pointing out that everything you have here can be easily achieved with SQLTransformer. For example:

from pyspark.ml.feature import SQLTransformer

def column_selector(columns):
    return SQLTransformer(
        statement="SELECT {} FROM __THIS__".format(", ".join(columns))
    )

or

def na_dropper(columns):
    return SQLTransformer(
        statement="SELECT * FROM __THIS__ WHERE {}".format(
            " AND ".join(["{} IS NOT NULL".format(x) for x in columns])
        )
    )
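Either of these drops straight into a Pipeline like any other stage, for example (assuming a DataFrame df with columns x and y):

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[
    column_selector(["x", "y"]),  # SELECT x, y FROM __THIS__
    na_dropper(["x", "y"]),       # SELECT * FROM __THIS__ WHERE x IS NOT NULL AND y IS NOT NULL
])
result = pipeline.fit(df).transform(df)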

With a little bit of effort you can use SQLAlchemy with the Hive dialect to avoid handwritten SQL.
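A rough, untested sketch of that idea, building the same na_dropper with SQLAlchemy instead of string formatting (assumes the PyHive package, which ships a SQLAlchemy Hive dialect, and a SQLAlchemy 1.4+ API):

from pyspark.ml.feature import SQLTransformer
from pyhive.sqlalchemy_hive import HiveDialect  # assumes PyHive is installed
from sqlalchemy import column, select, table, text


def na_dropper(columns):
    cols = [column(c) for c in columns]
    stmt = (
        select(text("*"))                           # SELECT *
        .select_from(table("__THIS__"))             # __THIS__ is SQLTransformer's placeholder
        .where(*[c.isnot(None) for c in cols])      # x IS NOT NULL AND y IS NOT NULL ...
    )
    # compile against the Hive dialect instead of concatenating SQL by hand
    return SQLTransformer(statement=str(stmt.compile(dialect=HiveDialect())))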


3 Comments

Could you elaborate on "Python Transformers don't integrate well with the rest of the Pipeline API"?
By default they are not MLWritable (although there are nice hacks).
Well, SQL is not the elegant alternative I had hoped for, but good answer nonetheless -> accept.
