
I have a PySpark dataframe with one of the columns (features) being a sparse vector. For example:

+------------------+-----+
|     features     |label|
+------------------+-----+
| (4823,[87],[0.0])|  0.0|
| (4823,[31],[2.0])|  0.0|
|(4823,[159],[0.0])|  1.0|
|  (4823,[1],[7.0])|  0.0|
|(4823,[15],[27.0])|  0.0|
+------------------+-----+

I would like to expand the features column and to add another feature to it, for example:

+-------------------+-----+
|     features      |label|
+-------------------+-----+
| (4824,[87],[0.0]) |  0.0|
| (4824,[31],[2.0]) |  0.0|
|(4824,[159],[0.0]) |  1.0|
|  (4824,[1],[7.0]) |  0.0|
|(4824,[4823],[7.0])|  0.0|
+-------------------+-----+

Is there a way to do this without unpacking the SparseVector to dense and then repacking it to sparse with the new column?

1 Answer

Adding a new column to an existing SparseVector is most easily done with the VectorAssembler transformer from the ML library. It automatically combines the input columns into a single vector column (a DenseVector or a SparseVector, whichever uses the least memory). VectorAssembler does not convert sparse vectors to dense during the merge (see the source code). It can be used as follows:

from pyspark.ml.feature import VectorAssembler

df = ...

# The output column must not already exist, so give it a new name
assembler = VectorAssembler(
    inputCols=["features", "new_col"],
    outputCol="features_merged")

output = assembler.transform(df)
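The merge logic can be sketched in plain Python, without a Spark installation. This is a toy illustration, not Spark code: it assumes a sparse vector is just a (size, indices, values) tuple, and shows why no dense array is ever needed when concatenating two of them.

```python
# Toy illustration (plain Python, not Spark) of merging two sparse
# vectors, each represented as a (size, indices, values) tuple,
# without densifying either one.
def merge_sparse(a, b):
    size_a, idx_a, val_a = a
    size_b, idx_b, val_b = b
    # Indices of the second vector are shifted by the size of the first;
    # only the stored non-zero entries are touched.
    merged_idx = list(idx_a) + [size_a + i for i in idx_b]
    merged_val = list(val_a) + list(val_b)
    return (size_a + size_b, merged_idx, merged_val)

features = (4823, [31], [2.0])  # mirrors (4823,[31],[2.0]) from the table
new_col = (1, [0], [7.0])       # a one-element vector holding the new feature
print(merge_sparse(features, new_col))  # → (4824, [31, 4823], [2.0, 7.0])
```

The cost is proportional to the number of stored entries, not the vector size, which is exactly why avoiding the dense round-trip matters for a 4823-dimensional vector with a handful of non-zeros.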

To simply increase the size of a SparseVector, without adding any new values, just create a new vector with a larger size:

from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import col, udf

def add_empty_col_(v):
    # Reuse the existing indices and values; only the declared size grows
    return SparseVector(v.size + 1, v.indices, v.values)

add_empty_col = udf(add_empty_col_, VectorUDT())
df.withColumn("sparse", add_empty_col(col("features")))
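The same idea in plain Python (a toy sketch using a (size, indices, values) tuple, so it runs without Spark): growing the size adds an implicit zero dimension, so the stored indices and values carry over unchanged.

```python
# Toy illustration (plain Python) of growing a sparse vector's size
# without touching its stored entries.
def add_empty_dim(size, indices, values):
    # The new trailing dimension is implicitly zero, so indices and
    # values are reused as-is; only the size field changes.
    return (size + 1, indices, values)

print(add_empty_dim(4823, [15], [27.0]))  # → (4824, [15], [27.0])
```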