
I have a PySpark dataframe with one of the columns (features) being a sparse vector. For example:

+------------------+-----+
|     features     |label|
+------------------+-----+
| (4823,[87],[0.0])|  0.0|
| (4823,[31],[2.0])|  0.0|
|(4823,[159],[0.0])|  1.0|
|  (4823,[1],[7.0])|  0.0|
|(4823,[15],[27.0])|  0.0|
+------------------+-----+

I would like to expand the features column and to add another feature to it, for example:

+-------------------+-----+
|     features      |label|
+-------------------+-----+
| (4824,[87],[0.0]) |  0.0|
| (4824,[31],[2.0]) |  0.0|
|(4824,[159],[0.0]) |  1.0|
|  (4824,[1],[7.0]) |  0.0|
|(4824,[4823],[7.0])|  0.0|
+-------------------+-----+

Is there a way to do this without unpacking the SparseVector to dense and then repacking it to sparse with the new column?

1 Answer

Adding a new column to an existing SparseVector is most easily done with the VectorAssembler transformer from the ML library. It automatically combines the input columns into a single vector column (a DenseVector or a SparseVector, whichever uses the least memory). VectorAssembler does not convert sparse vectors to dense during the merge (see the source code). It can be used as follows:

from pyspark.ml.feature import VectorAssembler

df = ...

# The output column must not already exist, so give it a new name
assembler = VectorAssembler(
    inputCols=["features", "new_col"],
    outputCol="features_merged")

output = assembler.transform(df)
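The merge logic can be sketched in plain Python, without a Spark installation. This is a toy illustration, not Spark code: it assumes a sparse vector is just a (size, indices, values) tuple, and shows why no dense array is ever needed when concatenating two of them.

```python
# Toy illustration (plain Python, not Spark) of merging two sparse
# vectors, each represented as a (size, indices, values) tuple,
# without densifying either one.
def merge_sparse(a, b):
    size_a, idx_a, val_a = a
    size_b, idx_b, val_b = b
    # Indices of the second vector are shifted by the size of the first;
    # only the stored non-zero entries are touched.
    merged_idx = list(idx_a) + [size_a + i for i in idx_b]
    merged_val = list(val_a) + list(val_b)
    return (size_a + size_b, merged_idx, merged_val)

features = (4823, [31], [2.0])  # mirrors (4823,[31],[2.0]) from the table
new_col = (1, [0], [7.0])       # a one-element vector holding the new feature
print(merge_sparse(features, new_col))  # → (4824, [31, 4823], [2.0, 7.0])
```

The cost is proportional to the number of stored entries, not the vector size, which is exactly why avoiding the dense round-trip matters for a 4823-dimensional vector with a handful of non-zeros.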

To simply increase the size of a SparseVector, without adding any new values, just create a new vector with a larger size:

from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql.functions import col, udf

def add_empty_col_(v):
    # Reuse the existing indices and values; only the declared size grows
    return SparseVector(v.size + 1, v.indices, v.values)

add_empty_col = udf(add_empty_col_, VectorUDT())
df.withColumn("sparse", add_empty_col(col("features")))
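The same idea in plain Python (a toy sketch using a (size, indices, values) tuple, so it runs without Spark): growing the size adds an implicit zero dimension, so the stored indices and values carry over unchanged.

```python
# Toy illustration (plain Python) of growing a sparse vector's size
# without touching its stored entries.
def add_empty_dim(size, indices, values):
    # The new trailing dimension is implicitly zero, so indices and
    # values are reused as-is; only the size field changes.
    return (size + 1, indices, values)

print(add_empty_dim(4823, [15], [27.0]))  # → (4824, [15], [27.0])
```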