I have a classification model in Spark MLlib which was built using training data. Now I would like to use it to predict unlabeled data.

I have my features (without the labels) in LIBSVM format. Here is a sample of what my unlabeled data looks like:

1:1  18:1
4:1  32:1
2:1  8:1  33:1
1:1  6:1  11:1
1:1  2:1  8:1  28:1

I have these features saved in a text file on HDFS. How can I load them into an RDD[Vector] so that I can pass them to model.predict()?

I use Scala for coding.

Thanks.

  • map, split on space, split on : to create a sparse vector Commented Dec 2, 2015 at 18:30
  • How do you know the number of dimensions of the data from a LIBSVM file? Commented Dec 2, 2015 at 20:40
  • He'll need 2 passes over the data. Commented Dec 2, 2015 at 21:02

1 Answer

Here is a solution that assumes the indices are one-based and in ascending order.

Let's create some dummy data similar to the contents of your text file.

val data = sc.parallelize(Seq("1:1  18:1", "4:1  32:1", "2:1  8:1  33:1", "1:1  6:1  11:1", "1:1  2:1  8:1  28:1"))

We can now transform the data into a pair RDD with indices and values.

val parsed = data.map(_.trim).map { line =>
  val items = line.split(' ')
  val (indices, values) = items.filter(_.nonEmpty).map { item =>
    val indexAndValue = item.split(':')
    val index = indexAndValue(0).toInt - 1 // Convert 1-based indices to 0-based.
    val value = indexAndValue(1).toDouble
    (index, value)
  }.unzip

  (indices.toArray, values.toArray)
}
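
The parsing step above silently relies on the two stated assumptions (one-based, ascending indices). As a minimal sketch, the same per-line logic can be pulled into a plain Scala function that checks those assumptions and fails fast on malformed input. The names LibSvmLine and parseUnlabeledLine are made up for illustration and are not part of Spark.

```scala
object LibSvmLine {
  /** Parses one label-free LIBSVM line such as "1:1  18:1" into
    * zero-based indices and their values. Assumes one-based, strictly
    * ascending indices and throws IllegalArgumentException otherwise. */
  def parseUnlabeledLine(line: String): (Array[Int], Array[Double]) = {
    // Split on any run of whitespace; a blank line yields no tokens.
    val tokens = line.trim.split("\\s+").filter(_.nonEmpty)
    val (indices, values) = tokens.map { token =>
      val parts = token.split(':')
      require(parts.length == 2, s"Malformed entry: '$token'")
      (parts(0).toInt - 1, parts(1).toDouble) // Convert 1-based to 0-based.
    }.unzip
    require(indices.forall(_ >= 0), s"Indices must be one-based: '$line'")
    // Compare each index with its successor; windows shorter than 2 pass trivially.
    require(indices.sliding(2).forall(w => w.length < 2 || w(0) < w(1)),
      s"Indices must be strictly ascending: '$line'")
    (indices, values)
  }
}
```

Because the function has no Spark dependency, it can be tested locally and then applied per line, for example as data.map(LibSvmLine.parseUnlabeledLine), or on an RDD obtained from sc.textFile with the HDFS path of your file.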

Get the number of features. Because the indices are in ascending order, the last index on each line is its largest one:

val numFeatures = parsed.map { case (indices, values) => indices.lastOption.getOrElse(0) }.reduce(math.max) + 1
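
Note that this count comes only from the unlabeled data. If the highest feature index seen during training never occurs in this file, the result will be smaller than the dimension the model was trained with, and for many models (linear ones, for instance) the vector size is expected to match the training dimension. The sketch below, in plain Scala on local collections, shows one way to guard against that. FeatureCount, featureCount and the trainedSize parameter are hypothetical names introduced here, not Spark API.

```scala
object FeatureCount {
  /** Returns the smallest vector size that fits every zero-based index,
    * or `trainedSize` when the training dimension is known. In the latter
    * case the known size must cover every observed index. */
  def featureCount(rows: Seq[Array[Int]], trainedSize: Option[Int] = None): Int = {
    // Scan all indices, not only the last one, so ordering does not matter.
    val observed = rows.flatMap(_.toSeq).foldLeft(-1)((a, b) => math.max(a, b)) + 1
    trainedSize match {
      case Some(n) =>
        require(n >= observed, s"Data contains index ${observed - 1}, but the model size is $n")
        n
      case None => observed
    }
  }
}
```

If you know the number of features used in training, passing it explicitly avoids the extra pass over the data and keeps the vectors compatible with the model.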

Finally, create the sparse vectors:

import org.apache.spark.mllib.linalg.Vectors

val vectors = parsed.map { case (indices, values) => Vectors.sparse(numFeatures, indices, values) }

vectors.take(10) foreach println
// (33,[0,17],[1.0,1.0])
// (33,[3,31],[1.0,1.0])
// (33,[1,7,32],[1.0,1.0,1.0])
// (33,[0,5,10],[1.0,1.0,1.0])
// (33,[0,1,7,27],[1.0,1.0,1.0,1.0])