I am trying to execute the following lines of code, but on a much larger RDD. When `a` is very large, I get a heap size error. How can I make this work? `p` is usually small.
```scala
val p = Array("id1", "id3", "id2")
val a = sc.parallelize(Array(("id1", ("1", "1")), ("id4", ("4", "4")), ("id2", ("2", "2"))))
val f = a.filter(x => p contains x._1)
println(f.collect().mkString(";"))
```
Broadcast `p`. I don't know if this will solve your problem, as it seems to be tied to the size of `a`, but it is good practice in general:

```scala
val p_b = sc.broadcast(p)
val f = a.filter(x => p_b contains x._1)
```

> but unfortunately I get the following error:
>
> ```
> <console>:62: error: value contains is not a member of org.apache.spark.broadcast.Broadcast[Array[String]]
>        val f = a.filter(x => p_b contains x._1)
> ```

Use `p_b.value contains x._1`, because you have to dereference the broadcast variable before calling `contains`. Also, for `contains`, use a `Set`, which optimizes that operation.
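Putting the pieces together, a minimal sketch of the corrected filter (assuming an existing `SparkContext` named `sc`, as in the question) might look like:

```scala
// Use a Set rather than an Array: Set.contains is an O(1) hash lookup,
// while Array.contains is an O(n) scan per record
val p = Set("id1", "id3", "id2")

// Ship the small lookup structure to every executor once
val p_b = sc.broadcast(p)

val a = sc.parallelize(Array(("id1", ("1", "1")), ("id4", ("4", "4")), ("id2", ("2", "2"))))

// Dereference the broadcast with .value before calling contains
val f = a.filter(x => p_b.value contains x._1)
println(f.collect().mkString(";"))
```

This keeps only one copy of `p` per executor instead of one per task, and the `Set` keeps the per-record membership test cheap even as `a` grows.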