I have an RDD of strings, and I want to remove every substring that matches the regular expression "RT\s*@USER\w\w{8}:\s*" from it.
My current RDD is:
text = sc.textFile(...)
delimited = text.map(lambda x: x.split("\t"))
This is the part where I try to remove the matches. I tried the following RDD transformations to get rid of every substring that matches the regular expression, but each of them gave me an error.
abc = delimited.map(lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", x))
TypeError: expected string or buffer
and
abc = re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", delimited)
TypeError: expected string or buffer
and
abc = delimited.map(lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", text))
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I want to remove the text matching this regular expression so that I can proceed to the next RDD transformations. How do I write this in PySpark?
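For reference, the first TypeError appears to come from the fact that after split("\t") each x is a list of strings, while re.sub expects a single string. A minimal sketch of one possible fix, assuming the pattern should be stripped from every tab-separated field; the names RT_PREFIX and strip_rt_prefix are hypothetical, and the Spark call is left as a comment so the snippet runs without a cluster:

```python
import re

# Compile once so each record reuses the same pattern object.
# RT_PREFIX and strip_rt_prefix are illustrative names, not part of any API.
RT_PREFIX = re.compile(r"RT\s*@USER\w\w{8}:\s*")

def strip_rt_prefix(fields):
    """Apply the substitution to each string in one split record (a list)."""
    return [RT_PREFIX.sub(" ", field) for field in fields]

# Assumed usage with the RDD from the question:
# abc = delimited.map(strip_rt_prefix)

# Local check on a made-up record, no Spark required:
sample = ["42", "RT @USER_abcdefgh: some text"]
cleaned = strip_rt_prefix(sample)
```

An alternative would be to run re.sub on each raw line of text before splitting, since at that point every element is still a single string.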