How to remove text matching a regular expression in PySpark using an RDD?

Question

Suppose I want to remove every substring that matches the regular expression "RT\s*@USER\w\w{8}:\s*" from the data in my RDD.

My current RDD is:

import re

text = sc.textFile(...)
delimited = text.map(lambda x: x.split("\t"))

Here is the part where I try to remove the matches. I tried the following RDD transformations to strip every substring that matches this regular expression, but each of them raised an error.

abc = delimited.map(lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", x))
TypeError: expected string or buffer

and

abc = re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", delimited)
TypeError: expected string or buffer

and

abc = delimited.map(lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", text))
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I want to remove these matches so that I can proceed to the next RDD transformations. How do I write this in PySpark?

MaFF · Accepted Answer · 2017-10-29 18:10:47Z


re.sub expects a string as its third argument.

In the first anonymous function:
```
lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", x)
```
x is a list, because you split each line in the previous transformation.
In the second attempt, you pass an RDD: delimited.
In the third snippet, you pass another RDD, text, and referencing an RDD inside a transformation is what triggers the SPARK-5063 exception.
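
The first failure can be reproduced without Spark. The snippet below is a minimal sketch that uses a made-up sample row (the values are hypothetical, not taken from the question's data) to show that re.sub raises a TypeError when its third argument is a list rather than a string:

```python
import re

# Pattern copied from the question.
pattern = r"RT\s*@USER\w\w{8}:\s*"

# Hypothetical sample line, split on tabs the same way `delimited` is built.
fields = "42\tRT @USER_abcdefgh: hi".split("\t")

try:
    re.sub(pattern, " ", fields)  # third argument is a list, not a str
    raised = False
except TypeError:
    raised = True

print(raised)
```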

If you want to remove matches of this regular expression from every element of each list, apply re.sub to each element:

abc = delimited.map(lambda l: [re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", x) for x in l])
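
To check the per-element logic outside a cluster, the same list comprehension can be pulled into a named function and run on a plain Python list. This is only a sketch: the function name strip_rt_prefix, the constant RT_PREFIX, and the sample row are assumptions made for illustration, and the pattern is compiled once purely as a convenience.

```python
import re

# Pattern copied from the question, compiled once so it is not re-parsed per field.
RT_PREFIX = re.compile(r"RT\s*@USER\w\w{8}:\s*")

def strip_rt_prefix(fields):
    """Return a new list with each match replaced by a single space."""
    return [RT_PREFIX.sub(" ", field) for field in fields]

# Hypothetical sample row, standing in for one element of `delimited`.
row = "42\tRT @USER_abcdefgh: hello world".split("\t")
cleaned = strip_rt_prefix(row)
print(cleaned)

# With a live SparkContext, the function would be passed to map like this:
# abc = delimited.map(strip_rt_prefix)
```

Because the replacement string is a single space, a field that starts with a match keeps one leading space. Pass "" instead if that space is unwanted.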
