I want to filter rows in my DataFrame, keeping only those where a column starts with "startSubString" and does not contain the character '#'.

I can do what I want with two filters:

.filter(!col("theCol").contains("#"))
.filter(col("theCol").startsWith("startSubString"))

But I was wondering whether it could be done in a single filter for better performance, something like:

.filter(col("theCol").rlike("^(startSubString).*^[^@]"))

This does not work, though. What am I missing?

  • You can always use ||: .filter( _!= col("theCol").contains("#") || col("theCol").startsWith("http")). Doesn't that work? Commented Dec 2, 2017 at 2:48
  • I would leave it as it is; I think it's more readable than one huge logical expression. Spark's optimizer will combine the filters anyway, so I don't think you gain any performance. Commented Dec 2, 2017 at 16:27
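
One way to check the optimizer claim in the last comment is to print the query plans for both variants and compare them. The following is only a sketch: the object name FilterPlans, the sample rows, and the local SparkSession settings are assumptions made for illustration, and the exact plan text depends on the Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original question.
object FilterPlans {
  // Two chained filters, as in the question.
  def chained(df: DataFrame): DataFrame =
    df.filter(!col("theCol").contains("#"))
      .filter(col("theCol").startsWith("startSubString"))

  // The same predicate expressed as a single filter.
  def single(df: DataFrame): DataFrame =
    df.filter(!col("theCol").contains("#") && col("theCol").startsWith("startSubString"))

  def main(args: Array[String]): Unit = {
    // Assumed local session, for experimentation only.
    val spark = SparkSession.builder().master("local[1]").appName("filter-plans").getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val df = Seq("startSubString/a", "startSubString/a#b", "other").toDF("theCol")

    // Print parsed, analyzed, optimized and physical plans for both variants.
    chained(df).explain(true)
    single(df).explain(true)

    spark.stop()
  }
}
```

If the optimized and physical plans come out the same for both calls, splitting the predicate across two filters costs nothing at run time, and the choice is purely about readability.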

2 Answers

I use substr() all the time, though I don't see why startsWith() wouldn't work too. Here is what I did:

.filter(!col("theCol").contains("#") && (col("theCol").substr(1, 4) === "http"))

You can use startsWith() and combine both conditions with && in a single filter:

.filter( !col("theCol").contains("#") && col("theCol").startsWith("startSubString") )
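
As for the regex attempt in the question: the pattern has a second ^ in the middle, which cannot match once "startSubString" has been consumed (unless multiline mode is on), and its character class excludes '@' rather than '#'. A pattern that anchors the prefix and then allows only non-'#' characters up to the end should behave like the two conditions above. Here is a minimal sketch in plain Scala, with no Spark dependency; the object name RlikeSketch and the sample strings are made up, and it assumes that rlike uses Java regular expressions with partial-match (find) semantics.

```scala
import java.util.regex.Pattern

// Hypothetical helper object for illustration only.
object RlikeSketch {
  // Anchored prefix, then zero or more characters that are not '#', up to the end.
  val pattern: Pattern = Pattern.compile("^startSubString[^#]*$")

  // Assumption: find() mirrors how rlike tests a value against the pattern.
  def keep(value: String): Boolean = pattern.matcher(value).find()

  def main(args: Array[String]): Unit = {
    // Made-up sample values.
    val samples = Seq("startSubString/page", "startSubString/page#top", "other/startSubString")
    samples.foreach(s => println(s"$s -> ${keep(s)}"))
  }
}
```

Under the same assumption, the Spark version would be .filter(col("theCol").rlike("^startSubString[^#]*$")). The startsWith/contains form is probably still easier to read.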
