Scala Spark filter rows in DataFrame with substring and character

Question

I want to filter some rows in my DF, keeping rows where a column starts with "startSubString" and does not contain the character '#'.

I can do what I want with two filters:

.filter( !col("theCol").contains("#"))
.filter( col("theCol").startsWith("startSubString"))

But I was wondering whether it could be done in a single filter for better performance:

Something like:

.filter(col("theCol").rlike("^(startSubString).*^[^@]"))

This does not work, though. What am I missing?

You can always use ||: .filter( _!= col("theCol").contains("#") || col("theCol").startsWith("http")). Doesn't that work?
– Anahcolus, Commented Dec 2, 2017 at 2:48
I would leave it as it is; I think it's more readable than one huge logical expression. Spark's optimizer will combine the filters anyway, so I don't think you gain any performance.
– Raphael Roth, Commented Dec 2, 2017 at 16:27
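
One way to check the point about the optimizer yourself is to print the query plans for both versions and compare them. The following is only a minimal sketch: the local session, the object name CombinedFilterSketch, and the sample rows are assumptions made for illustration, and only the column name "theCol" comes from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object CombinedFilterSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data; "theCol" is the column name used in the question.
    val df = Seq("startSubString/a", "startSubString/a#b", "other").toDF("theCol")

    // Two chained filters, as in the question.
    val chained = df
      .filter(!col("theCol").contains("#"))
      .filter(col("theCol").startsWith("startSubString"))

    // One filter with both conditions joined by &&.
    val single = df.filter(
      !col("theCol").contains("#") && col("theCol").startsWith("startSubString"))

    // Print the parsed, analyzed, optimized and physical plans for each version.
    chained.explain(true)
    single.explain(true)

    spark.stop()
  }
}
```

If the optimized plans look the same for both, the choice between one filter and two is purely a readability question.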

uh_big_mike_boi · Accepted Answer · 2017-12-02 03:30:27Z

3

I use substr() all the time, but I don't see why startsWith() wouldn't work either. Here is what I did:

.filter( (!(col("theCol").contains("#"))) && (col("theCol").substr(1,4) === ("http")))

Alg_D · Accepted Answer · 2017-12-02 13:46:38Z

1

You can use startsWith():

.filter( !col("theCol").contains("#") && col("theCol").startsWith("startSubString") )
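
On the regex attempt in the question: a `^` outside a character class is a start-of-input anchor, so the second `^` can never match after `.*` has consumed characters, and `[^@]` excludes '@' rather than '#'. A pattern along the lines of `^startSubString[^#]*$` expresses both conditions, and it should be usable as `col("theCol").rlike("^startSubString[^#]*$")`. The sketch below checks that pattern in plain Scala with java.util.regex, which is the regex engine rlike relies on. The object name RlikeSketch and the sample strings are made up for illustration, and the pattern assumes the prefix contains no regex metacharacters (otherwise it would need Pattern.quote).

```scala
import java.util.regex.Pattern

object RlikeSketch {
  // Anchored at the start, then only non-'#' characters up to the end.
  val pattern: Pattern = Pattern.compile("^startSubString[^#]*$")

  // find() looks for a match anywhere in the input, which is how rlike
  // applies a pattern; the anchors pin it to the whole string here.
  def keep(s: String): Boolean = pattern.matcher(s).find()

  def main(args: Array[String]): Unit = {
    // Hypothetical sample values, not taken from the question.
    val samples = Seq("startSubString/a", "startSubString/a#b", "other/startSubString")
    samples.foreach(s => println(s"$s -> ${keep(s)}"))
  }
}
```

Whether this is any faster than the two Column predicates is not something the sketch shows; it only illustrates that a single pattern can encode both conditions.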
