Find pattern in URL with stringr and regex

Question

I have a data frame df with some URLs. Each URL contains subcategories between its slashes, which I want to extract with stringr's str_extract.

My data looks like

Text         URL
Hello        www.facebook.com/group1/bla/exy/1234
Test         www.facebook.com/group2/fssas/eda/1234
Text         www.facebook.com/group-sdja/sdsds/adeds/23234
Texter       www.facebook.com/blablabla/sdksds/sdsad

I now want to extract everything between .com/ and the next /.

I tried suburlpattern <- "^.com//{1,20}//$" and df$categories <- str_extract(df$URL, suburlpattern)

But I only end up with NA in df$categories

Any idea what I am doing wrong here? Is it my regex code?

Any help is highly appreciated! Many thanks beforehand.

^ in a regex pattern means it only matches at the beginning of the string. Since .com isn't at the start of the URL, your pattern won't match. You probably don't need the ^.
– Amber, commented Dec 20, 2016 at 23:04
Thanks Amber, but unfortunately it still only gives me NAs... Any other idea?
– rkuebler, commented Dec 20, 2016 at 23:17
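To illustrate why removing ^ alone does not help, here is a minimal sketch that runs the original pattern, the pattern without ^, and one possible repair. The name fixed_pattern and the use of str_match with a capture group are assumptions made for illustration, not something proposed in the comments above.

```r
library(stringr)

url <- "www.facebook.com/group1/bla/exy/1234"

# Original pattern: "^" anchors the match at the start of the string,
# so it can never reach the ".com" in the middle of the URL.
with_anchor <- str_extract(url, "^.com//{1,20}//$")

# Dropping "^" is not enough: "/{1,20}" repeats the slash itself
# (not "any character"), and "$" forces the match to run to the end.
without_anchor <- str_extract(url, ".com//{1,20}//$")

# One possible repair (an assumption, not the asker's code): escape the
# dot, repeat "any non-slash character", and capture that part.
fixed_pattern <- "\\.com/([^/]{1,20})/"
category <- str_match(url, fixed_pattern)[, 2]

with_anchor     # NA
without_anchor  # NA
category        # "group1"
```

The capture group is needed here because str_extract would return the whole match, including the surrounding .com/ and /.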

Wiktor Stribiżew · Accepted Answer · 2016-12-20 23:43:39Z

2

str_extract returns the whole match, so if you want to use it, the regex must match only the value you need. A (?<=[.]com/) lookbehind achieves that:

(?<=[.]com/)[^/]+

See the regex demo.

Details:

(?<=[.]com/) - the current location must be preceded with .com/ substring
[^/]+ - matches 1 or more characters other than /.

R demo:

> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1"     "group2"     "group-sdja" "blablabla"
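As a small follow-up sketch, the same pattern can be assigned straight to df$categories, as in the question. The second and third URLs below are hypothetical inputs added to show what happens when the category is the last path segment or when there is no path at all.

```r
library(stringr)

df <- data.frame(
  URL = c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/blablabla",  # hypothetical: category is the last segment
          "www.facebook.com"),           # hypothetical: no path at all
  stringsAsFactors = FALSE
)

# The lookbehind requires ".com/" immediately before the match;
# "[^/]+" then takes everything up to the next slash or the end.
df$categories <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
df$categories
# [1] "group1"    "blablabla" NA
```

Rows without a .com/ prefix yield NA rather than an error, so they can be filtered afterwards with is.na().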


manotheshark · Answer · 2016-12-21 03:24:32Z

1

This returns everything between the first pair of forward slashes:

library(stringr)
# str_match returns a matrix; column 2 holds the first capture group
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[, 2]

[1] "blablabla"
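The call above handles a single string. A minimal sketch of the same pattern applied to the whole vector from the question follows; the variable names urls and m are chosen here for illustration. Selecting the column with [, 2] matters once there is more than one row.

```r
library(stringr)

urls <- c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

# One row per input string; column 1 is the full match,
# column 2 is the text captured by "(.+?)".
m <- str_match(urls, "^[^/]+/(.+?)/")
dim(m)
# [1] 4 2

categories <- m[, 2]
categories
# [1] "group1"     "group2"     "group-sdja" "blablabla"
```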

edited Dec 21, 2016 at 3:24

answered Dec 20, 2016 at 23:32

1 Comment

Wiktor Stribiżew Over a year ago

You may replace all \\/ with / as the forward slash is not a special regex metacharacter.

Matt S · Answer · 2016-12-20 23:45:59Z

0

This works

library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234", 
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

suburlpattern <- "/(.*?)/" 
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)

Results:

[1] "group1" "group2" "group-sdja" "blablabla"

This only returns what lies between the first and second slashes, but that seems to be what you want.
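One caveat worth sketching: the data in the question has no scheme, but if a URL started with http://, the first two slashes would be the ones after the colon. The example below uses a hypothetical second URL and a hypothetical helper first_segment that strips the scheme before applying the same pattern.

```r
library(stringr)

# Hypothetical input: the same URL without and with a scheme
urls <- c("www.facebook.com/group1/bla/exy/1234",
          "http://www.facebook.com/group1/bla/exy/1234")

# Without stripping the scheme, "/(.*?)/" matches the "//" after "http:"
raw_matches <- str_extract(urls, "/(.*?)/")
raw_matches
# [1] "/group1/" "//"

# Hypothetical helper: drop a leading http:// or https://, then
# extract the first slash-delimited segment and trim the slashes.
first_segment <- function(x) {
  x <- str_remove(x, "^https?://")
  str_sub(str_extract(x, "/(.*?)/"), start = 2, end = -2)
}

first_segment(urls)
# [1] "group1" "group1"
```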
