I have a data frame df with some URLs. Each URL contains subcategories between the slashes, which I want to extract with stringr's str_extract.

My data looks like this:

Text         URL
Hello        www.facebook.com/group1/bla/exy/1234
Test         www.facebook.com/group2/fssas/eda/1234
Text         www.facebook.com/group-sdja/sdsds/adeds/23234
Texter       www.facebook.com/blablabla/sdksds/sdsad

I now want to extract everything between .com/ and the next /.

I tried suburlpattern <- "^.com//{1,20}//$" and df$categories <- str_extract(df$URL, suburlpattern)

But I only end up with NA in df$categories

Any idea what I am doing wrong here? Is it my regex code?

Any help is highly appreciated! Many thanks beforehand.

  • ^ in a regex pattern means it only matches at the beginning of the string. Since .com isn't at the start of the URL, your pattern won't match. You probably don't need the ^. Commented Dec 20, 2016 at 23:04
  • Thanks Amber, but unfortunately it still only gives me NAs... Any other idea? Commented Dec 20, 2016 at 23:17
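
To make the exchange above concrete, here is a minimal sketch of why the original pattern never matches, with or without the ^. The URLs are copied from the question; the variable names original and no_anchor are my own.

```r
library(stringr)

urls <- c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/blablabla/sdksds/sdsad")

# Original pattern: "^" anchors at the start of the string, "." matches any
# single character, "/{1,20}" repeats the slash itself (not the text between
# slashes), and "$" anchors at the end of the string.
original <- str_extract(urls, "^.com//{1,20}//$")

# Dropping "^" alone does not help: the pattern still demands "com" followed
# only by slashes up to the end of the string.
no_anchor <- str_extract(urls, ".com//{1,20}//$")

print(original)
print(no_anchor)
```

Both calls return NA for every URL, because the quantifier applies to the slash rather than to the characters between two slashes.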

3 Answers

If you want to use str_extract, you need a regex that will get the value you need into the whole match, and you will need a (?<=[.]com/) lookbehind:

(?<=[.]com/)[^/]+

See the regex demo.

Details:

  • (?<=[.]com/) - the current location must be preceded by the .com/ substring
  • [^/]+ - matches 1 or more characters other than /.

R demo:

> URL = c("www.facebook.com/group1/bla/exy/1234", "www.facebook.com/group2/fssas/eda/1234","www.facebook.com/group-sdja/sdsds/adeds/23234", "www.facebook.com/blablabla/sdksds/sdsad")
> df <- data.frame(URL)
> library(stringr)
> res <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
> res
[1] "group1"     "group2"     "group-sdja" "blablabla"
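
Since the question stores the result in df$categories, here is a short sketch of that last step, assuming the same lookbehind pattern. The second and third URLs are made up for illustration; they suggest that a URL with nothing after .com/ yields NA rather than an error.

```r
library(stringr)

df <- data.frame(
  URL = c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/group2",   # hypothetical: single path segment
          "www.facebook.com"),         # hypothetical: no path at all
  stringsAsFactors = FALSE
)

# The lookbehind is not part of the match, so only the segment itself is kept.
df$categories <- str_extract(df$URL, "(?<=[.]com/)[^/]+")
print(df)
```

Because [^/]+ stops at the next slash or at the end of the string, a trailing slash after the category is not required.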

This will return everything between the first and second forward slashes:

library(stringr)
str_match("www.facebook.com/blablabla/sdksds/sdsad", "^[^/]+/(.+?)/")[2]

[1] "blablabla"
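
To apply the same idea to a whole column, one option is to take the capture-group column of the matrix that str_match returns. A minimal sketch, using the URLs from the question (the names urls and categories are my own):

```r
library(stringr)

urls <- c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

# str_match returns a character matrix: column 1 holds the full match,
# column 2 holds the first capture group.
categories <- str_match(urls, "^[^/]+/(.+?)/")[, 2]
print(categories)
```

Note that this pattern requires a slash after the category, so a URL such as www.facebook.com/group1 with no further path would give NA.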

1 Comment

You may replace all \\/ with / as the forward slash is not a special regex metacharacter.

This works:

library(stringr)
data <- c("www.facebook.com/group1/bla/exy/1234", 
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

suburlpattern <- "/(.*?)/" 
categories <- str_extract(data, suburlpattern)
str_sub(categories, start = 2, end = -2)

Results:

[1] "group1"     "group2"     "group-sdja" "blablabla"

This only gets you what's between the first and second slashes, but that seems to be what you want.
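
If stringr is not available, a similar result can be obtained in base R by splitting on the slashes and taking the second piece. This is a different technique from the one above, sketched under the assumption that the URLs carry no http:// prefix (which would add extra slashes before the host name):

```r
data <- c("www.facebook.com/group1/bla/exy/1234",
          "www.facebook.com/group2/fssas/eda/1234",
          "www.facebook.com/group-sdja/sdsds/adeds/23234",
          "www.facebook.com/blablabla/sdksds/sdsad")

# Split each URL on literal slashes; piece 1 is the host, piece 2 the category.
parts <- strsplit(data, "/", fixed = TRUE)

# p[2] is NA when a URL has no second piece, so short URLs do not raise an error.
categories <- vapply(parts, function(p) p[2], character(1))
print(categories)
```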
