0

I feel like I'm very close to a solution here but can't seem to figure out why I'm not getting any result. I have an html page and I'm trying to parse out some IDs from it. I'm 99% certain my regex code is right, but for some reason I'm not getting any output.

In the html source, there are many ids that are wrapped with text like: /boardgame/9999/asdf. My regex code should pull out the /9999/ bit, but I can't figure out why it's just returning the same input html character string that I put in.

library(RCurl)
library(XML)
url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.parse <- sub("boardgame(.*?)[a-z]", "\\1", html)

Any thoughts?

1

1 Answer 1

1

I think your pattern was not accurate. In this case, you were picking up also other words, starting with "boardgames,". This should work for one single ID.

id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)

In my hands, it returns:

[1] "226501"

Also, I found many IDs in this html page. To catch them all in one list, you could do as follows.

url <- sprintf("https://boardgamegeek.com/browse/boardgame/page/1")
html <- getURL(url, followlocation = TRUE)
id.list <- list()
while (regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html) > 0) {
  id.pos <- regexpr("boardgame/[[:digit:]]{3,10}/[a-z]", html)
  my.id <- substr(html, id.pos, id.pos + attributes(id.pos)$match.length)
  id.list[[(length(id.list) + 1)]] <- gsub("(^[[:alpha:]]*/)|(/[[:alpha:]]*$)", "", my.id)
  html <- substr(html, (id.pos + attributes(id.pos)$match.length), nchar(html))
}
id.list
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.