Retrieve data sequentially from a web page in R

Question

I have run an advanced search on a website and got some results. For each result I'm interested in extracting two fields, "Referencia:" and "CIF".

# This is the URL with the results of the search
url <- paste0(
  "http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF",
  "&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013",
  "&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar"
)

# This is the URL of one of the results
example <- "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895"

The CIF field is usually of the form X00000000 or X-00000000, with X in c("A","B") and each 0 a digit in 0:9. In the example, the Referencia field is BOE-B-2013-15895 and the CIF is B-32210196.

Could you help me to do it from R?

Also, can you add more info, like an example of the table you would like to appear in R?
– Green Demon, Commented Apr 25, 2013 at 20:24
@Green Demon thanks for the package. The example table is pasted above and is also in the link above. It is just the box labeled "Datos generales del concurso".
– nopeva, Commented Apr 26, 2013 at 6:11
Hmm, that's hard to do in R, especially since the link addresses seem to be randomized or encoded. From my perspective, you may have to do it in VBA or HTML. Someone who is better with R may have a different answer.
– Green Demon, Commented Apr 26, 2013 at 13:44
@Green Demon Hi, I have found similar information on another website, this time without randomized link addresses. Could you please have a look and let me know what you think?
– nopeva, Commented Apr 28, 2013 at 11:20

Jeff Allen · Accepted Answer · 2013-04-25 16:51:01Z

1

To grab the content, check out the httr package. You could use something like

library(httr)
content(GET(url))
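A small sketch of how that call could be wrapped so the page comes back as plain text that can be searched or parsed afterwards. The helper name `fetch_page_text` is made up for illustration; `GET()`, `stop_for_status()` and `content()` are httr functions, called here with the `httr::` prefix so the package does not need to be attached.

```r
# Sketch only: fetch_page_text() is a hypothetical helper, not part of httr.
fetch_page_text <- function(url) {
  resp <- httr::GET(url)        # perform the HTTP request
  httr::stop_for_status(resp)   # stop with an error on a 4xx/5xx response
  # return the body as one character string instead of a parsed document
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Usage (needs network access, so it is left commented out):
# html <- fetch_page_text("http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895")
# substr(html, 1, 200)   # look at the first 200 characters of the raw HTML
```

Working with the text form keeps the output manageable: you can inspect a short slice of it, or hand it to an HTML parser, rather than printing the whole parsed page.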


nopeva Over a year ago

@Jeff Allen Thanks. I get a lot of output with this command. Could you provide a simple example that retrieves one piece of the data?

egonomist · Accepted Answer · 2013-10-07 11:35:28Z

1) Getting Referencia is a piece of cake:

substrRight <- function(x, n){
  sapply(x, function(xx)
  substr(xx, (nchar(xx)-n+1), nchar(xx)))
}

library(XML)
u<-"http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF%20&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013%20&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar" #link
doc1 <- htmlParse(u) # fetch and parse the HTML
kbbRoot <- xmlRoot(doc1) # root node of the parsed document
els <- getNodeSet(kbbRoot, "//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'resultado-busqueda-link-defecto', ' ' ))]") # all result links, selected by class via XPath
links <- sapply(els, function(el) xmlGetAttr(el, "href")) # relative hrefs (they start with ../)
links <- sapply(links, function(x) substr(x, start=3, stop=nchar(x))) # drop the leading ..
links <- sapply(links, function(x) paste("http://www.boe.es", x, sep="")) # build the absolute URL
Referencia<-sapply(links, function(x) substrRight(x,16)) # get referencia from links
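As an alternative to taking the last 16 characters, the identifier could be read from the `id` query parameter, which keeps working if the reference length changes or another parameter follows it. This is only a sketch in base R: the first link is the one from the question, and the second, with `lang=es` appended, is made up to show the behaviour.

```r
# Base R only. The first link comes from the question; the second adds a
# made-up extra parameter to show that trailing parameters are ignored.
links <- c(
  "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895",
  "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895&lang=es"
)

# Capture everything after "id=" up to the next "&" (or the end of the string)
referencia <- sub("^.*[?&]id=([^&]+).*$", "\\1", links)
print(referencia)
```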

2) CIF is much more complicated. You have to use regular expressions. Unfortunately I am not strong with them, so ask somebody else on the forum: "Which regular expression should be used to obtain the CIF value from the string?"

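CIFRA <- function(u) {
  doc1 <- htmlParse(u) # fetch and parse the HTML
  kbbRoot <- xmlRoot(doc1) # root node of the parsed document
  els <- getNodeSet(kbbRoot, "//*[contains(concat('', @class,''), concat('', 'parrafo', '' ))]") # nodes whose class contains 'parrafo'
  l <- sapply(els, xmlValue) # text of each paragraph
  x <- regexpr(pattern="[A-Z][0-9]+", text=l) # try to find the CIF with a regular expression
  # regexpr returns the match position in each string (-1 if there is no match)
  ind <- which.max(x) # paragraph with the largest match position, taken to be the one with the CIF
  st <- x[ind]-3 # start position, 3 characters before the match
  en <- st+attr(x, "match.length")[ind]-1 # end position
  res <- substring(l[ind], st, en) # text between start and end
  res # return the extracted text
}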

CIF<-sapply(links, function(x) CIFRA(x))
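For the regular expression itself, one possible pattern follows the format given in the question (a letter A or B, an optional hyphen, then eight digits). The sketch below uses base R only; `extract_cif` is a hypothetical helper name, and the sample sentences are made up apart from the CIF B-32210196, which is the one quoted in the question.

```r
# Hypothetical helper: returns the first CIF-like token in each string,
# or NA when none is found. Assumes the input contains no NA values.
extract_cif <- function(text) {
  m <- regexpr("[AB]-?[0-9]{8}", text)   # letter A/B, optional "-", 8 digits
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)    # regmatches() returns matched strings only
  out
}

# Made-up sample text; only B-32210196 comes from the question
sample_text <- c(
  "Deudor con CIF B-32210196, con domicilio en ...",
  "Texto sin identificador",
  "CIF A12345678"
)
cifs <- extract_cif(sample_text)
print(cifs)
```

The pattern does not check what surrounds the token, so a longer digit run would still match on its first eight digits. Adding word boundaries or widening the letter class may be needed for real pages.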
