I ran an advanced search on a website and got some results. For each result I want to extract two fields, "Referencia:" and "CIF".

# This is the URL with the results of the search
url <- paste0(
  "http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF",
  "&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013",
  "&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar"
)

# This is the URL of one of the results
example <- "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895"

The CIF field usually has the form X00000000 or X-00000000, where X is "A" or "B" and each 0 is a digit from 0 to 9. In the example, the Referencia field is BOE-B-2013-15895 and the CIF is B-32210196.

Could you help me do this in R?

  • Check out the XML library in R. Commented Apr 25, 2013 at 20:22
  • Also, can you add more info, like an example table you would like to appear in R? Commented Apr 25, 2013 at 20:24
  • @Green Demon Thanks for the package. The example table is pasted above and also on the link above. It is just the box labeled "Datos generales del concurso". Commented Apr 26, 2013 at 6:11
  • Hmm, that's hard to do in R, especially since the link addresses seem to be randomized or encoded. From my perspective, you may have to do it in VBA or HTML. Someone who is better with R may have a different answer. Commented Apr 26, 2013 at 13:44
  • @Green Demon Hi, I have found similar information on another website, this time without randomized link addresses. Could you please have a look and let me know what you think? Commented Apr 28, 2013 at 11:20

2 Answers

To grab the content, check out the httr package. You could use something like

content(GET(url))
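To retrieve one piece of the data rather than the whole page, the downloaded HTML can be parsed and queried with XPath. The following is only a minimal sketch: the helper names `fetch_html` and `extract_result_links` are made up for illustration, the class name is the one used in the other answer, and `sample_html` is an invented snippet standing in for the real results page, so the XPath may need adjusting against the live site.

```r
library(XML)

# Download a page and return its body as one character string.
# (Hypothetical helper; needs the httr package and network access.)
fetch_html <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Return the href of every element carrying the result-link class.
# (Hypothetical helper; the class name is taken from the other answer.)
extract_result_links <- function(html_text) {
  doc <- htmlParse(html_text, asText = TRUE)
  hrefs <- xpathSApply(
    doc,
    "//*[contains(@class, 'resultado-busqueda-link-defecto')]",
    xmlGetAttr, "href"
  )
  as.character(unlist(hrefs))
}

# Invented snippet that mimics a results page (assumption, not real markup).
sample_html <- '<html><body><ul>
  <li><a class="resultado-busqueda-link-defecto" href="../buscar/doc.php?id=BOE-B-2013-15895">Result</a></li>
  <li><a class="otro" href="../ayuda.php">Help</a></li>
</ul></body></html>'

links <- extract_result_links(sample_html)
print(links)
```

Against the real site, the call would be `extract_result_links(fetch_html(url))`.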

1 Comment

@Jeff Allen Thanks. I get lots of code with this command. Maybe you could provide a simple example that retrieves a piece of the data.

1) Getting Referencia is a piece of cake:

substrRight <- function(x, n){
  sapply(x, function(xx)
  substr(xx, (nchar(xx)-n+1), nchar(xx)))
}

library(XML)
u <- "http://www.boe.es/buscar/boe.php?campo%5B1%5D=DOC&dato%5B1%5D=edicto+auto+declaracion+concurso+CIF%20&campo%5B6%5D=FPU&dato%5B6%5D%5B0%5D=25%2F04%2F2013&dato%5B6%5D%5B1%5D=30%2F04%2F2013%20&sort_field%5B0%5D=fpu&sort_order%5B0%5D=desc&sort_field%5B1%5D=ref&sort_order%5B1%5D=asc&accion=Buscar" # search results URL
doc1 <- htmlParse(u) # download and parse the HTML
kbbRoot <- xmlRoot(doc1) # root node of the parsed document
els <- getNodeSet(kbbRoot, "//*[contains(concat( ' ', @class, ' ' ), concat( ' ', 'resultado-busqueda-link-defecto', ' ' ))]") # all result links, selected by XPath
links <- sapply(els, function(el) xmlGetAttr(el, "href")) # relative hrefs (they start with ../)
links <- sapply(links, function(x) substr(x, start = 3, stop = nchar(x))) # drop the leading ..
links <- sapply(links, function(x) paste("http://www.boe.es", x, sep = "")) # build the absolute link
Referencia <- sapply(links, function(x) substrRight(x, 16)) # the Referencia is the last 16 characters of each link
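Taking the last 16 characters works only while every identifier has the same length. As a hedged alternative, the sketch below reads the value of the `id=` query parameter instead. It assumes the links look like the example in the question, and `get_referencia` is a hypothetical helper name, not part of any package.

```r
# Pull the value of the id= query parameter out of each link.
# Returns NA for links that carry no id= parameter.
get_referencia <- function(links) {
  # Look-behind keeps "id=" out of the match; stop at the next & or #.
  m <- regexpr("(?<=[?&]id=)[^&#]+", links, perl = TRUE)
  out <- rep(NA_character_, length(links))
  out[as.vector(m) > 0] <- regmatches(links, m)
  out
}

refs <- get_referencia(c(
  "http://www.boe.es/buscar/doc.php?id=BOE-B-2013-15895",
  "http://www.boe.es/buscar/boe.php"
))
print(refs)
```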

2) CIF is much more complicated. You have to use regular expressions. Unfortunately, I am not strong in them, so ask somebody else on the forum: "Which regular expression should be used to obtain the CIF value from the string?"

CIFRA <- function(u) {
  doc1 <- htmlParse(u) # get html
  kbbRoot <- xmlRoot(doc1) # root node of the parsed document
  els <- getNodeSet(kbbRoot, "//*[contains(concat('', @class,''), concat('', 'parrafo', '' ))]") # select the text paragraphs
  l <- sapply(els, xmlValue) # text of each paragraph
  x <- regexpr(pattern = "[A-Z][0-9]+", text = l) # try to find the CIF with a regular expression
  # regexpr returns the match position in each string (-1 if there is no match)
  ind <- which.max(x) # index of the paragraph taken to contain the CIF
  st <- x[ind] - 3 # start position
  en <- st + attr(x, "match.length")[ind] - 1 # end position
  res <- substring(l[ind], st, en) # text between start and end
  res
}

CIF<-sapply(links, function(x) CIFRA(x))
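For the regular-expression step, here is a minimal sketch, assuming the CIF always follows the form given in the question: the letter A or B, an optional hyphen, then exactly eight digits. `extract_cif` is a hypothetical helper and the sample sentences are invented; CIFs with other letters or separators would need a wider pattern.

```r
# Return the first CIF-like token in each string, or NA when none is found.
# Pattern (assumption from the question): A or B, optional hyphen, 8 digits.
extract_cif <- function(text) {
  m <- regexpr("\\b[AB]-?[0-9]{8}\\b", text, perl = TRUE)
  out <- rep(NA_character_, length(text))
  out[as.vector(m) > 0] <- regmatches(text, m)
  out
}

# Invented sentences for illustration; only the CIF value comes from the question.
paragraphs <- c(
  "Deudor: Ejemplo S.L., con CIF B-32210196.",
  "Texto sin identificador, referencia BOE-B-2013-15895."
)
cifs <- extract_cif(paragraphs)
print(cifs)
```

Inside `CIFRA`, this could be applied to the paragraph texts in `l`, keeping the first non-NA value.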
