I have downloaded my Facebook data. It contains an .htm file with all my contacts. I would like to read it into R and create a contact.csv.

The usual structure is:

<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: [email protected]</li><li>contact: +123456789</li></ul></span></td></tr>

but some contacts may be missing the phone number:

<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: [email protected]</li></ul></span></td></tr>

while some are missing the email:

<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>

The CSV should have the structure: Firstname Lastname; email; tel number

I have tried:

library(rvest)
library(stringr)

html <- read_html("contact_info.htm")
p_nodes <- html %>% html_nodes('tr')
p_nodes_text <- p_nodes %>% html_text()
write.csv(p_nodes_text, "contact.csv")

This creates the CSV, but unfortunately it merges the names with "contact:", does not create separate columns, and does not allow "NA" for missing phone numbers or emails.

How could I enhance my code to accomplish this? Thanks

Base R has grep, and the stringr library you included has str_detect and str_extract. Regular expressions would be your most versatile solution. Commented Apr 15, 2018 at 23:42

1 Answer

You can use regular expressions (via grep) to identify the email and the telephone number:

xml1 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: [email protected]</li><li>contact: +123456789</li></ul></span></td></tr>'
xml2 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: [email protected]</li></ul></span></td></tr>'
xml3 <- '<tr><td>Firstname Lastname</td><td><span class="meta"><ul><li>contact: +123456789</li></ul></span></td></tr>'
docs <- c(xml1,xml2,xml3)

library(rvest)

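rows <- list()

for (doc in docs) {
  page <- read_html(doc)
  # The first cell of the row holds the name
  name <- page %>% html_nodes("tr td:first-child") %>% html_text()
  # Each <li> holds one "contact: ..." entry; strip the prefix
  meta <- page %>% html_nodes("span.meta li") %>% html_text()
  meta <- sub("^contact:\\s*", "", meta)
  # An entry containing "@" followed later by a dot is treated as the email
  mail <- grep(".+@.+\\..+", meta, value = TRUE)
  # An entry ending in at least six digits is treated as the phone number
  tel <- grep("[0-9]{6,}$", meta, value = TRUE)
  # Use NA when a field is missing; otherwise keep the first match
  rows[[length(rows) + 1]] <- data.frame(
    name = name,
    mail = if (length(mail) > 0) mail[1] else NA_character_,
    tel = if (length(tel) > 0) tel[1] else NA_character_,
    stringsAsFactors = FALSE
  )
}

df <- do.call(rbind, rows)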
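
To apply the same idea to a whole file rather than to separate strings, something along these lines might work. It is only a sketch: it assumes every contact sits in its own <tr> with the name in the first <td> and each detail in an <li> prefixed by "contact:". The sample HTML, the addresses and numbers in it, and the helper names first_match and parse_contact are all made up for illustration; with the real export you would pass "contact_info.htm" to read_html instead.

```r
library(rvest)

# Hypothetical sample standing in for contact_info.htm; the names,
# addresses and numbers are invented for illustration.
sample_html <- '<html><body><table>
<tr><td>Ann Example</td><td><span class="meta"><ul><li>contact: ann@example.com</li><li>contact: +123456789</li></ul></span></td></tr>
<tr><td>Bob Example</td><td><span class="meta"><ul><li>contact: bob@example.com</li></ul></span></td></tr>
<tr><td>Cy Example</td><td><span class="meta"><ul><li>contact: +987654321</li></ul></span></td></tr>
</table></body></html>'

# Return the first element of x that matches pattern, or NA if none does
first_match <- function(x, pattern) {
  hit <- grep(pattern, x, value = TRUE)
  if (length(hit) > 0) hit[1] else NA_character_
}

# Turn one <tr> node into a one-row data frame
parse_contact <- function(row) {
  name <- row %>% html_node("td") %>% html_text()
  meta <- row %>% html_nodes("li") %>% html_text()
  # Drop the "contact:" prefix and surrounding whitespace
  meta <- trimws(sub("^contact:", "", meta))
  data.frame(
    name = name,
    mail = first_match(meta, "@"),
    tel = first_match(meta, "^\\+?[0-9 ]{6,}$"),
    stringsAsFactors = FALSE
  )
}

rows <- read_html(sample_html) %>% html_nodes("tr")
contacts <- do.call(rbind, lapply(rows, parse_contact))

# write.csv2 uses ";" as the field separator
write.csv2(contacts, "contact.csv", row.names = FALSE)
```

Missing values stay as NA in the data frame and are written as NA in the file. The patterns are deliberately loose (anything containing "@" counts as an email, and a string of at least six digits or spaces with an optional leading "+" counts as a phone number), so they may need tightening for real data.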

Hope that helps,

Gottavianoni
