
I am scraping the Newark Liberty International Airport's website to keep track of their daily schedules. Here is the piece of code I have developed:

library(rvest)

url <- read_html('https://www.airport-ewr.com/newark-departures-terminal-C?tp=6&day=tomorrow')

population <- url %>% html_nodes(xpath = '//*[@id="flight_detail"]') %>% 
              html_text() %>% gsub(pattern = '\\t|\\r|\\n', replacement = ' ') %>% 
              trimws() %>% gsub(pattern = '\\s+', replacement = " ")

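The first gsub() replaces tabs, carriage returns, and newlines with spaces, trimws() removes leading and trailing whitespace, and the second gsub() collapses runs of whitespace into a single space. The code works well, and I have attached a snippet of the output: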

[image: snippet of the output]
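
To make the cleanup step concrete, here is a minimal sketch of the same three calls applied to a made-up string. The sample text is an assumption that only resembles a scraped row; it is not real output from the site.

```r
# Made-up sample resembling one scraped row (an assumption, not real output)
raw <- "\n\t Austin  (AUS)\n\n United Airlines \r\n UA 2427\t\n"

# Step 1: replace tabs, carriage returns and newlines with a space
step1 <- gsub(pattern = "\\t|\\r|\\n", replacement = " ", x = raw)

# Step 2: strip leading and trailing whitespace
step2 <- trimws(step1)

# Step 3: collapse every run of whitespace into a single space
clean <- gsub(pattern = "\\s+", replacement = " ", x = step2)

print(clean)
```

After these steps the newlines that separated the fields are gone, so the field boundaries can no longer be recovered from the cleaned string alone.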

I want to convert this character string into a dataframe which would contain values as shown below:

[image: desired data frame]

Any help is appreciated!

  • Can you please share the data as text? An image will not help people work on your problem. Commented Apr 4, 2018 at 20:53
  • Please do not show images of data, just give the data itself (preferably in an easy-to-copy format like dput(head(x))). This is absolutely a regular-expression problem, which means it will take a lot of work to make it robust. Is there another format in which you can retrieve that data? Commented Apr 4, 2018 at 20:55
  • It might be easier to parse it as XML. Commented Apr 4, 2018 at 21:22
  • You could use trimws. What is the algorithm for splitting this string? What have you tried? Commented Apr 4, 2018 at 21:33
  • Please post the snippet as plain text to make your example reproducible, so people can copy and paste it. That's the starting point for this question. Commented Apr 4, 2018 at 22:43
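
The dput(head(x)) suggestion from the comments can be sketched as follows. The vector below is a hypothetical stand-in for the scraped result, since the real data was only posted as an image.

```r
# Hypothetical stand-in for the scraped character vector
population <- c("Austin (AUS) United Airlines UA 2427",
                "Boston (BOS) United Airlines UA 1699")

# dput() prints R code that recreates the object, so it can be pasted
# into a question and run by anyone
dput(head(population))
```

Pasting the printed expression back into R rebuilds the same object, which is what makes the example reproducible.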

1 Answer


Try this out:

library(rvest)

url <- read_html('https://www.airport-ewr.com/newark-departures-terminal-C?tp=6&day=tomorrow')


population <- url %>% html_nodes(xpath = '//*[@id="flight_detail"]') %>% 
              html_text()

First we read in the raw text of each row, without collapsing the whitespace. I noticed that the columns within a row are separated by \n, but sometimes by more than one. So we first use gsub to collapse each run of \n into a single \n, then split each string on \n with strsplit, and finally rbind the pieces into a data.frame:

popDF <- as.data.frame(
  # collapse runs of "\n" into one, split each row on "\n", then stack the pieces
  do.call(rbind, strsplit(gsub("\\n+", "\n", population), split = "\n", fixed = TRUE))
)


  V1               V2                V3      V4       V5                V6 V7      V8                       V9
1      Austin  (AUS)   United Airlines  UA 2427 06:00 am Depart:  06:00 am  C Term. C  Scheduled - On-time [+]
2      Austin  (AUS)               SAS  SK 6868 06:00 am Depart:  06:00 am  C Term. C  Scheduled - On-time [+]
3      Boston  (BOS)   United Airlines  UA 1699 06:00 am Depart:  06:00 am  C Term. C  Scheduled - On-time [+]
4    Columbus  (CMH)          CommutAir C5 4973 06:00 am Depart:  06:00 am  C Term. C  Scheduled - On-time [+]
5    Columbus  (CMH)   United Airlines  UA 4973 06:00 am Depart:  06:00 am  C Term. C  Scheduled - On-time [+]
6     Detroit  (DTW)  Republic Airlines YX 3482 06:00 am Depart:  06:00 am  C Term. C  Scheduled - On-time [+]
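
A possible follow-up, sketched under assumptions: the empty V1 column can be dropped, stray whitespace trimmed, and the columns named. The two rows below are rebuilt by hand from the output above, and the column names are guesses based on the printed values rather than anything the website provides.

```r
# Two rows rebuilt by hand from the output above; the leading empty string
# mirrors the empty V1 column
rows <- list(
  c("", "Austin  (AUS)", " United Airlines ", "UA 2427", "06:00 am",
    "Depart:  06:00 am", "C", "Term. C", " Scheduled - On-time [+]"),
  c("", "Boston  (BOS)", " United Airlines ", "UA 1699", "06:00 am",
    "Depart:  06:00 am", "C", "Term. C", " Scheduled - On-time [+]")
)
popDF <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Drop the empty first column
popDF <- popDF[, -1]

# Trim each value and collapse inner runs of whitespace, column by column
popDF[] <- lapply(popDF, function(x) gsub("\\s+", " ", trimws(x)))

# Hypothetical column names, guessed from the printed values
names(popDF) <- c("destination", "airline", "flight", "scheduled",
                  "departure", "terminal_code", "terminal", "status")

str(popDF)
```

Note that rbind only lines up correctly when every row splits into the same number of pieces, so it may be worth checking that with lengths() on the strsplit result before binding.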