parse text file into a data frame

Question

I need to parse a haproxy log file into a data frame. A line of the log file looks something like this:

Feb  6 12:14:14 localhost \
      haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in \
      static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} \
      {} "GET /index.html HTTP/1.1"

The format definition is (from here):

 Field   Format                                Extract from the example above
      1   process_name '[' pid ']:'                            haproxy[14389]:
      2   client_ip ':' client_port                             10.0.1.2:33317
      3   '[' accept_date ']'                       [06/Feb/2009:12:14:14.655]
      4   frontend_name                                                http-in
      5   backend_name '/' server_name                             static/srv1
      6   Tq '/' Tw '/' Tc '/' Tr '/' Tt*                       10/0/30/69/109
      7   status_code                                                      200
      8   bytes_read*                                                     2750
      9   captured_request_cookie                                            -
     10   captured_response_cookie                                           -
     11   termination_state                                               ----
     12   actconn '/' feconn '/' beconn '/' srv_conn '/' retries*    1/1/1/1/0
     13   srv_queue '/' backend_queue                                      0/0
     14   '{' captured_request_headers* '}'                   {haproxy.1wt.eu}
     15   '{' captured_response_headers* '}'                                {}
     16   '"' http_request '"'                      "GET /index.html HTTP/1.1"

My current thinking is to loop through the file line-by-line, parsing each line and appending it to a data frame:

read.haproxy <- function(filename)
{
  process_name   <- c()
  client_ip      <- c()
  ...
  http_request   <- c()

  con <- file(filename, 'r')
  on.exit(close(con))  # close the connection when the function exits
  # read chunks of up to 1000 lines until readLines() returns nothing
  while (length(input <- readLines(con, n = 1000)) > 0) {
    for (i in seq_along(input)) {
      # regex to split line into variables
      # append values to vectors
    }
  }
  # append vector to dataframe and return
}

Question: Is this approach valid, or will it be inefficient? Is there a more conventional R way to do this?

It might be faster to read in the entire file and then process your object line-by-line. Next, since you apparently can use the same regex on every line, you might be able to build a function using Vectorize(regex(stuff)). And finally, if you're up for some fun :-), you could code it up in C and use Rcpp for potentially some more speed improvement.
– Carl Witthoft, Commented Dec 5, 2014 at 14:17
How would you read the entire file? Would you mind adding some skeleton code to an answer so that I can credit you for it?
– Chris Snow, Commented Dec 5, 2014 at 15:04
Chris, just foo <- readLines(con) should put the entire file into a vector; each element foo[j] is one record. Just loop over seq_along(foo), applying your regex to each record in turn. HTH :-)
– Carl Witthoft, Commented Dec 5, 2014 at 15:32
Performance is very slow with a pure R implementation; it seems like Rcpp may be the best option.
– Chris Snow, Commented Dec 6, 2014 at 8:43
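
The read-everything-then-match idea from the comments does not need an explicit loop: base R's regex functions accept a whole character vector. Below is a minimal sketch, assuming R 3.4.0 or later (for utils::strcapture) and covering only the first five fields. The function name parse_haproxy_lines and the pattern are illustrative assumptions, not part of the original post.

```r
# Hypothetical helper: parse the leading haproxy fields from all lines at once.
# Lines that do not match the pattern come back as a row of NA values.
parse_haproxy_lines <- function(lines) {
  pattern <- paste0(
    "([[:alpha:]]+)\\[([[:digit:]]+)\\]:[[:space:]]+",  # process_name[pid]:
    "([[:digit:].]+):([[:digit:]]+)[[:space:]]+",       # client_ip:client_port
    "\\[([^]]+)\\]"                                     # [accept_date]
  )
  # The prototype gives each capture group a column name and a type.
  proto <- data.frame(
    process_name = character(),
    pid          = integer(),
    client_ip    = character(),
    client_port  = integer(),
    accept_date  = character(),
    stringsAsFactors = FALSE
  )
  utils::strcapture(pattern, lines, proto)
}

# In real use the input would come from readLines() on the log file;
# two literal lines stand in for it here.
lines <- c(
  "Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} \"GET /index.html HTTP/1.1\"",
  "not a haproxy line"
)
parsed <- parse_haproxy_lines(lines)
```

Because the regex is applied to the whole vector in one call, no per-line appending to vectors is needed; the remaining fields could be added by extending the pattern and the prototype in step.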

Jim · Accepted Answer · 2014-12-05 18:57:09Z

rex has a vignette for parsing server logs. While the format is not exactly the same as your log, you should be able to adapt it to your case fairly easily.

As for reading the log in: assuming the file fits in memory, your best bet is to read the whole file first with readLines(). The following will then put each field into a data.frame column.

x <- "Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} \"GET /index.html HTTP/1.1\""
library(rex)
re <- rex(
  # process_name '[' pid ']:'
  capture(name = "process_name", alphas),
  "[",
    capture(name = "pid", digits),
  "]:",
  spaces,
  # client_ip ':' client_port
  capture(name = "client_ip", any_of(digit, ".")),
  ":",
  capture(name = "client_port", digits),
  spaces,
  "[",
    capture(name = "accept_date", except_some_of("]")),
  "]",
  spaces,
  capture(name = "frontend_name", non_spaces),
  spaces,
  capture(name = "backend_name", except_some_of("/")),
  "/",
  capture(name = "server_name", non_spaces),
  spaces,
  # timers: Tq '/' Tw '/' Tc '/' Tr '/' Tt
  capture(name = "Tq", some_of("-", digit)),
  "/",
  capture(name = "Tw", some_of("-", digit)),
  "/",
  capture(name = "Tc", some_of("-", digit)),
  "/",
  capture(name = "Tr", some_of("-", digit)),
  "/",
  capture(name = "Tt", some_of("+", digit)),
  spaces,
  capture(name = "status_code", digits),
  spaces,
  capture(name = "bytes_read", some_of("+", digit)),
  spaces,
  capture(name = "captured_request_cookie", non_spaces),
  spaces,
  capture(name = "captured_response_cookie", non_spaces),
  spaces,
  capture(name = "termination_state", non_spaces),
  spaces,
  # connection counts: actconn '/' feconn '/' beconn '/' srv_conn '/' retries
  capture(name = "actconn", digits),
  "/",
  capture(name = "feconn", digits),
  "/",
  capture(name = "beconn", digits),
  "/",
  capture(name = "srv_conn", digits),
  "/",
  capture(name = "retries", some_of("+", digit)),
  spaces,
  capture(name = "srv_queue", digits),
  "/",
  capture(name = "backend_queue", digits),
  spaces,
  # captured headers may be empty, hence except_any_of
  "{",
    capture(name = "captured_request_headers", except_any_of("}")),
  "}",
  spaces,
  "{",
    capture(name = "captured_response_headers", except_any_of("}")),
  "}",
  spaces,
  double_quote,
    capture(name = "http_request", non_quotes),
  double_quote)

re_matches(x, re)

#>   process_name   pid client_ip client_port              accept_date
#> 1      haproxy 14389  10.0.1.2       33317 06/Feb/2009:12:14:14.655
#>   frontend_name backend_name server_name Tq Tw Tc Tr  Tt status_code
#> 1       http-in       static        srv1 10  0 30 69 109         200
#>   bytes_read captured_request_cookie captured_response_cookie
#> 1       2750                       -                        -
#>   termination_state actconn feconn beconn srv_conn retries srv_queue
#> 1              ----       1      1      1        1       0         0
#>   backend_queue captured_request_headers captured_response_headers
#> 1             0                   1wt.eu                          
#>               http_request
#> 1 GET /index.html HTTP/1.1
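
Regex captures are text, so the numeric fields will most likely need converting before they can be summarised or plotted. Here is a minimal sketch of that step in base R, assuming the matched columns come back as character strings. The small data frame is a hand-built stand-in for a few columns of the result above, and the choice of which columns to treat as integers is an assumption.

```r
# Stand-in for a few columns of the parsed result; every column is character.
parsed <- data.frame(
  process_name = "haproxy",
  pid          = "14389",
  client_port  = "33317",
  Tt           = "+109",   # Tt and bytes_read may carry a leading "+"
  status_code  = "200",
  bytes_read   = "2750",
  stringsAsFactors = FALSE
)

# Columns assumed to hold integers, optionally prefixed with "+".
int_cols <- c("pid", "client_port", "Tt", "status_code", "bytes_read")

# Strip a leading "+" and convert; all other columns are left untouched.
parsed[int_cols] <- lapply(
  parsed[int_cols],
  function(col) as.integer(sub("^\\+", "", col))
)

str(parsed)
```

Converting all the numeric columns in one lapply() call keeps the step vectorised, in the same spirit as matching every line with a single regex.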

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
Valid point, I added a full example of parsing the given text.
