I need to parse a HAProxy log file into a data frame. A log line looks something like this:

Feb  6 12:14:14 localhost \
      haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in \
      static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} \
      {} "GET /index.html HTTP/1.1"

The format definition (from the HAProxy documentation) is:

 Field   Format                                Extract from the example above
      1   process_name '[' pid ']:'                            haproxy[14389]:
      2   client_ip ':' client_port                             10.0.1.2:33317
      3   '[' accept_date ']'                       [06/Feb/2009:12:14:14.655]
      4   frontend_name                                                http-in
      5   backend_name '/' server_name                             static/srv1
      6   Tq '/' Tw '/' Tc '/' Tr '/' Tt*                       10/0/30/69/109
      7   status_code                                                      200
      8   bytes_read*                                                     2750
      9   captured_request_cookie                                            -
     10   captured_response_cookie                                           -
     11   termination_state                                               ----
     12   actconn '/' feconn '/' beconn '/' srv_conn '/' retries*    1/1/1/1/0
     13   srv_queue '/' backend_queue                                      0/0
     14   '{' captured_request_headers* '}'                           {1wt.eu}
     15   '{' captured_response_headers* '}'                                {}
     16   '"' http_request '"'                      "GET /index.html HTTP/1.1"

My current thinking is to loop through the file line-by-line, parsing each line and appending it to a data frame:

read.haproxy <- function(filename)
{
  process_name   <- c()
  client_ip      <- c()
  ...
  http_request   <- c()

  con <- file(filename, 'r')
  on.exit(close(con))  # close the connection even if parsing fails
  # read up to 1000 lines at a time until the file is exhausted
  while (length(input <- readLines(con, n = 1000)) > 0) {
    for (i in seq_along(input)) {
      # regex to split line into variables
      # append values to vectors
    }
  }
  # combine vectors into a data frame and return
}

Question: Is this approach valid, or will it be inefficient? Is there a more conventional R way to do this?

  • It might be faster to read in the entire file and then process the resulting object line by line. Since you can apparently use the same regex on every line, you might also be able to build a function using Vectorize(regex(stuff)). And finally, if you're up for some fun :-), you could code it up in C and use Rcpp for a potential further speed improvement. Commented Dec 5, 2014 at 14:17
  • How would you read the entire file? Would you mind adding some skeleton code to an answer so that I can credit you for it? Commented Dec 5, 2014 at 15:04
  • You should be able to find a solution using read.table . Commented Dec 5, 2014 at 15:32
  • Chris, just foo <- readLines(con) should put the entire file into a vector; each element foo[j] is one record. Just loop over the elements of foo, applying your regex to each record in turn. HTH :-) Commented Dec 5, 2014 at 15:32
  • Performance is very slow with a pure R implementation; it seems that Rcpp may be the best option. Commented Dec 6, 2014 at 8:43
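
The suggestion in the comments above (read everything, then apply one regex to all records) does not need an explicit loop. A minimal sketch, assuming base R's utils::strcapture (available from R 3.4.0) rather than any package mentioned in this thread; the in-memory sample, the malformed second line, and the choice to capture only some of the 16 fields are illustrative assumptions:

```r
# Sample input: the record from the question plus one malformed line.
# In practice: log_lines <- readLines("haproxy.log")  (hypothetical file name)
log_lines <- c(
  paste(
    "Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317",
    "[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109",
    "200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {}",
    "\"GET /index.html HTTP/1.1\""
  ),
  "not a haproxy record"
)

# One capture group per output column; fields 6 and 9-15 are skipped for brevity.
pattern <- paste0(
  "(\\w+)\\[(\\d+)\\]:\\s+",      # process_name '[' pid ']:'
  "([0-9.]+):(\\d+)\\s+",         # client_ip ':' client_port
  "\\[([^\\]]+)\\]\\s+",          # '[' accept_date ']'
  "(\\S+)\\s+",                   # frontend_name
  "([^/\\s]+)/(\\S+)\\s+",        # backend_name '/' server_name
  "\\S+\\s+",                     # timers (matched but not captured)
  "(-?\\d+)\\s+\\+?(\\d+)\\s",    # status_code, bytes_read
  ".*\"([^\"]*)\"\\s*$"           # '"' http_request '"'
)

# The prototype supplies column names and types, in capture-group order.
proto <- data.frame(
  process_name = character(), pid = integer(),
  client_ip = character(), client_port = integer(),
  accept_date = character(), frontend_name = character(),
  backend_name = character(), server_name = character(),
  status_code = integer(), bytes_read = numeric(),
  http_request = character(),
  stringsAsFactors = FALSE
)

# Vectorised over all lines; lines that do not match become rows of NA.
parsed <- utils::strcapture(pattern, log_lines, proto, perl = TRUE)
```

The regex is applied to the whole character vector in one call, so nothing is appended inside a loop, and rows that fail to match can be found afterwards with is.na(parsed$pid).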

1 Answer

rex has a vignette for parsing server logs. While the format is not exactly the same as your log, you should be able to adapt it to your case fairly easily.

As for reading the log: assuming the file fits in memory, your best bet is to read the whole file first with readLines(). The following will then put each field into a data.frame column.

x <- "Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} \"GET /index.html HTTP/1.1\""
library(rex)
re <- rex(
  # field 1: alphas (one or more letters) captures the whole process name
  capture(name = "process_name", alphas),
  "[",
    capture(name = "pid", digits),
  "]:",
  spaces,
  # field 2
  capture(name = "client_ip", any_of(digit, ".")),
  ":",
  capture(name = "client_port", digits),
  spaces,
  # field 3
  "[",
    capture(name = "accept_date", except_some_of("]")),
  "]",
  spaces,
  # fields 4 and 5
  capture(name = "frontend_name", non_spaces),
  spaces,
  capture(name = "backend_name", except_some_of("/")),
  "/",
  capture(name = "server_name", non_spaces),
  spaces,
  # field 6: timers
  capture(name = "Tq", some_of("-", digit)),
  "/",
  capture(name = "Tw", some_of("-", digit)),
  "/",
  capture(name = "Tc", some_of("-", digit)),
  "/",
  capture(name = "Tr", some_of("-", digit)),
  "/",
  capture(name = "Tt", some_of("+", digit)),
  spaces,
  # fields 7 to 11
  capture(name = "status_code", digits),
  spaces,
  capture(name = "bytes_read", some_of("+", digit)),
  spaces,
  capture(name = "captured_request_cookie", non_spaces),
  spaces,
  capture(name = "captured_response_cookie", non_spaces),
  spaces,
  capture(name = "termination_state", non_spaces),
  spaces,
  # fields 12 and 13: connection and queue counters
  capture(name = "actconn", digits),
  "/",
  capture(name = "feconn", digits),
  "/",
  capture(name = "beconn", digits),
  "/",
  capture(name = "srv_conn", digits),
  "/",
  capture(name = "retries", some_of("+", digit)),
  spaces,
  capture(name = "srv_queue", digits),
  "/",
  capture(name = "backend_queue", digits),
  spaces,
  # fields 14 and 15: captured headers (may be empty)
  "{",
    capture(name = "captured_request_headers", except_any_of("}")),
  "}",
  spaces,
  "{",
    capture(name = "captured_response_headers", except_any_of("}")),
  "}",
  spaces,
  # field 16
  double_quote,
    capture(name = "http_request", non_quotes),
  double_quote)

re_matches(x, re)

#>   process_name   pid client_ip client_port              accept_date
#> 1      haproxy 14389  10.0.1.2       33317 06/Feb/2009:12:14:14.655
#>   frontend_name backend_name server_name Tq Tw Tc Tr  Tt status_code
#> 1       http-in       static        srv1 10  0 30 69 109         200
#>   bytes_read captured_request_cookie captured_response_cookie
#> 1       2750                       -                        -
#>   termination_state actconn feconn beconn srv_conn retries srv_queue
#> 1              ----       1      1      1        1       0         0
#>   backend_queue captured_request_headers captured_response_headers
#> 1             0                   1wt.eu                          
#>               http_request
#> 1 GET /index.html HTTP/1.1
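
The advice above assumes the file fits in memory. If it does not, one option is to parse fixed-size chunks and bind the pieces once at the end instead of growing vectors line by line. A rough sketch, using base R's utils::strcapture (R 3.4.0 or later) rather than rex so that it stays self-contained; parse_chunk and read_haproxy_chunked are hypothetical helper names, only two fields are captured, and the in-memory connection stands in for a real file:

```r
# Parse one chunk of lines; only client_ip and client_port are captured here.
parse_chunk <- function(lines) {
  proto <- data.frame(client_ip = character(), client_port = integer(),
                      stringsAsFactors = FALSE)
  utils::strcapture("\\]:\\s+([0-9.]+):(\\d+)\\s", lines, proto, perl = TRUE)
}

# Read an already-open connection chunk by chunk and bind the pieces once.
# The connection must be open: readLines() on an unopened file() connection
# would reopen it and start from the first line on every call.
read_haproxy_chunked <- function(con, chunk_size = 1000L) {
  pieces <- list()
  while (length(lines <- readLines(con, n = chunk_size)) > 0) {
    pieces[[length(pieces) + 1L]] <- parse_chunk(lines)
  }
  do.call(rbind, pieces)
}

# Demo: the record from the question, repeated five times, read two at a time.
# For a real file: con <- file("haproxy.log", "r")  (hypothetical path)
record <- paste(
  "Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317",
  "[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109",
  "200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {}",
  "\"GET /index.html HTTP/1.1\""
)
con <- textConnection(rep(record, 5))
parsed <- read_haproxy_chunked(con, chunk_size = 2L)
close(con)
```

Each chunk is still parsed in a single vectorised call, so the only R-level loop runs once per chunk rather than once per line.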

2 Comments

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
Valid point, I added a full example of parsing the given text.
