I need to parse an HAProxy log file into a data frame. A log line looks something like this:
Feb 6 12:14:14 localhost \
haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in \
static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} \
{} "GET /index.html HTTP/1.1"
The format definition is (from here):
Field  Format                                                   Extract from the example above
1      process_name '[' pid ']:'                                haproxy[14389]:
2      client_ip ':' client_port                                10.0.1.2:33317
3      '[' accept_date ']'                                      [06/Feb/2009:12:14:14.655]
4      frontend_name                                            http-in
5      backend_name '/' server_name                             static/srv1
6      Tq '/' Tw '/' Tc '/' Tr '/' Tt*                          10/0/30/69/109
7      status_code                                              200
8      bytes_read*                                              2750
9      captured_request_cookie                                  -
10     captured_response_cookie                                 -
11     termination_state                                        ----
12     actconn '/' feconn '/' beconn '/' srv_conn '/' retries*  1/1/1/1/0
13     srv_queue '/' backend_queue                              0/0
14     '{' captured_request_headers* '}'                        {haproxy.1wt.eu}
15     '{' captured_response_headers* '}'                       {}
16     '"' http_request '"'                                     "GET /index.html HTTP/1.1"
My current thinking is to loop through the file line-by-line, parsing each line and appending it to a data frame:
read.haproxy <- function(filename)
{
  process_name <- c()
  client_ip <- c()
  ...
  http_request <- c()
  con <- file(filename, 'r')
  on.exit(close(con))  # release the connection when the function exits
  # read the file in chunks of 1000 lines until nothing is left
  while (length(input <- readLines(con, n = 1000)) > 0) {
    for (i in seq_along(input)) {
      # regex to split line into variables
      # append values to vectors
    }
  }
  # append vector to dataframe and return
}
Question: Is this approach valid, or will it be inefficient? Is there a more conventional R way to do this?
regex on every line, you might be able to build a function using Vectorize(regex(stuff)). And finally, if you're up for some fun :-), you could code it up in C and use Rcpp for potentially some more speed improvement.
read.table.
foo <- readLines(con) should put the entire file into a vector; each element foo[j] is one record. Just loop over length(foo), applying your regex to each record in turn. HTH :-)
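For what it's worth, the explicit loop can probably be avoided altogether, because regexec() and regmatches() in base R already work on a whole character vector at once. Below is a minimal sketch of that idea, not a tested production parser: the function name parse_haproxy, the column names, and the pattern itself are my own assumptions derived from the 16-field table in the question, and lines that do not match are silently dropped.

```r
# Hypothetical helper; the regular expression is an assumption built from the
# 16-field format table in the question, not an official HAProxy pattern.
parse_haproxy <- function(lines) {
  pattern <- paste0(
    "(\\S+)\\[(\\d+)\\]: ",          # 1     process_name[pid]:
    "(\\S+):(\\d+) ",                # 2     client_ip:client_port
    "\\[([^]]+)\\] ",                # 3     [accept_date]
    "(\\S+) ",                       # 4     frontend_name
    "([^/ ]+)/(\\S+) ",              # 5     backend_name/server_name
    "(\\S+) ",                       # 6     Tq/Tw/Tc/Tr/Tt (kept as one string)
    "(\\d+) ",                       # 7     status_code
    "(\\S+) ",                       # 8     bytes_read
    "(\\S+) (\\S+) ",                # 9-10  captured cookies
    "(\\S+) ",                       # 11    termination_state
    "(\\S+) ",                       # 12    actconn/feconn/beconn/srv_conn/retries
    "(\\S+) ",                       # 13    srv_queue/backend_queue
    "\\{([^}]*)\\} \\{([^}]*)\\} ",  # 14-15 captured headers
    "\"([^\"]*)\""                   # 16    http_request
  )
  cols <- c("process_name", "pid", "client_ip", "client_port", "accept_date",
            "frontend_name", "backend_name", "server_name", "timers",
            "status_code", "bytes_read", "captured_request_cookie",
            "captured_response_cookie", "termination_state", "conn_counts",
            "queues", "captured_request_headers", "captured_response_headers",
            "http_request")

  # One call matches every line; non-matching lines yield character(0)
  m <- regmatches(lines, regexec(pattern, lines))
  ok <- lengths(m) > 0
  if (!any(ok)) return(data.frame())

  # Each match is c(full_match, group1, ..., group19); stack them into a matrix
  fields <- do.call(rbind, m[ok])
  df <- as.data.frame(fields[, -1, drop = FALSE], stringsAsFactors = FALSE)
  names(df) <- cols
  df$status_code <- as.integer(df$status_code)
  df
}

# Usage with the example line from the question plus one junk line
example <- paste0(
  "Feb 6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 ",
  "[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 200 2750 ",
  "- - ---- 1/1/1/1/0 0/0 {1wt.eu} {} \"GET /index.html HTTP/1.1\""
)
df <- parse_haproxy(c(example, "not a haproxy line"))
```

For a real file this would be called as parse_haproxy(readLines(filename)). The slash-separated fields (timers, connection counts, queues) are left as single strings here; they could be split afterwards with strsplit() if separate columns are needed.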