I'm parsing Squid logs with Java. It seemed appropriate to use the URL class. This class, however, makes a DNS request, which slows down parsing considerably. Are there other easy ways to extract the hostname and port from a URL?

Conditions

  • the URL scheme might be omitted in Squid logs
  • an absent (default) port should be derived for the ftp, http and https protocols

Log example:

1288763851.129    295 10.10.100.10 TCP_MISS/200 435 GET http://win.mail.ru/cgi-bin/checknew? - DIRECT/217.69.128.52 text/plain
1288763881.110    275 10.10.100.10 TCP_MISS/200 434 GET http://win.mail.ru/cgi-bin/checknew? - DIRECT/217.69.128.52 text/plain
1288763883.093  60001 10.10.102.202 TCP_MISS/503 0 CONNECT www.update.microsoft.com:443 - DIRECT/- -
1288763884.301      0 10.10.102.202 NONE/400 3506 GET / - NONE/- text/html
1288763911.194    359 10.10.100.10 TCP_MISS/200 435 GET http://win.mail.ru/cgi-bin/checknew? - DIRECT/217.69.128.52 text/plain
1288763941.097    264 10.10.100.10 TCP_MISS/200 434 GET http://win.mail.ru/cgi-bin/checknew? - DIRECT/217.69.128.52 text/plain
1288763944.094  59777 10.10.102.202 TCP_MISS/503 0 CONNECT www.update.microsoft.com:443 - DIRECT/- -
1288763971.123    289 10.10.100.10 TCP_MISS/200 434 GET http://win.mail.ru/cgi-bin/checknew? - DIRECT/217.69.128.52 text/plain
1288764002.257   1421 10.10.100.10 TCP_MISS/200 435 GET http://win.mail.ru/cgi-bin/checknew? - DIRECT/217.69.128.52 text/plain

EDIT: I had to write my own parser class for this task. The idea is to use InetAddress if the string contains an IP address, or a plain string for hostnames.
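
The idea in the EDIT could be sketched roughly as follows. This is only an illustration, not the parser the asker actually wrote: the class name HostClassifier and the method asAddress are made up for the example, and only IPv4 literals are recognised. It relies on the documented behaviour that InetAddress.getByName only validates the format when it is given a literal IP address, so no name lookup should be triggered for such input.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.regex.Pattern;

public class HostClassifier {
    // One decimal octet in the range 0..255.
    private static final String OCTET = "(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)";

    // Strict dotted-quad pattern: exactly four octets separated by dots.
    private static final Pattern IPV4 =
            Pattern.compile("^" + OCTET + "(\\." + OCTET + "){3}$");

    /**
     * Returns an InetAddress for an IPv4 literal, or null when the input
     * looks like a host name (which is then kept as a plain string).
     * Hypothetical helper; IPv6 literals are not handled in this sketch.
     */
    public static InetAddress asAddress(String host) throws UnknownHostException {
        if (!IPV4.matcher(host).matches()) {
            return null; // treat as a host name, never resolve it
        }
        // For a literal address only the format is checked here.
        return InetAddress.getByName(host);
    }
}
```

Host names never reach InetAddress in this sketch, which is what keeps DNS out of the parsing loop.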

    I wrote galimatias, a Java URL parsing library that you can use for the job. Once it parses the URL, you can get the host and check whether it's a domain, an IPv4 address or an IPv6 address. It's still in its early stages but it's quite solid for this use case. Commented Jan 2, 2014 at 0:17

2 Answers

You could try Restlet's Reference class.

2 Comments

There is no restlet keyword in my Debian distribution. I need a more common solution.
If you program in Java, most libraries won't be bundled out of the box with distributions. If you're after an easy installation, you could consider a build/distribution system such as Maven (Restlet has its own Maven repository, which you could easily configure in your project).

Use the java.net.URI class.
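
A minimal sketch of how that might look for the log fields in the question, under a few assumptions: the class SquidUrlParser, the holder HostPort and the helper defaultPort are names invented for this example, and the default ports cover only ftp, http and https. A scheme-less field such as "www.update.microsoft.com:443" is prefixed with "//" before parsing, because otherwise the part before the colon would be taken as the scheme. java.net.URI only parses the string and does not resolve the host.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class SquidUrlParser {

    /** Hypothetical result holder; port is -1 when it cannot be derived. */
    public static final class HostPort {
        public final String host;
        public final int port;

        HostPort(String host, int port) {
            this.host = host;
            this.port = port;
        }
    }

    /** Default port for the schemes named in the question, otherwise -1. */
    static int defaultPort(String scheme) {
        if (scheme == null) {
            return -1;
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "ftp":   return 21;
            case "http":  return 80;
            case "https": return 443;
            default:      return -1;
        }
    }

    /** Returns null when the field carries no host, for example "/". */
    public static HostPort parse(String field) throws URISyntaxException {
        if (field.startsWith("/")) {
            return null; // path only, as in "GET /"
        }
        // "host:443" alone would be read as scheme "host", so turn a
        // scheme-less field into a network-path reference first.
        String candidate = field.contains("://") ? field : "//" + field;
        URI uri = new URI(candidate);
        String host = uri.getHost();
        if (host == null) {
            return null; // URI could not identify a host in the authority
        }
        int port = uri.getPort(); // -1 when the field has no explicit port
        if (port == -1) {
            port = defaultPort(uri.getScheme());
        }
        return new HostPort(host, port);
    }
}
```

With the sample log above, the GET lines would give win.mail.ru with port 80 and the CONNECT lines www.update.microsoft.com with port 443. A field that has neither scheme nor port keeps port -1 here rather than guessing a protocol.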

3 Comments

I'm not surprised. It does parse "update.microsoft.com:443", which is about the only way you can use that string in Java.
Do you mean extra quotes are needed?
I mean a protocol is needed. SO turned that into a link so you can't see what I actually typed.
