I like using curl and the command line to process HTML pages.
Relative URLs are a pain.
Is there an easy utility that makes all relative URLs absolute?
Ideally this would look something like
curlabsolute $URL | process
What you need is the wget utility:
Let's say we need to download a simple web-page given by http://www.littlewebhut.com/articles/simple_web_page/.
The command (the URL below is real, so the command can be tested as is):
wget -O simple_page -k http://www.littlewebhut.com/articles/simple_web_page/
-O (--output-document=file) - The documents will not be written to the appropriate files, but all will be concatenated together and written to file.
-k (--convert-links) - After the download is complete, convert the links in the document to make them suitable for local viewing. Links to files that wget did not download are rewritten to include the host name and absolute path, which is why the relative links below become absolute.
Here is an HTML fragment from that page before downloading (online version):
...
<ul>
<li><a href="/" class="color-menu">Home</a></li>
<li><a href="/html/" class="color-menu">HTML</a></li>
<li><a href="/css/" class="color-menu">CSS</a></li>
<li><a href="/javascript/" class="color-menu">JavaScript/jQuery</a></li>
<li><a href="/inkscape/" class="color-menu">Inkscape</a></li>
<li><a href="/gimp/" class="color-menu">GIMP</a></li>
<li><a href="/blender/" class="color-menu">Blender</a></li>
<li><a href="/articles/" class="color-menu">Articles</a></li>
<li><a href="/contact/" class="color-menu">Contact</a></li>
</ul>
The same fragment after downloading, saved in the file simple_page:
...
<ul>
<li><a href="http://www.littlewebhut.com/" class="color-menu">Home</a></li>
<li><a href="http://www.littlewebhut.com/html/" class="color-menu">HTML</a></li>
<li><a href="http://www.littlewebhut.com/css/" class="color-menu">CSS</a></li>
<li><a href="http://www.littlewebhut.com/javascript/" class="color-menu">JavaScript/jQuery</a></li>
<li><a href="http://www.littlewebhut.com/inkscape/" class="color-menu">Inkscape</a></li>
<li><a href="http://www.littlewebhut.com/gimp/" class="color-menu">GIMP</a></li>
<li><a href="http://www.littlewebhut.com/blender/" class="color-menu">Blender</a></li>
<li><a href="http://www.littlewebhut.com/articles/" class="color-menu">Articles</a></li>
<li><a href="http://www.littlewebhut.com/contact/" class="color-menu">Contact</a></li>
</ul>
/hello.html. Using URLs like this means that the website can be served from multiple domain names, using different protocols, or easily moved between domains. The downside is that the URLs are no longer a unique identifier of a resource; they can only be interpreted together with the base page that you fetched.
A tool like curlabsolute, which both fetches a page and absolutizes the URLs, has access to that base URL.
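To make the point about the base URL concrete, here is a minimal Python sketch of what a curlabsolute-style tool might do. The names absolutize and fetch_absolute are hypothetical and appear in no existing utility. The sketch uses a regular expression rather than a real HTML parser, so it assumes quoted href/src attributes and ignores any <base> tag; treat it as an illustration, not a robust implementation.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Matches quoted href="..." / src='...' attributes (a rough heuristic,
# not a full HTML parser; unquoted attribute values are left untouched).
ATTR_RE = re.compile(
    r"""(?P<prefix>(?<![\w-])(?:href|src)\s*=\s*)(?P<quote>["'])(?P<url>.*?)(?P=quote)""",
    re.IGNORECASE,
)


def absolutize(html, base_url):
    """Return html with href/src values resolved against base_url."""

    def fix(match):
        # urljoin leaves already-absolute URLs unchanged.
        absolute = urljoin(base_url, match.group("url"))
        quote = match.group("quote")
        return f"{match.group('prefix')}{quote}{absolute}{quote}"

    return ATTR_RE.sub(fix, html)


def fetch_absolute(url):
    """Fetch url and return its HTML with absolute links (hypothetical helper)."""
    with urlopen(url) as response:
        base_url = response.geturl()  # final URL after any redirects
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return absolutize(html, base_url)
```

Calling absolutize('<a href="/html/">HTML</a>', 'http://www.littlewebhut.com/articles/simple_web_page/') should rewrite the link to http://www.littlewebhut.com/html/, matching what wget -k produced above. The result of fetch_absolute(url) could be printed to stdout and piped into further processing.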