I have several HTML files, each containing a tag whose text I want to use as the name of another file. Example HTML file:

    <div class="top">SomethingFile</div>
    <a href="../files/15d705df3.txt"/>

Desired output: I want the text of the "top" div (SomethingFile) to become the name of 15d705df3.txt:

    15d705df3.txt --> SomethingFile.txt

I have 800+ text and HTML files in this same format that I would like to rename. I have been trying to get this working with awk, sed, and grep, but I am stuck on extracting the two values (the new name and the current file name) into variables and using them to rename the file.

4 Answers

2

awk, sed, and grep are not the right tools for this task. Instead, I recommend an HTML-aware tool such as xmllint:

xmllint --html --xpath '/Xpath/expression' file.html

with an XPath expression.

For your sample file, that is:

xmllint --html --xpath '//div[@class="top"]/text()' file.html

Finally, loop over the files:

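for f in *.html *.txt; do
    filename=$(xmllint --html --xpath '//div[@class="top"]/text()' "$f" 2>/dev/null)
    # Skip files where the XPath matched nothing, so they are not renamed to ".txt"
    [ -n "$filename" ] && mv "$f" "$filename.txt"
done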
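
The question asks for the file named in the href to be renamed, not the HTML page itself. A minimal sketch of that variant follows. It assumes each page has one div with class "top", that the first a element's href points at the file to rename, and that the href is relative to the page's own directory. The function name rename_href_target is hypothetical, and the extension is taken from the linked file rather than hard-coded.

```shell
#!/bin/sh
# Rename the file that the first <a href="..."> in page $1 points at,
# using the text of the page's <div class="top"> as the new base name.
rename_href_target() {
    page=$1
    # string(...) yields the first match as text, or "" when nothing matches.
    name=$(xmllint --html --xpath 'string(//div[@class="top"])' "$page" 2>/dev/null)
    href=$(xmllint --html --xpath 'string(//a/@href)' "$page" 2>/dev/null)
    if [ -z "$name" ] || [ -z "$href" ]; then
        echo "skip (div or href missing): $page" >&2
        return 1
    fi
    # Assumption: the href is relative to the directory holding the page.
    target=$(dirname "$page")/$href
    base=$(basename "$target")
    case $base in
        *.*) ext=.${base##*.} ;;   # keep the linked file's extension
        *)   ext= ;;               # no extension to keep
    esac
    new=$(dirname "$target")/$name$ext
    # Refuse to act if the linked file is absent or the new name is taken.
    if [ ! -f "$target" ] || [ -e "$new" ]; then
        echo "skip (target missing or new name taken): $page" >&2
        return 1
    fi
    mv -- "$target" "$new"
}

# Process every page given on the command line, e.g. *.html
for page in "$@"; do
    rename_href_target "$page"
done
```

Saved as, say, rename.sh (a hypothetical name), it would be run as sh rename.sh *.html from the directory holding the pages. Names containing a slash are not handled here.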

1 Comment

I just realised that you want SomethingFile, not 15d705df3.txt. Post edited accordingly.
0

Loop over the files, use sed to extract the new name of the file and then rename the file.

for file in *
do
    name=$(sed -n 's|.*<div class="top">\(.*\)</div>|\1|p' "$file")
    mv "$file" "$name.txt"
done
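
When a file has no matching div, the loop above ends up with an empty name and renames the file to ".txt", so later files overwrite earlier ones. A more defensive sketch of the same sed approach follows. It assumes the div and its text sit on a single line, as in the question's sample, and it uses only the first match per file. The function names top_div_text and rename_by_top_div are hypothetical.

```shell
#!/bin/sh
# Print the text of the first <div class="top">...</div> found in file $1.
# Assumes the whole element sits on a single line.
top_div_text() {
    sed -n 's|.*<div class="top">\([^<]*\)</div>.*|\1|p' "$1" | head -n 1
}

# Rename each file given as an argument to "<div text>.txt", skipping
# files with no match and never overwriting an existing file.
rename_by_top_div() {
    for file in "$@"; do
        name=$(top_div_text "$file")
        if [ -z "$name" ]; then
            echo "skip (no top div): $file" >&2
            continue
        fi
        new=$(dirname "$file")/$name.txt
        if [ -e "$new" ]; then
            echo "skip (target exists): $new" >&2
            continue
        fi
        mv -- "$file" "$new"
    done
}

rename_by_top_div "$@"
```

Passing the files explicitly (for example *.html) instead of looping over * keeps directories and unrelated files out of the loop.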

2 Comments

The bad Cthulhu way :/ codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html You don't even know whether there's one or N matching items per file.
Thanks for the inspiration.
0

One solution using Perl with the help of the HTML parser HTML::TokeParser:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TokeParser;
use File::Spec;

my ($newfile, $currentfile);

## Give as arguments the html files to process, like *.html
for ( @ARGV ) { 
    my $p = HTML::TokeParser->new( $_ ) or die qq|Cannot open $_: $!\n|;

    ## Search a "div" tag with the attribute "class" to value "top".
    while ( my $info = $p->get_tag( 'div' ) ) { 
        if ( ( $info->[1]{class} // '' ) eq 'top' ) { 

            $newfile = $p->get_text;

            ## Skip the next two tokens (</div>, whitespace); the third is the following "a" tag.
            $info = $p->get_token for 1 .. 3;

            ## If tag is a start 'a' tag, extract file name of the href attribute.
            if ( $info->[0] eq 'S' &&
                 $info->[1] eq 'a' ) { 
                $currentfile = ( File::Spec->splitpath( $info->[2]{href} ) )[2];
                $newfile .= join q||, (split /(\.)/, $currentfile)[-2 .. -1];
            }   
            last;
        }   
    }   

    ## Rename file.
    if ( $newfile && $currentfile ) { 
        printf STDERR qq|Renaming --> %s <-- to --> %s <--\n|, $currentfile, $newfile;
        rename $currentfile, $newfile or warn qq|Cannot rename $currentfile: $!\n|;
    }   
    $newfile = $currentfile = undef;
}

Run it like:

perl-5.14.2 script.pl *.html

The output from one of my tests looks similar to:

Renaming --> 15d705df3.txt <-- to --> SomethingFile1.txt <--
Renaming --> 15d705dg6.txt <-- to --> SomethingFile2.txt <--

0

An answer inspired by @sputnick but using Xmlstarlet instead of xmllint.

xml sel -T -t -o "mv " -f -o " " -t -v 'string(//div[@class="top"])' -o ".txt" -nl *.html 

Gives:

mv t.html SomethingFile.txt
mv tt.html SomethingElse.txt

When you're happy with the commands it prints, pipe them to sh to run them:

xml sel -T -t -o "mv " -f -o " " -t -v 'string(//div[@class="top"])' -o ".txt" -nl *.html | sh

All credit to @sputnick for sowing the seed and enabling me to piggy back.
