Extract string from HTML text file and use it to rename a file

Question

I have several HTML files, each containing a tag whose text I want to use as the name of another file. Example HTML file:

    <div class="top">SomethingFile</div>
    <a href="../files/15d705df3.txt"/>

Desired output: I want the text of the SomethingFile tag to become the name of 15d705df3.txt:

    15d705df3.txt --> SomethingFile.txt

I have more than 800 text and HTML files in this format that I would like to rename. I have been trying to get this working with awk, sed, and grep, but I am stuck on extracting the two values into variables and using them to rename the file.

Gilles Quénot · Accepted Answer · 2013-02-21 19:33:56Z

2

awk, sed, and grep are not the right tools for this task. Instead, I recommend

xmllint --html --xpath '/Xpath/expression' file.html

with an XPath expression.

Basically

xmllint --html --xpath '//div[@class="top"]/text()' file.html

Finally

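for f in *.html *.txt; do
    # Text of <div class="top">; parser warnings are discarded
    filename=$(xmllint --html --xpath '//div[@class="top"]/text()' "$f" 2>/dev/null)
    # Skip files without the tag instead of renaming them to ".txt"
    [ -n "$filename" ] || continue
    mv -- "$f" "$filename.txt"
done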

edited Feb 21, 2013 at 19:33

answered Feb 20, 2013 at 15:14

Gilles Quénot


1 Comment

Gilles Quénot Over a year ago

I just realised that you want SomethingFile, not 15d705df3.txt. Post edited accordingly.

dogbane · Accepted Answer · 2013-02-20 15:16:40Z

0

Loop over the files, use sed to extract the new name of the file and then rename the file.

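for file in *
do
    # Text of the first <div class="top"> found in the file
    name=$(sed -n 's|.*<div class="top">\([^<]*\)</div>.*|\1|p' "$file" | head -n 1)
    # Skip files without the tag instead of renaming them to ".txt"
    [ -n "$name" ] || continue
    mv -- "$file" "$name.txt"
done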
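
The question asks for the file named in the `href` to take the new name, which means extracting two values per HTML file. Here is a minimal sketch under that reading, assuming each tag sits on a single line as in the example and that the `href` is relative to the HTML file's directory. The helper names `top_name`, `first_href`, and `rename_from_html`, and the `DRY_RUN` variable, are made up for illustration. By default it only prints the `mv` commands.

```shell
#!/bin/sh
# Hypothetical helper: print the text of <div class="top"> from the first
# line of file $1 that contains such a tag.
top_name() {
    sed -n 's|.*<div class="top">\([^<]*\)</div>.*|\1|p' "$1" | head -n 1
}

# Hypothetical helper: print the href value from the first line of file $1
# that contains an <a href="..."> tag.
first_href() {
    sed -n 's|.*<a href="\([^"]*\)".*|\1|p' "$1" | head -n 1
}

# Print (default) or perform the rename described by HTML file $1.
# DRY_RUN is an assumed switch: set DRY_RUN=0 to rename for real.
rename_from_html() {
    name=$(top_name "$1")
    href=$(first_href "$1")
    # Skip files where either value is missing.
    [ -n "$name" ] && [ -n "$href" ] || return 0
    # Resolve the href relative to the HTML file's directory.
    src="$(dirname "$1")/$href"
    dst="$(dirname "$src")/$name.txt"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf 'mv %s %s\n' "$src" "$dst"
    else
        mv -- "$src" "$dst"
    fi
}

for f in *.html; do
    [ -e "$f" ] || continue   # no .html files: the glob stays literal
    rename_from_html "$f"
done
```

Check the printed commands first, then rerun with `DRY_RUN=0` in the environment. Tags that span several lines are beyond what line-based sed can handle, and a real HTML parser is the safer choice there.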

answered Feb 20, 2013 at 15:16

dogbane


2 Comments

Gilles Quénot Over a year ago

The bad cthulhu way :/ codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html You don't even know if there's one or N items per file

roachmmflhyr Over a year ago

Thanks for the inspiration.

Birei · Accepted Answer · 2013-02-20 16:17:50Z

One solution using Perl with the HTML parser HTML::TokeParser:

#!/usr/bin/env perl

use warnings;
use strict;
use HTML::TokeParser;
use File::Spec;

my ($newfile, $currentfile);

## Give as arguments the html files to process, like *.html
for ( @ARGV ) { 
    my $p = HTML::TokeParser->new( $_ ) or die;

    ## Search a "div" tag with the attribute "class" to value "top".
    while ( my $info = $p->get_tag( 'div' ) ) { 
        if ( defined $info->[1]{class} && $info->[1]{class} eq 'top' ) { 

            $newfile = $p->get_text;

            ## Omit next two tokens until following "a" tag (</div>, space).
            $info = $p->get_token for 1 .. 3;

            ## If tag is a start 'a' tag, extract file name of the href attribute.
            if ( $info->[0] eq 'S' &&
                 $info->[1] eq 'a' ) { 
                $currentfile = ( File::Spec->splitpath( $info->[2]{href} ) )[2];
                $newfile .= join q||, (split /(\.)/, $currentfile)[-2 .. -1];
            }   
            last;
        }   
    }   

    ## Rename file.
    if ( $newfile && $currentfile ) { 
        printf STDERR qq|Renaming --> %s <-- to --> %s <--\n|, $currentfile, $newfile;
        rename $currentfile, $newfile or warn qq|Cannot rename $currentfile: $!\n|;
    }   
    $newfile = $currentfile = undef;
}

Run it like:

perl-5.14.2 script.pl *.html

The result in one of my tests looks similar to:

Renaming --> 15d705df3.txt <-- to --> SomethingFile1.txt <--
Renaming --> 15d705dg6.txt <-- to --> SomethingFile2.txt <--

sotapme · Accepted Answer · 2013-02-20 18:46:20Z

0

An answer inspired by @sputnick, but using XMLStarlet instead of xmllint.

xml sel -T -t -o "mv " -f -o " " -t -v 'string(//div[@class="top"])' -o ".txt" -nl *.html

Gives:

mv t.html SomethingFile.txt
mv tt.html SomethingElse.txt

When you're happy with what it will do, pipe the output to sh:

xml sel -T -t -o "mv " -f -o " " -t -v 'string(//div[@class="top"])' -o ".txt" -nl *.html | sh
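
Piping generated text to `sh` executes whatever the extracted names contain, so a name with a space or a `;` changes the command that runs. Here is a minimal sketch of an alternative, assuming some earlier step prints one tab-separated `old<TAB>new` pair per line. The function name `rename_pairs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "old<TAB>new" pairs from stdin and rename each file.
# The names reach mv as plain arguments and are never evaluated as shell code.
rename_pairs() {
    tab=$(printf '\t')
    while IFS="$tab" read -r old new; do
        # Skip incomplete pairs and sources that do not exist.
        [ -n "$old" ] && [ -n "$new" ] && [ -e "$old" ] || continue
        mv -- "$old" "$new"
    done
}
```

For example, `printf 't.html\tSomethingFile.txt\n' | rename_pairs` renames `t.html` if it exists. File names that themselves contain tabs or newlines are not handled by this sketch.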

All credit to @sputnick for sowing the seed and enabling me to piggy back.

answered Feb 20, 2013 at 18:46

sotapme

