Extracting string from html text

Question

I am getting HTML with curl and need to extract only the second table element. Note that the HTML returned by curl is a single string and is not formatted. For a better explanation, see the following (... stands for more HTML):

...
<table width="100%" cellpadding="0" cellspacing="0" class="table">
...
</table>
...
#I need to extract the following table
#from here
<table width="100%" cellpadding="4">
...
</table> #to this
...

I have tried several sed commands so far. I also suspect that matching the second table like this is not the cleanest approach:

sed -n '/<table width="100%" cellpadding="4"/,/table>/p'

Are you married to using sed? It would be more robust to use an HTML/XML parser instead.
– curusarn, Commented Nov 2, 2019 at 0:36
Do you want to print the whole <table> statement or just the contents inside it?
– curusarn, Commented Nov 2, 2019 at 1:01
I have put together a script that prints the table statement (see my answer below). Be sure to leave a comment if it doesn't work for you.
– curusarn, Commented Nov 2, 2019 at 12:56

Jotne · Accepted Answer · 2019-11-02 06:25:30Z

2

An HTML parser would be better, but you can use awk like this:

awk '/<table width="100%" cellpadding="4">/ {f=1} f; /<\/table>/ {f=0}' file
<table width="100%" cellpadding="4">
...
</table> #to this

/<table width="100%" cellpadding="4">/ {f=1} when the start line is found, set flag f to true.
f; if flag f is true, do the default action: print the line.
/<\/table>/ {f=0} when the end line is found, clear flag f to stop printing.
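
For readers who don't use awk, the same flag logic can be sketched in Python. This is only an illustration: the function name `extract_block` and the sample lines are made up here, and plain substring tests stand in for awk's regular expressions.

```python
def extract_block(lines, start_marker, end_marker):
    """Collect every run of lines from a start_marker line through the next end_marker line."""
    printing = False  # plays the role of awk's flag f
    selected = []
    for line in lines:
        if start_marker in line:
            printing = True        # start found: set the flag
        if printing:
            selected.append(line)  # flag set: keep the line
        if end_marker in line:
            printing = False       # end found: clear the flag
    return selected


# Hypothetical sample input with one tag per line
sample = [
    '<table width="100%" cellpadding="0" cellspacing="0" class="table">',
    '<tr><td>first</td></tr>',
    '</table>',
    '<table width="100%" cellpadding="4">',
    '<tr><td>second</td></tr>',
    '</table>',
]
result = extract_block(sample, '<table width="100%" cellpadding="4">', '</table>')
print("\n".join(result))
```

Like the awk version, this decides line by line, so it only isolates the table when the opening and closing tags sit on lines of their own.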

This range pattern could also be used, but I like the flag control better:

awk '/<table width="100%" cellpadding="4">/,/<\/table>/' file
<table width="100%" cellpadding="4">
...
</table> #to this
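
Both awk commands work line by line. The question says the HTML from curl is a single unformatted string; if the whole page really is one line, the start pattern matches that line and awk prints all of it, first table included. A minimal sketch of one workaround is to search by character offset instead of by line. The name `extract_table` and the sample page are assumptions for illustration, and the sketch matches the opening tag as an exact string and does not handle nested tables.

```python
def extract_table(html, start_tag, end_tag="</table>"):
    """Return the text from start_tag through the next end_tag, or None if either is missing."""
    start = html.find(start_tag)
    if start == -1:
        return None
    end = html.find(end_tag, start)
    if end == -1:
        return None
    return html[start:end + len(end_tag)]


# Hypothetical single-line input modeled on the question
page = (
    '<p>intro</p><table width="100%" cellpadding="0" cellspacing="0" class="table">'
    '<tr><td>first</td></tr></table><p>between</p>'
    '<table width="100%" cellpadding="4"><tr><td>second</td></tr></table><p>after</p>'
)
second = extract_table(page, '<table width="100%" cellpadding="4">')
print(second)
```

Because the opening tag is matched literally, any change in attribute order or spacing breaks the match, which is the usual argument for a real parser.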

answered Nov 2, 2019 at 6:25

Jotne


2 Comments

loyd Over a year ago

Thanks for your answer. For some reason it still starts printing at the first table, where cellpadding="0" :/

Jotne Over a year ago

@loyd I just copied and pasted the data from your post, and it works fine on Ubuntu 18.04. It should work on most systems.

curusarn · Accepted Answer · 2019-11-04 16:47:37Z

1

Save the script below as script.py and run it like this:

python3 script.py input.html

This script parses the HTML and checks for the attributes (width and cellpadding). The advantage of this approach is that if you change the formatting of the HTML file it will still work because the script parses the HTML instead of relying on exact string matching.

from html.parser import HTMLParser
import sys

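def print_tag(tag, attrs, end=False):
    # Rebuild the tag text from the parsed tag name and attribute pairs.
    line = "<"
    if end:
        line += "/"
    line += tag
    for attr, value in attrs:
        if value is None:
            # Attribute without a value, such as <td nowrap>
            line += " " + attr
        else:
            line += " " + attr + '="' + value + '"'
    print(line + ">", end="")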

if len(sys.argv) < 2:
    print("ERROR: expected argument - filename")
    sys.exit(1)

with open(sys.argv[1], 'r', encoding='cp1252') as content_file:
    content = content_file.read()

do_print = False

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global do_print
        if tag == "table":
            if ("width", "100%") in attrs and ("cellpadding", "4") in attrs:
                do_print = True
        if do_print:
            print_tag(tag, attrs)

    def handle_endtag(self, tag):
        global do_print
        if do_print:
            print_tag(tag, attrs=(), end=True)
            if tag == "table":
                do_print = False

    def handle_data(self, data):
        global do_print
        if do_print:
            print(data, end="")

parser = MyHTMLParser()
parser.feed(content)
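
A possible variant of the same idea, offered as a sketch rather than a replacement: it keeps the state on the parser object instead of in a global, returns the markup as a string, and counts nested tables so that an inner </table> does not end the extraction early. The class name `TableExtractor`, its `wanted` argument, and the sample page are assumptions made for this example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the markup of the first table whose attributes include `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted  # e.g. {"width": "100%", "cellpadding": "4"}
        self.depth = 0        # open <table> tags inside the matched table
        self.done = False     # set once the matched table has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            if self.depth > 0:
                self.depth += 1  # nested table inside the match
            elif all((k, v) in attrs for k, v in self.wanted.items()):
                self.depth = 1   # this is the table we want
        if self.depth > 0:
            # Copy the start tag exactly as it appeared in the source
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append("</" + tag + ">")
        if tag == "table":
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


# Hypothetical single-line page; attribute order differs from the question on purpose
page = (
    '<table width="100%" cellpadding="0"><tr><td>first</td></tr></table>'
    '<table cellpadding="4" width="100%"><tr><td>second'
    '<table><tr><td>inner</td></tr></table></td></tr></table><p>after</p>'
)
extractor = TableExtractor({"width": "100%", "cellpadding": "4"})
extractor.feed(page)
extractor.close()
extracted = "".join(extractor.parts)
print(extracted)
```

One limitation to be aware of: with the parser's default settings, character references in the text are decoded, so `&amp;` in the input comes out as `&` in the collected output.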

edited Nov 4, 2019 at 16:47

answered Nov 2, 2019 at 1:05

curusarn


5 Comments

loyd Over a year ago

Thanks for your time. I get the following error: Traceback (most recent call last): File "script.py", line 11, in <module> content = content_file.read() File "/usr/lib/python3.7/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 230: invalid start byte

curusarn Over a year ago

Try now. Do you know what encoding your file is in?

loyd Over a year ago

Thank you so much, works like a charm! I get all the content now. The only thing missing is every HTML tag inside the table, such as <tr> </tr>, <td> </td> and <thead> </thead>. Like I said, the content is there, but the formatting tags are missing.

curusarn Over a year ago

@loyd Oh, such a dumb mistake on my part. Try now.

loyd Over a year ago

I really appreciate your help and work! You helped me out with this one! Thanks for your time :) Have a good one!
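
On the decode error reported in the comments above: the byte 0xf6 is not a valid start byte in UTF-8, while in cp1252 it is the letter ö, which is consistent with the script working once it opened the file as cp1252. When the encoding is not known in advance, one hedged approach is to try a short list of candidates in order. The helper name `decode_bytes` and the candidate list are assumptions for this sketch, not part of the original script.

```python
def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Decode raw bytes with the first encoding in the list that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD
    return raw.decode(encodings[0], errors="replace")


# Hypothetical sample: a cp1252-encoded byte string containing 0xf6
text = decode_bytes(b"<td>sch\xf6n</td>")
print(text)
```

To use this with a file, read it in binary mode first, for example `open(path, "rb").read()`, and pass the result to the helper. Guessing is only a fallback; if the server declares a charset, that declaration is the better source.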
