I am fetching HTML with curl and need to extract only the second table element. Note that the fetched HTML is a single string and is not formatted. For a better explanation, see the following (... stands for more HTML):

...
<table width="100%" cellpadding="0" cellspacing="0" class="table">
...
</table>
...
#I need to extract the following table
#from here
<table width="100%" cellpadding="4">
...
</table> #to this
...

I have tried multiple sed commands so far. I also suspect that matching the second table like this is not the cleanest way:

sed -n '/<table width="100%" cellpadding="4"/,/table>/p'
  • Are you married to using sed? It would be more robust to use an HTML/XML parser instead. Commented Nov 2, 2019 at 0:36
  • No, whatever does the job would be great. Commented Nov 2, 2019 at 0:37
  • Do you want to print the whole <table> statement or just the contents inside the <table> statement? Commented Nov 2, 2019 at 1:01
  • @curusarn Whole statement Commented Nov 2, 2019 at 1:31
  • I have put together a script that prints the table statement. (see my answer below) Be sure to leave a comment if it doesn't work for you. Commented Nov 2, 2019 at 12:56

2 Answers

An HTML parser would be better, but you can use awk like this:

awk '/<table width="100%" cellpadding="4">/ {f=1} f; /<\/table>/ {f=0}' file
<table width="100%" cellpadding="4">
...
</table> #to this
  • /<table width="100%" cellpadding="4">/ {f=1} when the start pattern is found, set flag f to true.
  • f; if flag f is true, perform the default action, which prints the line.
  • /<\/table>/ {f=0} when the end pattern is found, clear flag f to stop printing.

A range pattern could also be used, but I like the flag control better:

awk '/<table width="100%" cellpadding="4">/,/<\/table>/' file
<table width="100%" cellpadding="4">
...
</table> #to this
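
For readers who prefer Python, the flag logic described above can be sketched as follows. This is only an illustration of the idiom, not part of the original answer; the function name extract_lines and the START/END patterns are assumptions chosen for this example. Like awk, it works line by line, so if the whole page arrives as one long line, that entire line is emitted, which may explain output that appears to start at the first table.

```python
import re

# Patterns mirror the awk example; both names are illustrative assumptions.
START = re.compile(r'<table width="100%" cellpadding="4">')
END = re.compile(r'</table>')


def extract_lines(lines):
    """Mimic the awk flag idiom: keep lines from START through END."""
    printing = False
    kept = []
    for line in lines:
        if START.search(line):
            printing = True       # awk: /start/ {f=1}
        if printing:
            kept.append(line)     # awk: f;
        if END.search(line):
            printing = False      # awk: /end/ {f=0}
    return kept
```

Calling extract_lines(open("input.html")) on a file with one tag per line keeps only the matching table. On single-line input it returns that one line unchanged.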

2 Comments

Thanks for your answer. For some reason it still prints at the first table where cellpadding="0" :/
@loyd I just cut and pasted the data from your post and it works fine on Ubuntu 18.04. It should work on most systems.

Save the script below as script.py and run it like this:

python3 script.py input.html

This script parses the HTML and checks for the attributes (width and cellpadding). The advantage of this approach is that if you change the formatting of the HTML file it will still work because the script parses the HTML instead of relying on exact string matching.

from html.parser import HTMLParser
import sys

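def print_tag(tag, attrs, end=False):
    # Rebuild a start tag (or an end tag when end=True) and print it without a newline.
    line = "</" if end else "<"
    line += tag
    for attr, value in attrs:
        if value is None:
            # Valueless attribute such as <td nowrap>: the parser reports None.
            line += " " + attr
        else:
            line += " " + attr + '="' + value + '"'
    print(line + ">", end="")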

if len(sys.argv) < 2:
    print("ERROR: expected argument - filename")
    sys.exit(1)

with open(sys.argv[1], 'r', encoding='cp1252') as content_file:
    content = content_file.read()

do_print = False

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        global do_print
        if tag == "table":
            if ("width", "100%") in attrs and ("cellpadding", "4") in attrs:
                do_print = True
        if do_print:
            print_tag(tag, attrs)

    def handle_endtag(self, tag):
        global do_print
        if do_print:
            print_tag(tag, attrs=(), end=True)
            if tag == "table":
                do_print = False

    def handle_data(self, data):
        global do_print
        if do_print:
            print(data, end="")

parser = MyHTMLParser()
parser.feed(content)
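
A possible variant of the script above, offered as a sketch rather than a replacement: it keeps the state on the parser instead of in a global, returns the markup as a string, stops after the first matching table, and counts nested <table> tags so that an inner table does not end the capture early. The names TableExtractor, extract_table and wanted are hypothetical. It relies on HTMLParser.get_starttag_text() to reuse each start tag as written in the input.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the markup of the first <table> whose attributes include `wanted`."""

    def __init__(self, wanted):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.wanted = set(wanted.items())
        self.depth = 0       # nesting level of <table> inside the match
        self.done = False    # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            if self.depth:
                self.depth += 1              # nested table inside the match
            elif self.wanted <= set(attrs):
                self.depth = 1               # start of the wanted table
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        self.parts.append("</" + tag + ">")
        if tag == "table":
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&" + name + ";")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#" + name + ";")


def extract_table(html, wanted):
    """Return the first matching table as a string, or '' if none matches."""
    parser = TableExtractor(wanted)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

With the question's attributes, the call would be extract_table(content, {"width": "100%", "cellpadding": "4"}). Because the input is tokenized rather than matched line by line, it does not matter whether the HTML is one long string.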

5 Comments

Thanks for your time. I get the following error: Traceback (most recent call last): File "script.py", line 11, in <module> content = content_file.read() File "/usr/lib/python3.7/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 230: invalid start byte
Try now. Do you know what encoding is your file in?
Thank you so much, works like a charm! I got all the content now. The only thing missing is every HTML tag inside the table tag, such as <tr> </tr>, <td> </td> and <thead> </thead>. Like I said, the content is there but the formatting tags are missing.
@loyd Oh, such a dumb mistake on my part. Try now.
I really appreciate your help and work! You helped me out with this one! Thanks for your time :) Have a good one!
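
On the UnicodeDecodeError reported in the comments above: byte 0xf6 is not valid where it appeared in UTF-8, but it is the letter ö in cp1252 and Latin-1, which fits the script's switch to encoding='cp1252'. When the encoding is not known in advance, one hedged approach is to try a short list of candidates in order. The sketch below assumes that UTF-8 followed by cp1252 is a sensible order for the pages in question; the names decode_html and read_html are hypothetical.

```python
def decode_html(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes by trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never raises.
    return raw.decode("latin-1")


def read_html(path, encodings=("utf-8", "cp1252")):
    """Read a file as bytes and decode it with decode_html."""
    with open(path, "rb") as fh:
        return decode_html(fh.read(), encodings)
```

This is a guess-and-fall-back strategy, not real detection: a file in another single-byte encoding would still decode without error but with the wrong characters. If the server sends a charset in its Content-Type header, that value is a better source.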
