I am trying to parse the href in the anchor tags from a piece of text. I tried the following code:

from flask import Flask, render_template
import requests
import re

app = Flask(__name__)


@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ = getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt = str(htmlcode)
    return render_template('productdisp.html',
                           prod=jsonobj, prd=prd, htmlcode=htmlcode)


if __name__ == '__main__':
    app.run(debug=True)

and htmlcodetxt contains the text

<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">Specification Sheet</a><br></strong><br></p><p><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">Photometric Data</a><br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>

1 Answer 1

One way would be to use the HTMLParser module (this is the Python 2 name; in Python 3 it is html.parser) to parse the href links from the htmlcodetxt string, like this:

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined, print it.
                if name == "href":
                    print name, "=", value

# Declare it and feed it your HTML content that you want parsed for the href tag.
parser = MyHTMLParser()
parser.feed(htmlcodetxt)
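
If you are on Python 3, or would rather get the links back as a list than have them printed, something along these lines might work. This is only a sketch: the class name HrefCollector and the shortened sample string are my own placeholders, not part of your code.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip a bare 'href' attribute that has no value.
            if name == "href" and value is not None:
                self.hrefs.append(value)


# Shortened stand-in for htmlcodetxt (assumption: same structure as your sample).
sample = (
    '<p><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">'
    'Specification Sheet</a></p>'
    '<p><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">'
    'Photometric Data</a></p>'
)

collector = HrefCollector()
collector.feed(sample)
collector.close()
print(collector.hrefs)
```

After feed() returns, collector.hrefs holds the URLs in document order, so you can loop over them or hand them to other code.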

I'm not sure how your app handler works, but perhaps you could try something like this?

from flask import Flask, render_template
import requests
import re
from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print name, "=", value


app = Flask(__name__)


@app.route('/')
def products():
    getprd = requests.get('API')
    jsonobj = getprd.text
    produ = getprd.json()
    prd = produ['items'][0]['id']
    htmlcode = produ['items'][0]['description']
    htmlcodetxt = str(htmlcode)

    # Prints each href to the console while the request is handled.
    parser = MyHTMLParser()
    parser.feed(htmlcodetxt)

    return render_template('productdisp.html',
                           prod=jsonobj, prd=prd, htmlcode=htmlcode)


if __name__ == '__main__':
    app.run(debug=True)
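
Note that the handler above only prints the links to the console. If you want them available in the template, one option (a rough Python 3 sketch, not tested against your API) is to collect each link's text together with its href and pass the result along as an extra template variable. LinkCollector, build_context, the "links" key and the item id 42 are all hypothetical names and values I made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (link text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def build_context(product):
    """Hypothetical helper: template variables for one item of the API response."""
    collector = LinkCollector()
    collector.feed(str(product["description"]))
    collector.close()
    return {
        "prd": product["id"],
        "htmlcode": product["description"],
        "links": collector.links,
    }


# Placeholder item shaped like produ['items'][0]; the id is invented.
item = {
    "id": 42,
    "description": (
        '<p><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" '
        'class="fakeButton">Specification Sheet</a><br></strong></p>'
    ),
}
context = build_context(item)
print(context["links"])
```

Inside products() you could then call something like render_template('productdisp.html', prod=jsonobj, **build_context(produ['items'][0])) and loop over links in the template, assuming your template is changed to use that variable.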

For example, without Flask, and using the HTML sample that you posted, the following works and prints the expected output.

#!/usr/bin/python

content = '<p style="text-align: center;"><strong>Part Number:</strong></p><div style="text-align: center;"><span style="font-size: 16px;">product code</span></div><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Lumens:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>6600-7200 LM</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>CCT:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;">5700K</span><br> </p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Input Voltage:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>100-277VAC, 50-60Hz</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong><strong>Certificates:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>UL, DLC</span><br></p><hr><p style="text-align: center;"><span style="font-size: 16px;"><strong>Warranty:</strong></span><br></p><p style="text-align: center;"><span style="font-size: 16px;"><strong></strong>5 Years <br></span></p><hr><p style="text-align: center;"><strong>DOWNLOADS:</strong><br></p><p style="text-align: center;"><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/saa.pdf?dl=1" class="fakeButton">Specification Sheet</a><br></strong><br></p><p><br></p><p style="text-align: center;"><strong><a href="https://dl.dropbox.com/s/ds.png?dl=1" class="fakeButton2">Photometric Data</a><br></strong></p><p style="text-align: center;"><br></p><p style="text-align: center;"><img src="https://ul_png"> <img src="https://300x295_png"> </p><p style="text-align: center;"><br></p>'


from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    print name, "=", value


parser = MyHTMLParser()
parser.feed(content)

Example output:

$ ./html_parse.py 
href = https://dl.dropbox.com/s/saa.pdf?dl=1
href = https://dl.dropbox.com/s/ds.png?dl=1
1 Comment

If you just want the actual value, and not the "href = " prefix, modify the print statement in the MyHTMLParser class. Change "print name, "=", value" to "print value".
