Python BeautifulSoup - extract text and attribute values

Question

I have some HTML:

<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
   <table class="no-borders" width="100%">
      <tbody>
         <tr>
            <td width="23%">MWF</td>
            <td width="55%">11:30 AM - 12:20 PM</td>
            <td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
         </tr>
      </tbody>
   </table>
</td>
<td class="course-section-sln">00000</td>

I'd like to extract the values of the top-level "class" attributes and map each one to a list of the text nested beneath it. For the above HTML, that would look something like:

data = {
    "course-section-type": ["lecture (5)"],
    "course-section-meeting": ["MWF", "11:30 AM - 12:20 PM", "MGH", "building room", "389"],
    "course-section-sln": ["00000"]
}

I know that I can get the text of each cell from the .text attribute of the tags returned by soup.findAll('td'), but I don't know how to traverse the HTML tree or how to extract the value of a tag attribute. How would I go about doing this?

Any help is appreciated.

Do all the top-level td tags have class values containing course-section-? And what should the final structure look like when you do this for many more source lines?
– QHarr, Commented Nov 18, 2018 at 5:50

Daniel Q · Accepted Answer · 2018-11-18 06:21:02Z

2

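Figured it out. It turns out that findAll accepts a keyword argument, findAll(text=True), that returns every text node under a given tag as a list, in document order (a depth-first walk of the tree). In current versions of BeautifulSoup the method is spelled find_all and the argument is named string; the older spellings are kept as aliases.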

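# `line` is the BeautifulSoup object (or Tag) for the chunk of HTML being parsed
d = {}
for tag in line.find_all('td'):
    classes = tag.get("class")  # list of class names, or None if the attribute is absent
    if classes and "course" in classes[0]:
        # strip each text node and drop the whitespace-only ones between tags
        d[classes[0]] = [text.strip() for text in tag.find_all(string=True) if text.strip()]

>>> d
{"course-section-type": ["lecture (5)"],
 "course-section-meeting": ["MWF", "11:30 AM - 12:20 PM", "MGH", "building room", "389"],
 "course-section-sln": ["00000"]}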

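For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It is an illustration under a few assumptions, not part of the original answer: the function name collect_text_by_class and its prefix parameter are hypothetical, the enclosing <table><tr> wrapper is added only to make the fragment well formed, and the standard-library "html.parser" backend is assumed. It uses the .stripped_strings generator, which yields each text node already stripped and skips the whitespace-only ones.

```python
from bs4 import BeautifulSoup

# Markup from the question; the outer <table><tr> wrapper is an assumption.
HTML = """
<table><tr>
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
   <table class="no-borders" width="100%">
      <tbody>
         <tr>
            <td width="23%">MWF</td>
            <td width="55%">11:30 AM - 12:20 PM</td>
            <td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
         </tr>
      </tbody>
   </table>
</td>
<td class="course-section-sln">00000</td>
</tr></table>
"""


def collect_text_by_class(html, prefix="course-section-"):
    """Map each matching <td> class name to the list of text found under it."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for td in soup.find_all("td"):
        # "class" is multi-valued, so .get returns a list (or None if absent).
        for name in td.get("class") or []:
            if name.startswith(prefix):
                # stripped_strings walks the subtree in document order.
                result[name] = list(td.stripped_strings)
                break
    return result


data = collect_text_by_class(HTML)
for key, values in data.items():
    print(key, values)
```

The inner <td width="..."> cells carry no class attribute, so they are skipped as keys, but their text still lands under the enclosing course-section-meeting cell.
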
edited Nov 18, 2018 at 6:21

answered Nov 18, 2018 at 6:13


陈海栋 · Answer · 2018-11-18 05:30:59Z

0

The solution is to extract everything according to a fixed pattern.

Because this is a table nested inside a table, the schema has to be fixed; otherwise everything breaks again the next time the markup changes.

course-section-type is the text of the first <td> in the outer table

course-section-meeting is all of the text in the inner table

course-section-sln is the text of the third <td> in the outer table
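
The positional pattern above could be sketched roughly as follows. This is only one reading of it, under stated assumptions: the function name parse_row_by_position is hypothetical, the three cells are assumed to be direct children of a single outer <tr> (the question does not show the enclosing row), and the standard-library "html.parser" backend is assumed.

```python
from bs4 import BeautifulSoup

# Markup from the question, wrapped in an assumed outer <table><tr>.
ROW_HTML = """
<table><tr>
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
   <table class="no-borders" width="100%">
      <tbody>
         <tr>
            <td width="23%">MWF</td>
            <td width="55%">11:30 AM - 12:20 PM</td>
            <td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
         </tr>
      </tbody>
   </table>
</td>
<td class="course-section-sln">00000</td>
</tr></table>
"""


def parse_row_by_position(row):
    """Read one outer <tr> using fixed cell positions instead of class names."""
    # recursive=False keeps only the row's direct <td> children,
    # so the cells of the nested table are not counted.
    cells = row.find_all("td", recursive=False)
    if len(cells) != 3:
        raise ValueError(f"expected 3 outer cells, found {len(cells)}")
    type_cell, meeting_cell, sln_cell = cells
    inner_table = meeting_cell.find("table")
    return {
        "course-section-type": [type_cell.get_text(strip=True)],
        "course-section-meeting": list(inner_table.stripped_strings) if inner_table else [],
        "course-section-sln": [sln_cell.get_text(strip=True)],
    }


soup = BeautifulSoup(ROW_HTML, "html.parser")
record = parse_row_by_position(soup.find("tr"))  # first <tr> is the outer row
print(record)
```

The length check makes the fixed schema fail loudly when the markup changes, rather than silently mapping text to the wrong key.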

answered Nov 18, 2018 at 5:30


2 Comments

Daniel Q Over a year ago

I appreciate the comment, but I was looking for a more generalized solution. The actual html for this is ~4000 lines--I can't use a hardcoded pattern for all of it.

陈海栋 Over a year ago

If you find a solution, do tell me. We were dealing with a wiki dump before, and our solution was nowhere close.
