1

I have some HTML:

<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
   <table class="no-borders" width="100%">
      <tbody>
         <tr>
            <td width="23%">MWF</td>
            <td width="55%">11:30 AM - 12:20 PM</td>
            <td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
         </tr>
      </tbody>
   </table>
</td>
<td class="course-section-sln">00000</td>    

I'd like to extract the value of each top-level td's "class" attribute and map it to a list of the text nested under that td. For the above HTML, that would look something like:

data = {
    "course-section-type": ["lecture (5)"],
    "course-section-meeting": ["MWF", "11:30 AM - 12:20 PM", "MGH", "building room", "389"],
    "course-section-sln": ["00000"]
}    

I know that I can get the text of a single tag with .text (for example, soup.find('td').text), but I don't know how to traverse the HTML tree or how to read the value of a tag attribute. How would I go about doing this?

Any help is appreciated.

1
  • Do all of the top-level td tags have a class value containing course-section-? And what should the final structure look like when you do this for many more source lines? Commented Nov 18, 2018 at 5:50

2 Answers

2

Figured it out. It turns out that BeautifulSoup's findAll accepts a keyword argument, findAll(text=True), that collects all the text nodes under a given tag, in document order, and returns them as a list.

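# `line` is the BeautifulSoup object (or tag) parsed from the HTML above.
d = {}
for tag in line.findAll('td'):
    classes = tag.get("class")  # list of class names, or None if absent
    if classes and "course" in classes[0]:
        # findAll(text=True) also returns whitespace-only nodes; strip and drop them.
        texts = (text.strip() for text in tag.findAll(text=True))
        d[classes[0]] = [text for text in texts if text]
>>> d
{"course-section-type": ["lecture (5)"],
 "course-section-meeting": ["MWF", "11:30 AM - 12:20 PM", "MGH", "building room", "389"],
 "course-section-sln": ["00000"]}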

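For anyone who wants to run the approach from this answer end to end, here is a minimal, self-contained sketch. It is a variation rather than the exact code above: it uses the stripped_strings generator, which yields each text node already stripped and skips whitespace-only ones. The function name map_classes_to_text and the prefix parameter are hypothetical names chosen for illustration, and the html.parser backend is an assumption (other parsers may treat a bare td fragment differently).

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question.
html = """
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
   <table class="no-borders" width="100%">
      <tbody>
         <tr>
            <td width="23%">MWF</td>
            <td width="55%">11:30 AM - 12:20 PM</td>
            <td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
         </tr>
      </tbody>
   </table>
</td>
<td class="course-section-sln">00000</td>
"""


def map_classes_to_text(soup, prefix="course-section-"):
    """Map the first class of each matching <td> to its nested text.

    `prefix` is an illustrative assumption: only <td> tags whose first
    class starts with it are treated as top-level cells.
    """
    result = {}
    for td in soup.find_all("td"):
        # "class" is a multi-valued attribute, so .get returns a list or None.
        classes = td.get("class") or []
        if classes and classes[0].startswith(prefix):
            # stripped_strings walks the text nodes in document order,
            # strips each one, and skips those that are only whitespace.
            result[classes[0]] = list(td.stripped_strings)
    return result


data = map_classes_to_text(BeautifulSoup(html, "html.parser"))
print(data)
```

The inner td tags carry no class attribute, so they are skipped as keys, but their text is still collected because it is nested under the course-section-meeting cell.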

0

The solution is to extract everything according to this fixed pattern.

Because it's a table within a table, the schema has to be fixed; otherwise, the next time the markup changes, everything breaks again.

course-section-type is the text of the first <td> of the outer table

course-section-meeting is all of the text in the inner table

course-section-sln is the text of the third <td> of the outer table
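The fixed pattern described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the three outer cells are assumed to sit directly inside a single tr (the question's fragment is wrapped in one here), the function name extract_by_position is hypothetical, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Assumption: the question's fragment wrapped in an outer table row.
html = """
<table><tbody><tr>
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
   <table class="no-borders" width="100%">
      <tbody>
         <tr>
            <td width="23%">MWF</td>
            <td width="55%">11:30 AM - 12:20 PM</td>
            <td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
         </tr>
      </tbody>
   </table>
</td>
<td class="course-section-sln">00000</td>
</tr></tbody></table>
"""


def extract_by_position(row):
    """Read the three outer cells of `row` by position, not by class name."""
    # recursive=False limits the search to direct children, so the <td>
    # tags of the inner table are not picked up here.
    outer_cells = row.find_all("td", recursive=False)
    if len(outer_cells) != 3:
        raise ValueError(f"expected 3 outer cells, found {len(outer_cells)}")
    type_cell, meeting_cell, sln_cell = outer_cells
    inner_table = meeting_cell.find("table")
    return {
        "course-section-type": [type_cell.get_text(strip=True)],
        "course-section-meeting": list(inner_table.stripped_strings),
        "course-section-sln": [sln_cell.get_text(strip=True)],
    }


soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")  # the first <tr> in document order is the outer row
data = extract_by_position(row)
print(data)
```

The length check makes the function fail loudly when the layout changes, instead of silently returning the wrong cells.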

2 Comments

I appreciate the comment, but I was looking for a more generalized solution. The actual HTML for this is ~4000 lines, so I can't use a hardcoded pattern for all of it.
If you find a solution, do tell me. We were dealing with a wiki dump before, and the solution was nowhere close.
