I have an HTML file like this in a subdirectory named the_files:

<div class='log'>start</div>
<div class='ts'>2017-03-14 09:17:52.859 +0800&nbsp;</div><div class='log'>bla bla bla</div>
<div class='ts'>2017-03-14 09:17:55.619 +0800&nbsp;</div><div class='log'>aba aba aba</div>
...
...

I want to extract the text inside each tag and print it to the terminal like this:

2017-03-14 09:17:52.859 +0800 , bla bla bla
2017-03-14 09:17:55.619 +0800 , aba aba aba
...
...

I want to ignore the first line, <div class='log'>start</div>.

My code so far:

import os

from bs4 import BeautifulSoup

path = "the_files/"
def do_task_html():
    dir_path = os.listdir(path)
    for file in dir_path:
        if file.endswith(".html"):
            soup = BeautifulSoup(open(path+file))
            item1 = [element.text for element in soup.find_all("div", "ts")]
            string1 = ''.join(item1)
            item2 = [element.text for element in soup.find_all("div", "log")]
            string2 = ''.join(item2)
            print string1 + "," + string2

This code produces the following result:

2017-03-14 09:17:52.859 +0800 2017-03-14 09:17:55.619 +0800 , start bla bla bla  aba aba aba ... ...

Is there a way to fix this?

Thank you for your help.

1 Answer

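Find each div with class ts, then print its text together with the text of its next_sibling, which is the log div that immediately follows it. The start line is skipped automatically because no ts div precedes it.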

for div in soup.find_all("div", class_="ts"):
    # Format first, then print, so this works on both Python 2 and Python 3.
    print("%s, %s" % (div.get_text(strip=True), div.next_sibling.get_text(strip=True)))

Outputs:

2017-03-14 09:17:52.859 +0800, bla bla bla
2017-03-14 09:17:55.619 +0800, aba aba aba
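
For anyone who wants to try the pairing idea without the files on disk, here is a minimal self-contained sketch. It assumes the markup from the question is held in a string, and it uses find_next_sibling with a class filter instead of next_sibling, so whitespace between the two divs would not break it. The names SAMPLE_HTML and extract_lines are made up for illustration, the "html.parser" argument is my choice of parser, and the " , " separator follows the format requested in the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed to be representative).
SAMPLE_HTML = (
    "<div class='log'>start</div>\n"
    "<div class='ts'>2017-03-14 09:17:52.859 +0800&nbsp;</div>"
    "<div class='log'>bla bla bla</div>\n"
    "<div class='ts'>2017-03-14 09:17:55.619 +0800&nbsp;</div>"
    "<div class='log'>aba aba aba</div>\n"
)


def extract_lines(html):
    """Return one 'timestamp , message' string per ts/log pair."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for ts in soup.find_all("div", class_="ts"):
        # Look for the next sibling div with class "log"; text nodes are skipped.
        log = ts.find_next_sibling("div", class_="log")
        if log is None:
            continue  # a timestamp with no log div after it
        lines.append("%s , %s" % (ts.get_text(strip=True), log.get_text(strip=True)))
    return lines


if __name__ == "__main__":
    for line in extract_lines(SAMPLE_HTML):
        print(line)
```

To use it with the directory loop from the question, read each .html file and pass its contents to extract_lines. Note that find_next_sibling returns the first matching sibling even if it is not adjacent, so a ts div without its own log div would be paired with a later one.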
1 Comment

Thank you for your fast response and great answer. It worked!