22

I've written a simple script to parse XML chat logs using the BeautifulSoup module. The standard soup.prettify() works ok except chat logs have a lot of fluff in them. You can see both the script code and some of the XML input file I'm working with below:

Code

import sys
from BeautifulSoup import BeautifulSoup as Soup

def parseLog(file):
    file = sys.argv[1]
    handler = open(file).read()
    soup = Soup(handler)
    print soup.prettify()

if __name__ == "__main__":
    parseLog(sys.argv[1])

Test XML Input

<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='MessageLog.xsl'?>
<Log FirstSessionID="1" LastSessionID="2"><Message Date="10/31/2010" Time="3:43:48 PM"     DateTime="2010-10-31T20:43:48.937Z" SessionID="1"><From><User FriendlyName="Jon"/></From>    <To><User FriendlyName="Bill"/></To><Text Style="font-family:Segoe UI; color:#000000; ">hey, what's up?</Text></Message>
<Message Date="10/31/2010" Time="3:44:03 PM" DateTime="2010-10-15T20:44:03.421Z" SessionID="1"><From><User FriendlyName="Jon"/></From><To><User FriendlyName="Bill"/></To><Text Style="font-family:Segoe UI; color:#000000; ">Got your message</Text></Message> 
<Message Date="10/31/2010" Time="3:44:31 PM" DateTime="2010-10-15T20:44:31.390Z" SessionID="2"><From><User FriendlyName="Bill"/></From><To><User FriendlyName="Jon"/></To><Text Style="font-family:Segoe UI; color:#000000; ">oh, great</Text></Message>
<Message Date="10/31/2010" Time="3:44:59 PM" DateTime="2010-10-15T20:44:59.281Z" SessionID="2"><From><User FriendlyName="Bill"/></From><To><User FriendlyName="Jon"/></To><Text Style="font-family:Segoe UI; color:#000000; ">hey, i gotta run</Text></Message>

I'm wanting to be able to output this into a format like the following or at least something that is more readable than pure XML:

Jon: Hey, what's up? [10/31/10 @ 3:43p]

Jon: Got your message [10/31/10 @ 3:44p]

Bill: oh, great [10/31/10 @ 3:44p]

etc.. I've heard some decent things about the PyParsing module, maybe it's time to give it a shot.

2
  • 1
    Why not XSLT ? That would be the easiest. ( In fact: I see there's an ?xml-stylesheet directive -- what does the default stylesheet look like ? ) Commented Nov 1, 2010 at 18:06
  • I may not always have the XSL stylesheet available, thus the need for something to format the log into something a bit more readable. If I can use the same stylesheet as one that I do have, that might also work. Commented Nov 1, 2010 at 18:11

2 Answers 2

37

BeautifulSoup makes getting at attributes and values in xml really simple. I tweaked your example function to use these features.

import sys
from BeautifulSoup import BeautifulSoup as Soup

def parseLog(file):
    file = sys.argv[1]
    handler = open(file).read()
    soup = Soup(handler)
    for message in soup.findAll('message'):
        msg_attrs = dict(message.attrs)
        f_user = message.find('from').user
        f_user_dict = dict(f_user.attrs)
        print "%s: %s [%s @ %s]" % (f_user_dict[u'friendlyname'],
                                    message.find('text').decodeContents(),
                                    msg_attrs[u'date'],
                                    msg_attrs[u'time'])


if __name__ == "__main__":
    parseLog(sys.argv[1])
Sign up to request clarification or add additional context in comments.

3 Comments

This works perfectly. What exactly is contained in the dictionary from f_user_dict = dict(f_user.attrs) I assume attributes, i'll have to toy with that piece and see exactly what's there. Thanks again!
Yup the for all elements, el.attrs will contain a list of tuples of the xml tags attributes. Calling dict on any tuple will make it into a dictionary.
Oh just wanted to clarify, calling dict on a list of tuples will return a dictionary, not a single tuple: dict([('hello', 'goodbye'), ('foo', 'bar')])
8

I'd recommend using the builtin ElementTree module. BeautifulSoup is meant to handle unwell-formed code like hacked up HTML, whereas XML is well-formed and meant to be read by an XML library.

Update: some of my recent reading here suggests lxml as a library built on and enhancing the standard ElementTree.

2 Comments

I use Beautiful Soup for parsing XML. From the docs: "Beautiful Soup is a Python library for pulling data out of HTML and XML files." Beautiful Soup will use whichever parser you tell it to, including lxml. (see crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser)
And I have seen plenty of string generated malformed XML lately.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.