3

I've got a log file on external computer in my LAN network. Log is an XML file. File is not accessible from http, and is updating every second. Currently i'm copying log file into my computer and run parser, but I want to parse file directly from external host.

How can I do it in Python? Is it possible, to parse whole file once, and later parse only new content added to the end in future versions?

2
  • "Changing" is too vague. If the only possible changes are append operations, as opposed to content changes within parts of the file that were already written, that makes this feasible to implement in an efficient way -- but just saying that it "changes" and that you want "differences" are too general, as that allows non-append operations, and an incremental parser that can handle in-place edits is an extremely complex undertaking. (Most of the folks doing research in that area are building IDEs; it would be feasible to point to their work, but I'd expect adopting it to be a huge effort). Commented Mar 9, 2015 at 16:59
  • Do you have access to the host for you to run your own program? What protocol are you using to copy the log file onto your computer? What OS is the host and your computer running? Commented Mar 9, 2015 at 17:03

2 Answers 2

4
+25

You can use paramiko and xml.sax's default parser, xml.sax.expatreader, which implements xml.sax.xmlreader.IncrementalParser.

I ran the following script on local virtual machine to produce XML.

#!/bin/bash

echo "<root>" > data.xml
I=0
while sleep 2; do 
  echo "<entry><a>value $I</a><b foo='bar' /></entry>" >> data.xml; 
  I=$((I + 1)); 
done

Here's incremental consumer.

#!/usr/bin/env python
# -*- coding: utf-8 -*-


import time
import xml.sax
from contextlib import closing

import paramiko.client


class StreamHandler(xml.sax.handler.ContentHandler):

  lastEntry = None
  lastName  = None


  def startElement(self, name, attrs):
    self.lastName = name
    if name == 'entry':
      self.lastEntry = {}
    elif name != 'root':
      self.lastEntry[name] = {'attrs': attrs, 'content': ''}

  def endElement(self, name):
    if name == 'entry':
      print({
        'a' : self.lastEntry['a']['content'],
        'b' : self.lastEntry['b']['attrs'].getValue('foo')
      }) 
      self.lastEntry = None

  def characters(self, content):
    if self.lastEntry:
      self.lastEntry[self.lastName]['content'] += content


if __name__ == '__main__':
  # use default ``xml.sax.expatreader``
  parser = xml.sax.make_parser()
  parser.setContentHandler(StreamHandler())

  client = paramiko.client.SSHClient()
  # or use ``client.load_system_host_keys()`` if appropriate
  client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
  client.connect('192.168.122.40', username = 'root', password = 'pass')
  with closing(client) as ssh:
    with closing(ssh.open_sftp()) as sftp:
      with closing(sftp.open('/root/data.xml')) as f:
        while True:
          buffer = f.read(4096)
          if buffer:
            parser.feed(buffer)
          else:
            time.sleep(2)
Sign up to request clarification or add additional context in comments.

2 Comments

It's worth noting that this only addresses appends, as opposed to in-place edits. The OP's question refers to logging -- which applies to appends -- but also asks about "differences", which is a phrase generally used in the context of in-place edits, so there's probably some clarification called for before one can tell with certainty whether this answer is on-point.
@CharlesDuffy Take a look at the first comment to the first answer. The OP says XML is updated by adding lines.
1

I am assuming that another process of which you don't have access to is maintaining the xml as an object being updated every so often, and then dumping the result.

If you don't have access to the source of the program dumping the XML, you will need a fancy diffing between the two XML versions to get an incremental update to send over the network.

And I think you would have to parse the new XML each time to be able to have that diff.

So maybe you could have a python process watching the file, parsing the new version, diffing it (for instance using solutions from this article), and then you can send that difference over the network using a tool like xmlrpc. If you want to save bandwidth it'll probably help. Although I think I would send directly the raw diff via network, patch and parse the file in the local machine.

However, if only some of your XML values are changing (no node deletion or insertion) there may be a faster solution. Or, if the only operation on your xml file is to append new trees, then you should be able to parse only these new trees and send them over (diff first, then parse in the server, send to client, merge in client).

5 Comments

External computer is out of my reach. I can only log in by ssh, and download file, i can't run any application on it. XML is updated by adding lines. I don't even know when file is changed, i have to check every 0,6 second. Network bandwidth is not a problem - it is in my LAN network.
How about having one process that computes the diff of your local and remote file and drop that diff as a temp file every 10 second. Then your python process could be watching that temp file, parse it when it arrives and add the nodes to the current XML.
@Lilley, using knowledge of the location of text changes to inform an incrimental parser is feasible, but it's also very, very nontrivial -- requires a parser written for the job, and the only ones I know of are written in Java or LISP, not Python.
When you say "log in by ssh" do you mean you can open a shell on the external computer or something else? If you can log in to a shell you can run a program.
Yeah, but I've got only rights to read directory, not to execute scripts.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.