I need to access the internal binary representation of a loaded XML DOM. There are some dump functions, but I do not see anything like a "binary buffer" (there are only "XML buffers").

My ultimate goal is to compare the same document byte by byte, before and after some black-box procedure, directly through its binary representations (current and cached), without converting to an XML text representation. So, the question:

Is there a binary representation (the in-memory structures) in libxml2 that I can use to compare a dump with the current representation?

I only need to check whether the current and the dumped DOMs are equivalent.


Details

This is not the problem of comparing two distinct DOM objects but something easier, because IDs etc. do not change. I do not need a canonical representation (!), only access to the internal representation, because that is much faster than converting to text.

Between "before" and "after" there is a black-box procedure, e.g. an XSLT identity transform that may or may not affect some nodes or attributes.

Alternative solutions...

  1. ... Write a C function for libxml2 that compares the two trees node by node and returns false if they differ: during the tree traversal, as soon as the tree structure or some nodeValue differs, the algorithm stops the comparison (returning false).

  2. ... Not ideal, but it helps other algorithms: access (in libxml2) the total number of nodes, or the total length or size, or an MD5 or SHA-1 digest... only to optimize the frequent cases (in my application) where the comparison returns false, avoiding the complete comparison procedure.
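
The two alternatives above can be sketched together. This is a minimal sketch rather than libxml2 code: `Node` is a hypothetical stand-in that mirrors only the `type`, `name`, `content`, `children` and `next` fields of libxml2's `xmlNode`, so that the example compiles without the library. `count_nodes` plays the role of the cheap pre-check from alternative 2 and `trees_equal` is the early-exit traversal from alternative 1; attributes and namespaces are left out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for libxml2's xmlNode: only the fields the
 * comparison needs. In libxml2 the strings would be xmlChar pointers. */
typedef struct Node {
    int          type;      /* element, text, ... (cf. xmlElementType) */
    const char  *name;      /* element name, may be NULL */
    const char  *content;   /* text content, may be NULL */
    struct Node *children;  /* first child */
    struct Node *next;      /* next sibling */
} Node;

/* NULL-safe string equality: two NULLs are equal, NULL never equals non-NULL. */
static bool str_eq(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return a == b;
    return strcmp(a, b) == 0;
}

/* Alternative 2: number of nodes in a sibling list, descendants included. */
size_t count_nodes(const Node *n)
{
    size_t total = 0;
    for (; n != NULL; n = n->next)
        total += 1 + count_nodes(n->children);
    return total;
}

/* Alternative 1: walk both sibling lists in parallel and stop at the
 * first difference in type, name, content or subtree shape. */
bool trees_equal(const Node *a, const Node *b)
{
    for (; a != NULL && b != NULL; a = a->next, b = b->next) {
        if (a->type != b->type ||
            !str_eq(a->name, b->name) ||
            !str_eq(a->content, b->content) ||
            !trees_equal(a->children, b->children))
            return false;
    }
    /* Equal only if both lists ended at the same time. */
    return a == NULL && b == NULL;
}

/* Cheap count first; full traversal only when the counts agree. */
bool docs_equivalent(const Node *a, const Node *b)
{
    return count_nodes(a) == count_nodes(b) && trees_equal(a, b);
}
```

Counting nodes is itself a full traversal, so the pre-check only saves time when the count (or a digest) of the cached copy is stored once and reused across comparisons.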


NOTES

Related questions

Warning for readers applying the answered solutions

The problem is about comparing a document before and after a black-box operation, but there are two kinds of black boxes here:

  • Well-known and controllable ones, such as XSLT transforms or the use of a known library. You must know that your black box will not change attribute order or ID content, denormalize spaces, etc.
  • Fully free ones, such as an external editor (e.g. an online editor changing an XHTML document), where the user and the software can do anything.

I will use a solution in the context of a "well-known" black box, so my comments in the "Details" section above are valid.

In the context of "fully free" black boxes you cannot use a comparison of binary dumps, because only a canonical representation (C14N) is valid for comparison. To compare by the C14N criteria, only the "Alternative solutions" (described above) are possible. For alternative 1 you must, among other things, sort each set of attribute nodes before comparing. For alternative 2 (also discussed here), you must generate the C14N dumps.
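
The "sort before compare" step for attribute nodes can be sketched as follows. This is an illustration under stated assumptions, not libxml2 code: `Attr` is a hypothetical name/value pair (libxml2 keeps attributes in a linked list of `xmlAttr`), names are assumed non-NULL and unique within an element, and namespaces are ignored, whereas real C14N also takes the namespace URI into account when ordering.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one attribute of an element. */
typedef struct Attr {
    const char *name;
    const char *value;
} Attr;

/* qsort callback: order attributes by name. */
static int attr_cmp(const void *pa, const void *pb)
{
    const Attr *a = pa, *b = pb;
    return strcmp(a->name, b->name);
}

/* Returns true if both arrays hold the same name/value pairs in any order.
 * Private copies are sorted, so the caller's arrays are left untouched.
 * An allocation failure is reported as "not equal". */
bool attrs_equal_unordered(const Attr *a, size_t na, const Attr *b, size_t nb)
{
    if (na != nb)
        return false;
    if (na == 0)
        return true;

    Attr *sa = malloc(na * sizeof *sa);
    Attr *sb = malloc(nb * sizeof *sb);
    bool equal = (sa != NULL && sb != NULL);

    if (equal) {
        memcpy(sa, a, na * sizeof *sa);
        memcpy(sb, b, nb * sizeof *sb);
        qsort(sa, na, sizeof *sa, attr_cmp);
        qsort(sb, nb, sizeof *sb, attr_cmp);
        /* After sorting, equal sets must match position by position. */
        for (size_t i = 0; i < na && equal; i++)
            equal = strcmp(sa[i].name, sb[i].name) == 0 &&
                    strcmp(sa[i].value, sb[i].value) == 0;
    }
    free(sa);
    free(sb);
    return equal;
}
```

With this helper, `id="1" class="c"` and `class="c" id="1"` compare as equal, which is the behaviour wanted for "fully free" black boxes and unnecessary for "well-known" ones.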


PS: of course, using the C14N criteria is subjective and depends on the application: if, for example, a change in attribute order is a valid/important change for your application, then a comparison that detects it is valid (!).

  • You do know that XML is a text format, so the binary representation would just be a sequence of characters in whatever encoding the XML is in. Commented Jul 24, 2014 at 17:42
  • @JoachimPileborg, yes, the nodes are "text nodes", but is there no (binary) tree representation? I think there is (!)... I do not see there (some graphical documentation?) the name of the C data structure for this main tree, which is distinct from an "XML text dump". Commented Jul 24, 2014 at 17:44
  • Yes, there is a binary representation, and if the XML is UTF-8 encoded the binary representation is a sequence of UTF-8 code units. What would the binary representation of e.g. <xml><something attr="5">some data</something></xml> be if not the binary representation of the characters making up the XML, in other words the characters themselves? Commented Jul 24, 2014 at 17:50
  • Or do you mean the actual in-memory structures that libxml2 uses? Commented Jul 24, 2014 at 17:51
  • You can't do a byte-by-byte comparison of the internal representation of two DOM documents to see if they're the same. Two DOM documents created from the exact same XML document will compare as different because various bits of data used in the internal representation (like pointers) are specific to the particular DOM document instance. Commented Jul 24, 2014 at 18:00
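
The point made in the last comment can be reproduced in a few lines. This is a minimal sketch with a hypothetical `MiniNode` struct, not the real `xmlNode` (which likewise stores pointers to names, children and siblings): two nodes built from the same input hold equal content but different pointer values, so a raw `memcmp` of the structs reports a difference while a comparison that follows the pointers does not.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature node that owns a heap copy of its name. */
typedef struct MiniNode {
    char            *name;
    struct MiniNode *next;
} MiniNode;

/* Allocates a node and a private copy of the name; NULL on failure. */
MiniNode *mini_new(const char *name)
{
    MiniNode *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    n->name = malloc(strlen(name) + 1);
    if (n->name == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->name, name);
    return n;
}

void mini_free(MiniNode *n)
{
    if (n != NULL) {
        free(n->name);
        free(n);
    }
}

/* Raw comparison of the struct bytes: compares the pointer values. */
bool same_bytes(const MiniNode *a, const MiniNode *b)
{
    return memcmp(a, b, sizeof *a) == 0;
}

/* Logical comparison: follows the pointers and compares the strings. */
bool same_content(const MiniNode *a, const MiniNode *b)
{
    return strcmp(a->name, b->name) == 0;
}
```

For two nodes created with `mini_new("something")`, `same_content` returns true and `same_bytes` returns false, because each node points at its own copy of the name.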

1 Answer

Here are the relevant libxml2 functions:

There is a base64 encoding function:

Function: xmlTextWriterWriteBase64

int xmlTextWriterWriteBase64    (xmlTextWriterPtr writer, 
                     const char * data, 
                     int start, 
                     int len)

Write a base64 encoded xml text.
writer: the xmlTextWriterPtr
data:   binary data
start:  the position within the data of the first byte to encode
len:    the number of bytes to encode
Returns:    the bytes written (may be 0 because of buffering) or -1 in case of error

and a BinHex encoding function:

Function: xmlTextWriterWriteBinHex

int xmlTextWriterWriteBinHex    (xmlTextWriterPtr writer, 
                     const char * data, 
                     int start, 
                     int len)

Write a BinHex encoded xml text.
writer: the xmlTextWriterPtr
data:   binary data
start:  the position within the data of the first byte to encode
len:    the number of bytes to encode
Returns:    the bytes written (may be 0 because of buffering) or -1 in case of error
