I am using R's XML package to extract all of the text from a wide variety of HTML and XML files. These files are mostly documentation, build properties, or readme files. Here is a sample XML file:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE chapter PUBLIC '-//OASIS//DTD DocBook XML V4.1.2//EN'
'http://www.oasis-open.org/docbook/xml/4.0 docbookx.dtd'>
<chapter lang="en">
<chapterinfo>
<author>
<firstname>Jirka</firstname>
<surname>Kosek</surname>
</author>
<copyright>
<year>2001</year>
<holder>Jiří Kosek</holder>
</copyright>
<releaseinfo>$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp $</releaseinfo>
</chapterinfo>
<title>Using XSL stylesheets to generate HTML Help</title>
<?dbhtml filename="htmlhelp.html"?>
<para>HTML Help (HH) is help-format used in newer versions of MS
Windows and applications written for this platform. This format allows
to pack several HTML files together with images, table of contents and
index into single file. Windows contains browser for this file-format
and full-text search is also supported on HH files. If you want know
more about HH and its capabilities look at <ulink
url="http://msdn.microsoft.com/library/tools/htmlhelp/chm/HH1Start.htm">HTML
Help pages</ulink>.</para>
<section>
<title>How to generate first HTML Help file from DocBook sources</title>
<para>Working with HH stylesheets is same as with other XSL DocBook
stylesheets. Simply run your favorite XSLT processor on your document
with stylesheet suited for HH:</para>
</section>
</chapter>
My goal is simply to call xmlValue on the root node after parsing the tree with htmlTreeParse or xmlTreeParse, using something like this (for XML files):
Text = xmlValue(xmlRoot(xmlTreeParse(XMLFileName)))
However, there is one problem when I do this, for both XML and HTML files: if a node has child nodes nested two or more levels deep, their text fields get pasted together without any space between them.
For example, for the chapterinfo node in the sample above,
xmlValue(chapterInfo) is
JirkaKosek2001JiKosek$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp
The xmlValue of every child node is (recursively) pasted together without a space in between. How can I get xmlValue to add whitespace between the pieces while extracting this data?
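One possible workaround, sketched here rather than offered as the package's intended approach: parse into internal nodes with xmlParse, select every text node with the XPath expression //text(), and join the pieces with a space. The helper name extract_text and the inline sample string are assumptions made for illustration; they are not part of the XML package or of my actual files.

```r
library(XML)

# Hypothetical helper: collect each text node separately, then join the
# non-empty pieces with a single space.
extract_text <- function(doc) {
  pieces <- as.character(unlist(xpathSApply(doc, "//text()", xmlValue)))
  pieces <- trimws(pieces)                # strip indentation and newlines
  paste(pieces[nzchar(pieces)], collapse = " ")
}

# Small inline sample standing in for a real file (an assumption for this
# sketch); with a file on disk, pass the file name and drop asText.
sample_xml <- paste0(
  "<chapterinfo><author><firstname>Jirka</firstname>",
  "<surname>Kosek</surname></author>",
  "<copyright><year>2001</year></copyright></chapterinfo>"
)

doc <- xmlParse(sample_xml, asText = TRUE)
result <- extract_text(doc)
print(result)
```

For this sample the result should be "Jirka Kosek 2001" instead of "JirkaKosek2001". The same helper should work for HTML if the document is parsed with htmlParse, since XPath queries need internal nodes rather than the R-level tree that xmlTreeParse returns by default.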
Thanks a lot for your help in advance,
Shivani