
I have the following two minimal XML files:

history1.xml

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
    <page>
        <title>AccessibleComputing</title>
    </page>
    <page>
        <title>History</title>
    </page>
</mediawiki>

history2.xml

<mediawiki>
    <page>
        <title>AccessibleComputing</title>
    </page>
    <page>
        <title>History</title>
    </page>
</mediawiki>

Note that the only difference is all the attributes in the "mediawiki" node. I'm trying to get all page titles with R. Now I type

library("XML")

doc = xmlParse('history1.xml',useInternalNodes=TRUE)

titles<-xpathSApply(doc,'//page/title',xmlValue)

and get an empty list as output

list()

If I use the second XML file instead:

library("XML")

doc = xmlParse('history2.xml',useInternalNodes=TRUE)

titles<-xpathSApply(doc,'//page/title',xmlValue)

I get what I want, namely

[1] "AccessibleComputing" "History"

The problem is that I am downloading these lists from Wikipedia and I can't always delete these attributes by hand. So my questions are:

1) Why is the second file working while the first is not?

2) Is there a way to fix this?

3) If the answer is no: can I automate deleting the attributes in R?

Any help is much appreciated!

  • One of the attributes defines a namespace for all the elements in your document. In general, you shouldn't just delete it as it's crucial in telling the difference between elements from different schemas that share the same name. I'm not very familiar with R but this looks like it might be of use: rss.acs.unt.edu/Rdoc/library/XML/html/getNodeSet.html Commented Aug 18, 2013 at 10:57
  • You need to register the MediaWiki namespace and reference your chosen prefix in your XPath expression. See stackoverflow.com/questions/3876571/… . Something like titles<-xpathSApply(doc,'//mw:page/mw:title',xmlValue, ns= c(mw = "http://www.mediawiki.org/xml/export-0.8/")) (note: I do not know R, so the syntax with xmlValue and ns is not tested, but you get the idea; you may need to use "namespaces" instead of "ns") Commented Aug 18, 2013 at 11:00
  • @pault. Thank you so much! That did the trick. You just have to replace "ns" by "namespaces". Care to post this as an answer? Then I could accept it. Commented Aug 18, 2013 at 11:09
  • This shorter version also works: xpathSApply(doc, '//x:page/x:title', xmlValue, namespaces = "x") Commented Aug 18, 2013 at 12:55
  • xpathSApply(doc, '//*[local-name() = "title"]', xmlValue) would also work here. Commented Aug 18, 2013 at 14:01
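
The local-name() suggestion in the last comment can be tried without any files on disk. The following is a minimal sketch, assuming the XML package is installed; the inline document is a shortened copy of history1.xml (schema attributes dropped), and the variable names xml_text and page_titles are only illustrative.

```r
library(XML)

# Shortened inline copy of the namespaced file from the question.
xml_text <- '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8">
    <page>
        <title>AccessibleComputing</title>
    </page>
    <page>
        <title>History</title>
    </page>
</mediawiki>'

doc <- xmlParse(xml_text, asText = TRUE)

# local-name() compares only the part of the element name after any prefix,
# so the match succeeds whatever namespace the element belongs to.
page_titles <- xpathSApply(
  doc,
  '//*[local-name() = "page"]/*[local-name() = "title"]',
  xmlValue
)
print(page_titles)
```

The trade-off is that this ignores namespaces entirely, so it would also pick up a title element from any other schema that happened to appear inside a page element.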

1 Answer

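You need to register the MediaWiki namespace and reference your chosen prefix in your XPath expression. The xmlns attribute on the root element of history1.xml places every element in the MediaWiki namespace, and an unprefixed name in an XPath 1.0 expression only matches elements that are in no namespace, which is why '//page/title' finds nothing in the first file but works on the second. See this other SO question.

Something like: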

titles <- xpathSApply(doc, '//mw:page/mw:title', xmlValue,
    namespaces= c(mw = "http://www.mediawiki.org/xml/export-0.8/"))

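This also works (as far as I can tell, a single unnamed prefix is bound to the default namespace declared on the document's root node):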

titles <- xpathSApply(doc, '//x:page/x:title', xmlValue, namespaces= "x")
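
To see the difference side by side, here is a small self-contained sketch, assuming the XML package is installed. It parses an inline copy of history1.xml (with the schema attributes dropped) rather than a file; the names xml_text, ns, and no_ns are illustrative, and the prefix "mw" is an arbitrary choice that only has to match between the XPath expression and the namespaces argument.

```r
library(XML)

# Inline copy of the namespaced document from the question.
xml_text <- '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="en">
    <page>
        <title>AccessibleComputing</title>
    </page>
    <page>
        <title>History</title>
    </page>
</mediawiki>'

doc <- xmlParse(xml_text, asText = TRUE)

# Unprefixed names match only elements in no namespace, so nothing is found.
no_ns <- xpathSApply(doc, "//page/title", xmlValue)

# Bind a prefix to the namespace URI and use that prefix in the expression.
ns <- c(mw = "http://www.mediawiki.org/xml/export-0.8/")
titles <- xpathSApply(doc, "//mw:page/mw:title", xmlValue, namespaces = ns)
print(titles)
```

Because the URI is spelled out, this form keeps working only as long as the export schema version in the downloaded files matches the registered URI.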