Read XML characters as strings into ElementTree

Question

I'm using ElementTree to compare a CSV file to an XML document. The script should update the tags if the tag matches the first cell in the CSV. The tag needs to have a non-breaking space to prevent the text from wrapping when I import the XML into a different program (InDesign).

XML Input:

<Table_title>fatal crashes by&#160;time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>

CSV input:

'fatal crashes by&#160;time of day', data1, data2, data3

However, when I read the XML into the ElementTree script using ET.parse('file.xml'), it seems to render the character as a non-breaking space:

<Table_title>fatal crashes by time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>

Which is exactly what it should do (I think). But in this scenario, I actually want &#160; to be kept as a literal string, so that it matches the first cell of the CSV (because when the CSV is read in, the reference is interpreted as a plain string: 'fatal crashes by&#160;time of day').

Is there a way to:

Force the XML script to read the non-breaking space as a string instead of an escaped character: <Table_title>fatal crashes by&#160;time of day</Table_title>

or

Force the XML script to read the CSV and render the character as an escaped character instead of a string: 'fatal crashes by time of day', data1, data2, data3

Tomalak · Accepted Answer · 2016-07-28 15:59:57Z


Here is what happens.

You read this XML into ElementTree:

<Table_title>fatal crashes by&#160;time of day</Table_title>

ElementTree parses it and turns it into this DOM:

element node, name Table_title
- text node, string value: "fatal crashes by・time of day" (where ・ stands for the character with code 160, i.e. the non-breaking space)

This is 100% correct and you can't (and should not want to) do anything about it.
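To see this decoding directly, here is a minimal sketch that parses the snippet from the question and inspects the resulting text. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

snippet = '<Table_title>fatal crashes by&#160;time of day</Table_title>'
elem = ET.fromstring(snippet)

text = elem.text

# The parser has already replaced the character reference with the
# real character U+00A0 (code 160); the literal "&#160;" is gone.
has_nbsp = '\xa0' in text
has_literal_reference = '&#160;' in text

print(has_nbsp, has_literal_reference)
print(ord(text[16]))  # code of the character between "by" and "time"
```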

Your CSV also appears to contain a snippet of XML in its first column. However, it remains un-parsed until you parse it.

If you want to be able to compare the text values, you have no choice but to XML-parse the first column.

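import csv
import xml.etree.ElementTree as ET

# Parse the XML document ('file.csv' is a placeholder name)
tree = ET.parse('file.xml')

with open('file.csv', newline='') as f:
    for row in csv.reader(f):
        # Wrap the cell in a dummy element so that character references
        # such as &#160; are decoded the same way as in the XML file.
        # A cell containing a bare '&' or '<' would raise ET.ParseError.
        temp = ET.fromstring('<temp>' + row[0] + '</temp>')
        print(temp.text)

        # compare temp.text to your XML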
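The comparison step could look roughly like the following sketch. It uses in-memory strings in place of the real files, and the `<root>` wrapper, the `decode_cell` helper and the sample data are assumptions made for illustration, not part of the original question.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory stand-ins for the XML and CSV files.
XML_DATA = (
    '<root>'
    '<Table_title>fatal crashes by&#160;time of day</Table_title>'
    '<cell>data1</cell><cell>data2</cell><cell>data3</cell>'
    '</root>'
)
CSV_DATA = 'fatal crashes by&#160;time of day,data1,data2,data3\n'


def decode_cell(cell):
    """Decode XML character references in a CSV cell by parsing it as XML."""
    return ET.fromstring('<temp>' + cell + '</temp>').text or ''


root = ET.fromstring(XML_DATA)
title = root.find('Table_title')

# Keep only the CSV rows whose decoded first cell equals the title text.
matches = []
for row in csv.reader(io.StringIO(CSV_DATA)):
    if row and decode_cell(row[0]) == title.text:
        matches.append(row)

print(matches)
```

Both sides are compared after decoding, so the non-breaking space is a single character (code 160) in each string.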



2 Comments

ale19 Over a year ago

This works great for hyphens and dashes. But I also want to include whitespace and non-breaking spaces. But ElementTree just reads them, interprets them as a space, and writes them out as a space instead of as character references such as &#160;. I get that's what it's supposed to do, but I need to maintain those entities as strings in the output...

Tomalak Over a year ago

No, ElementTree does no conversion and no "interpretation" whatsoever. The string in your DOM does not contain a space (character code 32), it contains an actual NBSP character (character code 160). They just look the same when printed to the screen.
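A short sketch of the point made in this comment, and of what happens on output. It relies on the standard library's serializer replacing characters that cannot be encoded in the target encoding with numeric character references, so an ASCII serialization should bring `&#160;` back. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring(
    '<Table_title>fatal crashes by&#160;time of day</Table_title>'
)

# The character between "by" and "time" is U+00A0, not an ordinary space.
nbsp_code = ord(elem.text[16])
is_plain_space = elem.text[16] == ' '

# Serializing to ASCII bytes: the NBSP cannot be encoded, so it is
# written as a numeric character reference.
as_ascii = ET.tostring(elem, encoding='us-ascii')

# Serializing to a Python str keeps the raw NBSP character.
as_unicode = ET.tostring(elem, encoding='unicode')

print(nbsp_code, is_plain_space)
print(as_ascii)
```

If the output file has to contain the reference rather than the raw character, writing with an ASCII encoding is one way to get it.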
