I'm using ElementTree to compare a CSV file to an XML document. The script should update the tags if the tag matches the first cell in the CSV. The tag needs to have a non-breaking space to prevent the text from wrapping when I import the XML into a different program (InDesign).

XML Input:

<Table_title>fatal crashes by&#160;time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>

CSV input:

'fatal crashes by&#160;time of day', data1, data2, data3

However, when I read the XML into the ElementTree script using ET.parse('file.xml'), it seems to render the character as a non-breaking space:

<Table_title>fatal crashes by time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>

Which is exactly what it should do (I think). But in this scenario, I actually want &#160; to render as a string, so that it matches the first cell of the CSV (because when the CSV is read in, it interprets it as a string: 'fatal crashes by&#160;time of day').

Is there a way to:

  1. Force the XML script to read the non-breaking space as a string instead of an escaped character: <Table_title>fatal crashes by&#160;time of day</Table_title>

or

  2. Force the XML script to read the CSV and render the character as an escaped character instead of a string: 'fatal crashes by time of day', data1, data2, data3

1 Answer

Here is what happens.

You read this XML into ElementTree:

<Table_title>fatal crashes by&#160;time of day</Table_title>

ElementTree parses it and turns it into this DOM:

  • element node, name Table_title
    • text node, string value: "fatal crashes by・time of day" (where ・ stands for the character with code 160, i.e. the non-breaking space)

This is 100% correct and you can't (and should not want to) do anything about it.
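
To see this for yourself, here is a minimal sketch that parses the element from the question and inspects the resulting string. It uses repr() because printing the text directly shows the NBSP as something that looks like an ordinary space.

```python
import xml.etree.ElementTree as ET

# The element from the question, parsed from a string instead of a file.
elem = ET.fromstring('<Table_title>fatal crashes by&#160;time of day</Table_title>')
text = elem.text

# repr() shows the non-breaking space as the escape \xa0.
print(repr(text))

# The character reference was decoded to U+00A0 (160), not to a plain space (32).
print('\xa0' in text)
print('&#160;' in text)
```

The first check is true and the second is false: the parsed text holds the real character, and the six-character sequence &#160; no longer exists in it.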

Your CSV also appears to contain a snippet of XML in its first column. However, it remains un-parsed until you parse it.

If you want to be able to compare the text values, you have no choice but to XML-parse the first column.

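import csv
import xml.etree.ElementTree as ET

# open your XML file as usual; 'file.csv' is a placeholder for your CSV file
with open('file.csv', newline='') as f:
    for row in csv.reader(f):
        # wrap the first cell in a dummy element so the XML parser decodes &#160;
        temp = ET.fromstring('<temp>' + row[0] + '</temp>')
        print(temp.text)

        # compare temp.text to your XML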
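
A more complete sketch of the comparison, using the data from the question held in in-memory strings so it runs without any files. The <root> wrapper element, the variable names, and the decode_cell helper are assumptions made for illustration; your real document will have its own root element and structure.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed stand-ins for file.xml and the CSV file.
XML_SOURCE = """<root>
<Table_title>fatal crashes by&#160;time of day</Table_title>
<cell>data1</cell>
<cell>data2</cell>
<cell>data3</cell>
</root>"""

CSV_SOURCE = "fatal crashes by&#160;time of day,data1,data2,data3\n"


def decode_cell(cell):
    """Parse a CSV cell as XML character data and return the decoded text."""
    # Hypothetical helper: wraps the cell in a dummy element so that
    # character references such as &#160; are decoded by the XML parser.
    return ET.fromstring('<temp>' + cell + '</temp>').text or ''


root = ET.fromstring(XML_SOURCE)
title = root.find('Table_title')

# Collect the CSV rows whose first cell matches the parsed title text.
matches = []
for row in csv.reader(io.StringIO(CSV_SOURCE)):
    if row and decode_cell(row[0]) == title.text:
        matches.append(row)

print(matches)
```

Note that ET.fromstring raises xml.etree.ElementTree.ParseError if a cell contains a bare & or < that is not valid XML, so cells that are not XML snippets would need to be handled separately.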

2 Comments

This works great for hyphens and dashes. But I also want to include whitespace and non-breaking spaces. But ElementTree just reads them, interprets them as a space, and writes them out as a space instead of as &#160; or &#13;. I get that's what it's supposed to do, but I need to maintain those entities as strings in the output...
No, ElementTree does no conversion and no "interpretation" whatsoever. The string in your DOM does not contain a space (character code 32), it contains an actual NBSP character (character code 160). They just look the same when printed to the screen.
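
A small sketch of what happens on output, in case it helps with the concern above. When ElementTree serializes to an ASCII encoding, any character that cannot be represented in ASCII is written as a numeric character reference, so the NBSP comes back out as &#160;. With encoding='unicode' the result is a Python string that contains the actual character instead.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<Table_title>fatal crashes by&#160;time of day</Table_title>')

# ASCII output: the NBSP cannot be encoded, so it is written as &#160;.
ascii_bytes = ET.tostring(elem, encoding='us-ascii')
print(ascii_bytes)

# Unicode output: the string contains the real U+00A0 character.
unicode_str = ET.tostring(elem, encoding='unicode')
print(repr(unicode_str))
```

Both outputs describe the same document; they differ only in how the character is spelled in the serialized form.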
