I’m updating a Python script from Python 2 to Python 3. It reads in a manifest (i.e., [batchdate]xmlList.xml), iterates through each XML file identified in the manifest, collects stats, then outputs a stats file in tab-delimited text format. The formatting and encoding of the tab file are off, and I can’t figure out how to fix them.

for encoding in utf-8:

class UnicodeWriter:

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f

    def writerow(self, row):
        self.writer.writerow([str(s).encode("utf-8") for s in row])
        data = self.queue.getvalue()
        self.stream.write(data)
        self.queue.truncate(0)

read in xmllist.xml manifest:

xmlListPath = input('Enter the filepath of the xmlList.xml file: ').replace('"', '')
xmlListFile = codecs.open(xmlListPath)
xmlList = etree.parse(xmlListFile)
listRoot = xmlList.getroot()
xmlListFile.close()

create stats file and write header:

batchID = path.split(xmlListPath)[1]
statsFile = 'S:/Metadata/ETD/Documentation/Statistics/' + batchID.replace('xmlList.xml', '.stats.txt')
stats = open(statsFile, 'w')
wtrStats = UnicodeWriter(stats, delimiter='\t')
statsHeader = ['Author', 'Degree', 'Department', 'Embargo Start Date', 'Date Web Available',
               'Embargo Code', 'Identifier', 'PURL', 'Title', 'Comments']
wtrStats.writerow(statsHeader)

Here is how the tab file is coming out:

b'Author'   b'Degree'   b'Department'   b'Embargo Start Date'   b'Date Web Available'   b'Embargo Code' b'Identifier'   b'PURL' b'Title'    b'Comments'

                                                                                                                                          b'Confer, Matthew Phelan' b'Ph.D.'    b'Chemical & Biological Engineering'    b'01/01/2021'   b'01/01/2026'   b'4'    b'u0015_0000001_0003682'    b'http://purl.lib.ua.edu/177826'    b'EXPERIMENTAL AND COMPUTATIONAL STUDIES OF MATERIALS DECOMPOSITION'    b''

Thanks for any help.

  • What should the tab file come out as? Commented May 4, 2021 at 18:38
  • Why are you converting to bytes before doing self.writer.writerow()? That is what puts the b and apostrophes around each column header. Commented May 4, 2021 at 18:38

1 Answer

In Python 3, the csv module's readers and writers expect strings (Unicode text). When you pre-encode your strings and feed the writer bytes, it writes the string representation of each bytes object, which is the b'...'-prefixed form you are seeing.
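
To see that behavior in isolation, here is a minimal sketch. It writes to an in-memory io.StringIO rather than a real file, and the two-column row is just for illustration; neither is part of the original script.

```python
import csv
import io

# In-memory text buffer standing in for the output file (illustration only).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")

# Pre-encoded values: csv.writer converts non-string fields with str(),
# so each bytes object is written as its repr, b'...'.
writer.writerow([s.encode("utf-8") for s in ["Author", "Degree"]])

# Plain str values are written unchanged.
writer.writerow(["Author", "Degree"])

lines = buf.getvalue().splitlines()
print(lines[0])  # the row built from bytes
print(lines[1])  # the row built from str
```

The first printed line carries the b'...' wrappers and the second does not, which matches the difference between the broken output in the question and the intended output.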

TL;DR: simply open your output file in the desired encoding, and point your csv.writer object to it - there is absolutely no need for this UnicodeWriter intermediate class you are listing.

import csv
...
stats = open(statsFile, 'w', encoding="utf-8", newline="")  # newline="" is what the csv docs recommend for files handed to csv.writer
wtrStats = csv.writer(stats, delimiter="\t")
...
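
Put together, the approach might look like the sketch below. The temporary output path, the shortened header, and the second data row (with a made-up non-ASCII name) are assumptions for illustration; the real script would keep its own statsFile path and full header.

```python
import csv
import os
import tempfile

# Hypothetical output location; the original script builds statsFile itself.
stats_path = os.path.join(tempfile.mkdtemp(), "example.stats.txt")

header = ["Author", "Degree", "Department"]
rows = [
    ["Confer, Matthew Phelan", "Ph.D.", "Chemical & Biological Engineering"],
    ["Müller, Anna", "M.S.", "Physics"],  # made-up row to exercise non-ASCII text
]

# The file object does the encoding; csv.writer only ever sees str values.
# newline="" lets the csv module control line endings itself.
with open(stats_path, "w", encoding="utf-8", newline="") as stats:
    wtr_stats = csv.writer(stats, delimiter="\t")
    wtr_stats.writerow(header)
    wtr_stats.writerows(rows)

# Read the raw bytes back to check what actually landed on disk.
with open(stats_path, "rb") as f:
    raw = f.read()
text = raw.decode("utf-8")
print(text)
```

Using a with block also closes the file when writing is done, so no explicit stats.close() is needed.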

2 Comments

I find it amusing that half your answer is the TL;DR part :) Thanks for the chuckle
Actually, it is the whole answer. There is just an extra preface to it.
