Formatting a tab-delimited text file with Python

Question

I’m updating a Python script from 2 to 3. It reads in a manifest (i.e., [batchdate]xmlList.xml), iterates through each XML file identified in the manifest, collects stats, then outputs a stats file in tab-delimited text format. The formatting and encoding of the tab file is off, and I can’t figure out how to fix it.

for encoding in utf-8:

class UnicodeWriter:

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f

    def writerow(self, row):
        self.writer.writerow([str(s).encode("utf-8") for s in row])
        data = self.queue.getvalue()
        self.stream.write(data)
        self.queue.truncate(0)

read in xmllist.xml manifest:

xmlListPath = input('Enter the filepath of the xmlList.xml file: ').replace('"', '')
xmlListFile = codecs.open(xmlListPath)
xmlList = etree.parse(xmlListFile)
listRoot = xmlList.getroot()
xmlListFile.close()

create stats file and write header:

batchID = path.split(xmlListPath)[1]
statsFile = 'S:/Metadata/ETD/Documentation/Statistics/' + batchID.replace('xmlList.xml', '.stats.txt')
stats = open(statsFile, 'w')
wtrStats = UnicodeWriter(stats, delimiter='\t')
statsHeader = ['Author', 'Degree', 'Department', 'Embargo Start Date', 'Date Web Available',
               'Embargo Code', 'Identifier', 'PURL', 'Title', 'Comments']
wtrStats.writerow(statsHeader)

Here is how the tab file is coming out:

b'Author'   b'Degree'   b'Department'   b'Embargo Start Date'   b'Date Web Available'   b'Embargo Code' b'Identifier'   b'PURL' b'Title'    b'Comments'

                                                                                                                                          b'Confer, Matthew Phelan' b'Ph.D.'    b'Chemical & Biological Engineering'    b'01/01/2021'   b'01/01/2026'   b'4'    b'u0015_0000001_0003682'    b'http://purl.lib.ua.edu/177826'    b'EXPERIMENTAL AND COMPUTATIONAL STUDIES OF MATERIALS DECOMPOSITION'    b''

Thanks for any help.

Why are you converting to bytes before calling self.writer.writerow()? That is what puts the b prefix and the apostrophes around every value.
– pho, Commented May 4, 2021 at 18:38

jsbueno · Accepted Answer · 2021-05-04 18:40:06Z

In Python 3, the csv module's readers and writers work with strings (Unicode text). When you feed the writer bytes by pre-encoding your strings, it writes the string representation of each bytes object, which is a b'...' prefixed string.
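To see the effect in isolation, here is a minimal sketch that writes the same header once as pre-encoded bytes and once as plain strings. The `write_row` helper and the in-memory `io.StringIO` buffer are only for illustration and are not part of the original script.

```python
import csv
import io


def write_row(row):
    """Write one tab-delimited row to an in-memory buffer and return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t").writerow(row)
    return buffer.getvalue()


header = ["Author", "Degree"]

# Pre-encoded values: csv.writer converts each bytes object with str(),
# which yields its repr, so the b'...' wrapper ends up in the output.
encoded_line = write_row([s.encode("utf-8") for s in header])

# Plain str values are written unchanged.
plain_line = write_row(header)

print(repr(encoded_line))
print(repr(plain_line))
```

The first line printed contains `b'Author'` and `b'Degree'`, matching the stats file in the question; the second contains only the bare column names.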

TL;DR: simply open your output file in the desired encoding, and point your csv.writer object to it - there is absolutely no need for this UnicodeWriter intermediate class you are listing.

import csv
...
stats = open(statsFile, 'w', encoding="utf-8")
wtrStats = csv.writer(stats, delimiter="\t")
...
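A slightly fuller sketch of the same idea follows, under a few assumptions: the `write_stats` helper, the sample rows, and the temporary output path are illustrative, not taken from the original script. It also passes `newline=""`, which the csv module documentation recommends when opening a file for a writer; that is an addition to the snippet above and may or may not be related to the blank lines in the question's output.

```python
import csv
import os
import tempfile


def write_stats(path, rows):
    """Write rows to path as UTF-8, tab-delimited text."""
    # newline="" lets the csv module control line endings itself. Without it,
    # Windows may translate "\r\n" into "\r\r\n", which appears as blank lines.
    with open(path, "w", encoding="utf-8", newline="") as stats:
        writer = csv.writer(stats, delimiter="\t")
        writer.writerows(rows)


rows = [
    ["Author", "Degree", "Department"],
    ["Confer, Matthew Phelan", "Ph.D.", "Chemical & Biological Engineering"],
]

# Hypothetical output location; the real script builds statsFile from batchID.
stats_path = os.path.join(tempfile.mkdtemp(), "example.stats.txt")
write_stats(stats_path, rows)

# Read the file back without newline translation to inspect what was written.
with open(stats_path, encoding="utf-8", newline="") as f:
    written = f.read()
print(written)
```

Because the file object handles the encoding, every value stays a str until it is written, and no wrapper class is needed.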


2 Comments

pho Over a year ago

I find it amusing that half your answer is the TL;DR part :) Thanks for the chuckle

jsbueno Over a year ago

Actually, it is the whole answer. There is just an extra preface to it.
