Python tab-separated file parsing problems

Question

I am generating a tab-separated output file from MySQL using outfile, then loading the TSV in Python to process it. I feel like I'm missing something, but I cannot figure out how to get csv.reader to accept data where quoted fields can contain tabs (\t), newlines (\n), carriage returns (\r), etc. The csv.reader keeps breaking rows on every newline character, not just on the \n newlines outside my quoted fields.

Settings:

import csv

with open('/path/to/file.tsv', 'rbU') as f:
    reader = csv.reader(
        f,
        delimiter='\t',
        lineterminator='\n',
        quoting=csv.QUOTE_ALL
    )
    for line in reader:
        pass  # do something

Example:

In the example below, \r is an actual carriage return, \n is an actual newline, and \N is what MySQL outputs for a null value.

"4256996"   "[email protected]"    "Y  "   "98230\r"   "2012-07-10T12:00:00"   "some  location"    \N  \N  "false" "aaa"   "another-field" "true"  1

The resulting output:

['4256996', '[email protected]', 'Y\t', '98230'], ['2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

Is there a way to get the csv.reader to read this input data properly, or is this some sort of limitation with the csv.reader object?

Note: If you try to replicate this, make sure you replace \r with an actual carriage return, \n with an actual newline, etc.

How are you opening the file? Please include the open() call and the way you set up the reader.
– Martijn Pieters, Commented Oct 1, 2014 at 17:01
Why the 'rbU' mode? Binary mode doesn't do universal line endings; universal line endings assume text mode instead.
– Martijn Pieters, Commented Oct 1, 2014 at 17:42

Martijn Pieters · Accepted Answer · 2014-10-01 17:50:14Z

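You need to open your file in binary mode only. Adding 'U' (universal newline mode) instead instructs Python to translate every \r into \n as the file is read, including the \r inside your quoted fields, so the csv.reader never sees the original data.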

with open('/path/to/file.tsv', 'rb') as f:

Once the file is read as plain binary data, your sample input parses correctly:

>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n''')
>>> sample.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
...         lineterminator='\n',
...         quoting=csv.QUOTE_ALL
...     )
>>> next(reader)
['4256996', '[email protected]', 'Y  ', '98230\r', '2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

To illustrate, with the U mode set Python reads the first line incorrectly, while plain binary mode returns it intact:

>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
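
The sessions above are Python 2. As a rough sketch of the same effect on Python 3, where the csv module expects a text-mode file opened with newline='', the snippet below writes a shortened version of the sample row to a temporary file and parses it twice. The trimmed row, the read_rows helper, and the temporary file name are assumptions made for illustration; they are not part of the original question or answer.

```python
import csv
import os
import tempfile

# One shortened record; the second field holds a carriage return inside the quotes.
RAW = b'"4256996"\t"98230\r"\t"2012-07-10T12:00:00"\t\\N\t1\n'


def read_rows(path, newline):
    """Parse the file as tab-separated values using the given newline mode.

    Hypothetical helper, used only to compare the two open() settings.
    """
    with open(path, "r", newline=newline, encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tsv")
    with open(path, "wb") as f:
        f.write(RAW)

    # newline=None (the default) translates \r and \r\n into \n while reading.
    translated = read_rows(path, None)
    # newline='' leaves line endings untouched, as the csv docs recommend.
    preserved = read_rows(path, "")

print(translated)
print(preserved)
```

With newline='' the carriage return should survive inside the quoted field, while the default mode hands csv.reader a \n in its place. This mirrors the 'rb' versus 'rbU' difference shown above, but it is only an approximation of the Python 2 behaviour, not a reproduction of it.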

1 Comment

JesseBuesking Over a year ago

Argghhh! Thank you! I added the b option at the same time I added the U option and missed that...
