0

From mysql I am generating a tab-separated output file using outfile. I then use python to load the tsv and process it. I feel like I'm missing something, but I cannot figure out how to get csv.reader to accept data where quoted fields can contain \t tabs, \n newlines, \r carriage returns, etc. The csv.reader keeps breaking the rows on all newline characters, not just the \n newline characters outside of my quoted fields.

Settings:

with open('/path/to/file.tsv', 'rbU') as f:
    reader = csv.reader(
        f,
        delimiter='\t',
        lineterminator='\n',
        quoting=csv.QUOTE_ALL
    )
    for line in reader:
        #  do something

Example:

In the example below, \r is an actual carriage return, \n is an actual newline, and \N is what mysql is outputting for a null value.

"4256996"   "[email protected]"    "Y  "   "98230\r"   "2012-07-10T12:00:00"   "some  location"    \N  \N  "false" "aaa"   "another-field" "true"  1

The resulting output:

['4256996', '[email protected]', 'Y\t', '98230'], ['2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

Is there a way to get the csv.reader to read this input data properly, or is this some sort of limitation with the csv.reader object?

Note: If you try to replicate this, make sure you replace \r with an actual carriage return, \n with an actual newline, etc.

3
  • How are you opening the file? Please include the open() call and the way you set up the reader. Commented Oct 1, 2014 at 17:01
  • @MartijnPieters updated it with a more explicit example Commented Oct 1, 2014 at 17:33
  • Why the 'rbU' mode? Binary mode doesn't do universal line endings, universal line endings assumes text mode instead. Commented Oct 1, 2014 at 17:42

1 Answer 1

1

You need to open your file in binary mode only. By adding in 'U' (universal newline mode) you are instead instructing Python to replace any \r with \n.

with open('/path/to/file.tsv', 'rb') as f:

Once reading just binary data your sample input works:

>>> import csv
>>> from io import BytesIO
>>> sample = BytesIO('''\
... "4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n''')
>>> sample.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
>>> sample.seek(0)
0L
>>> reader = csv.reader(sample, delimiter='\t',
...         lineterminator='\n',
...         quoting=csv.QUOTE_ALL
...     )
>>> next(reader)
['4256996', '[email protected]', 'Y  ', '98230\r', '2012-07-10T12:00:00', 'some  location', '\\N', '\\N', 'false', 'aaa', 'another-field', 'true', '1']

To illustrate, reading a line with the U mode set Python reads the data incorrectly:

>>> sample.seek(0)
0L
>>> open('/tmp/test.csv', 'wb').write(sample.read())
>>> f = open('/tmp/test.csv', 'rbU')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\n'
>>> f = open('/tmp/test.csv', 'rb')
>>> f.readline()
'"4256996"\t"[email protected]"\t"Y  "\t"98230\r"\t"2012-07-10T12:00:00"\t"some  location"\t\\N\t\\N\t"false"\t"aaa"\t"another-field"\t"true"\t1\r\n'
Sign up to request clarification or add additional context in comments.

1 Comment

Argghhh! Thank you! I added the b option at the same time I added the U option and missed that...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.