How to get consistent dtypes after to_csv and read_csv?

Question

UPDATE:

TomNash's answer solves the question as asked. However, attempting to use it in my real problem led to issues with quoted column names, issues when there was missing data, etc. To circumvent this I'm using CJR's suggestion in the comments to simply pickle my DataFrames.
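For reference, a minimal sketch of the pickle round trip mentioned above. It uses the same small frame as the example in the original question; the file name `expected.pkl` and the temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

import pandas as pd

# Same frame as in the question: two string columns and one integer column.
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'expected.pkl')  # hypothetical file name
    df1.to_pickle(path)
    df2 = pd.read_pickle(path)

# Pickling stores the dtypes with the data, so nothing is re-inferred on load.
dtypes_match = df1.dtypes.equals(df2.dtypes)
frames_match = df1.equals(df2)
print(dtypes_match, frames_match)
```

The trade-off is the one raised in the comments: the pickle file is binary, so it cannot be diffed or opened in a text editor.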

ORIGINAL QUESTION BELOW:

I have a pandas DataFrame in memory. I would like to be able to write it to file (using to_csv), then use read_csv to read the results into a new DataFrame. I would like the original DataFrame and the new "from file DataFrame" to have identical data types.

I've attempted to get this working by using the quoting and quotechar arguments for both to_csv and read_csv. However, this doesn't seem to do the trick.

I understand that for read_csv the dtype argument can be used to force data types, but this isn't practical for my use case (lots of auto-generated files used for regression testing).

Full example below.

tmp.py:

import pandas as pd
from csv import QUOTE_NONNUMERIC
import sys

print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)

df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])
print('df1:')
print(df1.info())

df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC,
           quotechar='"')

df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONNUMERIC, quotechar='"')
print('df2:')
print(df2.info())

Output from running tmp.py:

Python version information:
3.7.3 (default, Jun 11 2019, 01:11:15) 
[GCC 6.3.0 20170516]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null object
2    2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null float64
2    2 non-null float64
dtypes: float64(2), object(1)
memory usage: 128.0+ bytes
None

Column 1: As expected, the dtype is object for both DataFrames.
Column 2: Unexpected behavior. For df1 the dtype is object, while for df2 the dtype is float64.
Column 3: Expected behavior. df1 has dtype int64 while df2 has dtype float64. As the csv module describes, csv.QUOTE_NONNUMERIC "Instructs the reader to convert all non-quoted fields to type float."
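To make the quoted documentation concrete, here is a small sketch of what the standard-library csv reader does with the same two data rows. It uses only the csv module, not pandas, so it shows the documented csv behavior rather than what read_csv does internally.

```python
import csv
import io

# The two data rows from tmp.csv (header omitted, since it is unquoted).
text = '"A","100",100\n"B","200",200\n'

# QUOTE_NONNUMERIC: quoted fields stay strings, unquoted fields become floats.
rows = list(csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONNUMERIC))
print(rows)
```

Here the quoted "100" stays a string and only the unquoted 100 is converted to a float, which is the behavior I expected read_csv to mirror for the second column.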

The contents of tmp.csv are below. Notice that the second column is quoted, so I would expect read_csv to give me an object.

tmp.csv:

0,1,2
"A","100",100
"B","200",200

I'm not sure that you can get the behavior that you want easily from a CSV. Serializing a dataframe with cPickle or dill would preserve that information - if you're just looking to regression test against a bunch of objects that might be easier than writing your own converter for pd.read_csv().
– CJR, Commented Jul 10, 2019 at 17:02
@CJR - thanks for your input! I'll definitely consider that. The downside to the pickling approach is version control and readability: it's nice to have plain text files in my repository to a) be able to diff files when the expected results change and b) be able to crack open the files with a text editor or spreadsheet program to take a quick peek.
– blthayer, Commented Jul 10, 2019 at 17:08
@TomNash - QUOTE_NONE does not do what I want. The difference is that in df2 both column 2 and 3 have dtypes of int64 (as opposed to float64).
– blthayer, Commented Jul 10, 2019 at 17:11
I think the problem is that you don't have any float64 values in df1.iloc[:,2], just int64. The behavior of read_csv will convert straight to float64 for any numeric. The solution below will preserve datatypes between read/write, which I believe is what you want.
– TomNash, Commented Jul 10, 2019 at 17:23
@TomNash - your solution below solves my problem as asked. I am having to mess around with all my column names now being quoted and perform a replacement of the quote character on all columns with dtype of object, but I can write a simple wrapper to handle this. Since it's just for testing, I'm not upset about performance implications of doing all this replacement.
– blthayer, Commented Jul 10, 2019 at 17:37

TomNash · Accepted Answer · 2019-07-10 19:31:27Z

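Try using QUOTE_NONE on the read instead. In the examples here it preserves the dtypes across the write/read round trip: with QUOTE_NONE the quote characters are read as literal text, so quoted values stay strings, and the replace call then strips the quotes out.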
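A short sketch of the intermediate state may help show why this works. It reads the same CSV text from an in-memory buffer (an assumption made to keep the snippet self-contained; the names `raw` and `cleaned` are illustrative).

```python
import io
from csv import QUOTE_NONE

import pandas as pd

# Contents of tmp.csv as written with QUOTE_NONNUMERIC.
csv_text = '0,1,2\n"A","100",100\n"B","200",200\n'

# With QUOTE_NONE the quote characters are kept as ordinary data, so the
# second column cannot be parsed as a number and stays a string column.
raw = pd.read_csv(io.StringIO(csv_text), quoting=QUOTE_NONE)
print(repr(raw.iloc[0, 1]))

# Strip the literal quotes from the string values; numeric columns are untouched.
cleaned = raw.replace('"', '', regex=True)
print(repr(cleaned.iloc[0, 1]))
print(cleaned.dtypes)
```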

Using the original dataset with int64:

import pandas as pd
from csv import QUOTE_NONNUMERIC, QUOTE_NONE
import sys

print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)

df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])
print('df1:')
print(df1.info())

df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC, quotechar='"')

df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONE).replace('"','', regex=True)
print('df2:')
print(df2.info())

Result:

Python version information:
3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null object
2    2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null object
2    2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None

Using float64 values in the input:

import pandas as pd
from csv import QUOTE_NONNUMERIC, QUOTE_NONE
import sys

print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)

df1 = pd.DataFrame([['A', '100', 100.1], ['B', '200', 200.2]])
print('df1:')
print(df1.info())

df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC, quotechar='"')

df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONE).replace('"','', regex=True)
print('df2:')
print(df2.info())

Result:

Python version information:
3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null object
2    2 non-null float64
dtypes: float64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null object
1    2 non-null object
2    2 non-null float64
dtypes: float64(1), object(2)
memory usage: 128.0+ bytes
None
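If the DataFrame has string column names, the header is quoted on write as well, so the names come back with literal quote characters. A minimal sketch of a wrapper that handles this is below; `read_quoted_csv` and the column names `name`, `code`, and `value` are hypothetical, and the in-memory buffer stands in for a real file.

```python
import io
from csv import QUOTE_NONE, QUOTE_NONNUMERIC

import pandas as pd


def read_quoted_csv(source):
    """Hypothetical helper: read a QUOTE_NONNUMERIC CSV and strip the quotes."""
    df = pd.read_csv(source, quoting=QUOTE_NONE)
    # String column names were quoted on write, so unquote them here.
    df.columns = [str(name).strip('"') for name in df.columns]
    # Remove the literal quote characters left in the string columns.
    return df.replace('"', '', regex=True)


df1 = pd.DataFrame({'name': ['A', 'B'], 'code': ['100', '200'], 'value': [100, 200]})

buffer = io.StringIO()
df1.to_csv(buffer, index=False, quoting=QUOTE_NONNUMERIC)
buffer.seek(0)

df2 = read_quoted_csv(buffer)

# Raises AssertionError if values, column names, or dtypes differ.
pd.testing.assert_frame_equal(df1, df2)
print(df2.dtypes)
```

This sketch only covers simple data. Because the quotes are ignored while parsing, string values that contain commas or quote characters would not survive the round trip, and missing values need separate handling.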
