UPDATE:
TomNash's answer solves the question as asked. However, attempting to use it in my real problem led to issues with quoted column names, issues when there was missing data, etc. To circumvent this I'm using CJR's suggestion in the comments to simply pickle my DataFrames.
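For reference, a minimal sketch of the pickle round trip mentioned above, assuming the same toy DataFrame as in the original question. The file name tmp.pkl and the temporary directory are illustrative choices, not part of my real setup.

```python
import os
import tempfile

import pandas as pd

# Same toy frame as in the original question: two string columns, one int column.
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])

# Write to a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tmp.pkl')  # hypothetical file name
    df1.to_pickle(path)
    df2 = pd.read_pickle(path)

# Pickling stores the frame itself, so the dtypes come back unchanged.
dtypes_match = df1.dtypes.equals(df2.dtypes)
frames_match = df1.equals(df2)
print(dtypes_match, frames_match)
```

The trade-off is that a pickle file is binary, so it cannot be diffed or opened in a text editor the way a CSV can.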
ORIGINAL QUESTION BELOW:
I have a pandas DataFrame in memory. I would like to write it to a file (using to_csv), then use read_csv to read the results into a new DataFrame. I would like the original DataFrame and the new "from file" DataFrame to have identical data types.
I've attempted to get this working by using the quoting and quotechar arguments for both to_csv and read_csv. However, this doesn't seem to do the trick.
I understand that for read_csv the dtype argument can be used to force data types, but this isn't practical for my use case (lots of auto-generated files used for regression testing).
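To make the dtype approach concrete, here is a small sketch of what forcing the types looks like when the schema is known in advance. The CSV text is inlined through io.StringIO for illustration, and the dtype mapping is written by hand, which is exactly the part that does not scale to many auto-generated files.

```python
import io

import pandas as pd

# Inline copy of the CSV contents, used here instead of a file on disk.
csv_text = '0,1,2\n"A","100",100\n"B","200",200\n'

# The header row is read as text, so the column labels are the strings '0', '1', '2'.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'0': str, '1': str, '2': 'int64'},  # hand-written schema (assumption)
)

print(df.dtypes)
print(df['1'].tolist())
```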
Full example below.
tmp.py:
import pandas as pd
from csv import QUOTE_NONNUMERIC
import sys
print('Python version information:')
print(sys.version)
print('Pandas version information:')
print(pd.__version__)
df1 = pd.DataFrame([['A', '100', 100], ['B', '200', 200]])
print('df1:')
print(df1.info())
df1.to_csv('tmp.csv', index=False, quoting=QUOTE_NONNUMERIC,
           quotechar='"')
df2 = pd.read_csv('tmp.csv', quoting=QUOTE_NONNUMERIC, quotechar='"')
print('df2:')
print(df2.info())
Output from running tmp.py:
Python version information:
3.7.3 (default, Jun 11 2019, 01:11:15)
[GCC 6.3.0 20170516]
Pandas version information:
0.24.2
df1:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null object
2 2 non-null int64
dtypes: int64(1), object(2)
memory usage: 128.0+ bytes
None
df2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null object
1 2 non-null float64
2 2 non-null float64
dtypes: float64(2), object(1)
memory usage: 128.0+ bytes
None
- Column 1: As expected, the dtype is object for both DataFrames.
- Column 2: Unexpected behavior. For df1 the dtype is object, while for df2 the dtype is float64.
- Column 3: Expected behavior. df1 has dtype int64 while df2 has dtype float64. As the csv module describes, csv.QUOTE_NONNUMERIC "Instructs the reader to convert all non-quoted fields to type float."
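For comparison, this is a short sketch of how the standard library's csv.reader treats the same text with QUOTE_NONNUMERIC. It only demonstrates the csv module's documented behavior; it says nothing about how read_csv is implemented internally. The CSV text is inlined for illustration.

```python
import csv
import io

# Same contents as tmp.csv, inlined so the sketch is self-contained.
csv_text = '0,1,2\n"A","100",100\n"B","200",200\n'

# With QUOTE_NONNUMERIC, the stdlib reader converts unquoted fields to float
# and leaves quoted fields as str.
rows = list(csv.reader(io.StringIO(csv_text), quoting=csv.QUOTE_NONNUMERIC))

for row in rows:
    print([type(value).__name__ for value in row])
```

Here the quoted "100" stays a str, which is the behavior I was hoping read_csv would mirror. The unquoted header row is converted to floats as well.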
The contents of tmp.csv are below. Notice that the second column is quoted, so I would expect read_csv to give me an object.
tmp.csv:
0,1,2
"A","100",100
"B","200",200
cPickle or dill would preserve that information - if you're just looking to regression test against a bunch of objects that might be easier than writing your own converter for pd.read_csv().
[...] diff files when the expected results change and b) be able to crack open the files with a text editor or spreadsheet program to take a quick peek.
QUOTE_NONE does not do what I want. The difference is that in df2 both column 2 and 3 have dtypes of int64 (as opposed to float64).
[...] float64 values in df1.iloc[:,2], just int64. The behavior of read_csv will convert straight to float64 for any numeric. The solution below will preserve datatypes between read/write, which I believe is what you want.
[...] object, but I can write a simple wrapper to handle this. Since it's just for testing, I'm not upset about performance implications of doing all this replacement.