I am reading a large CSV into a dataframe. Pandas is setting the dtype to float64, which uses a significant amount of memory (2.4 GB for a 300 MB CSV). I would like to set some of the dtypes to integer types during import. I have a lot of NaNs in the dataset, so I am trying to use the nullable UInt dtypes introduced in pandas 0.24.
First I read the CSV to extract just the header and row #3, which contains the source data dtypes. These are in the form U1, U2 and F4, so I use df.replace to change them to the pandas dtypes UInt8, UInt16 and float64 respectively. I then create a dictionary from this dataframe and try to use it to define the dtype when I read the CSV, but I always get:
Exception has occurred: TypeError data type not understood
import pandas as pd
import numpy as np
filename = r'input/TST_short.csv'
dtype_df = pd.read_csv(filename, nrows=1, skiprows=[1,2])
dtype_df.replace('U1', 'UInt8', inplace=True)
dtype_df.replace('U2', 'UInt16', inplace=True)
dtype_df.replace('F4', 'float64', inplace=True)
dtype_dict = dtype_df.to_dict('records')
df = pd.read_csv(filename, skiprows=range(1,5), nrows=500, dtype=dtype_dict)
print(df)
If I remove the dtype= argument, the df looks like this:
           TIME     ST1F  ST2F  ST3F  ST4F    P12V
0   43476.34869  32768.0   1.0   0.0   0.0  12.161
1   43476.34870      NaN   NaN   NaN   NaN     NaN
2   43476.34872      NaN   NaN   NaN   NaN     NaN
3   43476.34873      NaN   NaN   NaN   NaN     NaN
4   43476.34874      NaN   NaN   NaN   NaN     NaN
5   43476.34875      NaN   NaN   NaN   NaN     NaN
6   43476.34876      NaN   NaN   NaN   NaN     NaN
7   43476.34877      NaN   NaN   NaN   NaN     NaN
8   43476.34878      NaN   NaN   NaN   NaN     NaN
9   43476.34880      NaN   NaN   NaN   NaN     NaN
10  43476.34881  32768.0   1.0   0.0   0.0     NaN
11  43476.34882      NaN   NaN   NaN   NaN  12.161
12  43476.34883      NaN   NaN   NaN   NaN     NaN
13  43476.34884      NaN   NaN   NaN   NaN     NaN
Everything is cast as float64.
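For context on why this happens, here is a minimal sketch with toy values (not taken from the real file). NumPy integer dtypes cannot represent NaN, so a column with missing values is upcast to float64 by default, while the nullable UInt8 extension dtype keeps the integers and tracks missing entries separately. The exact byte counts may differ between pandas versions.

```python
import pandas as pd

# Toy column with missing entries (illustrative values only).
values = [1, None, 0, None]

# Default inference: the missing values force an upcast to float64.
as_default = pd.Series(values)

# Nullable extension dtype: integers stay integers, gaps become <NA>.
as_nullable = pd.Series(values, dtype='UInt8')

print(as_default.dtype, as_default.memory_usage(index=False))
print(as_nullable.dtype, as_nullable.memory_usage(index=False))
```

On a column that is mostly small integers, the nullable dtype should need noticeably fewer bytes per value than float64.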
The dictionary (dtype_dict) looks like this:
{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}
Go easy on me if I haven't provided enough info; it's my first post and I am new to this :)
Update: Thanks for the help. It seems my dictionary is not formatted quite as it should be. I've tried all the to_dict orientations and thought 'records' was the closest, but when I compare the two, I can see that the dictionary generated from the df is enclosed in square brackets, i.e. it is a list containing one dictionary.
d = dtype_df.to_dict('records')
d1 = {
'TIME': 'float64',
'ST1F': 'UInt16',
'ST2F': 'UInt8',
'ST3F': 'UInt8',
'ST4F': 'UInt8',
'P12V': 'float64'}
print(d)
print(d1)
Output:
[{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}]
{'TIME': 'float64', 'ST1F': 'UInt16', 'ST2F': 'UInt8', 'ST3F': 'UInt8', 'ST4F': 'UInt8', 'P12V': 'float64'}
The dataframe I am generating the dictionary from has this form:
      TIME    ST1F   ST2F   ST3F   ST4F     P12V
0  float64  UInt16  UInt8  UInt8  UInt8  float64
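A minimal sketch of one way the list could be unwrapped, assuming the dtype row always yields exactly one record. Since to_dict('records') returns a list with one dictionary per row, taking element [0] gives the plain dictionary that the dtype= argument expects. The in-memory CSV, its 'x' filler rows and the name code_to_dtype are made up for illustration and only mimic the layout described above.

```python
import io
import pandas as pd

# Hypothetical miniature of the file layout: header, two unused rows,
# a row of source type codes, one more unused row, then the data.
csv_text = """TIME,ST1F,ST2F,P12V
x,x,x,x
x,x,x,x
F4,U2,U1,F4
x,x,x,x
43476.34869,32768,1,12.161
43476.34870,,,
43476.34881,32768,1,
"""

# Mapping from the source type codes to pandas dtype names.
code_to_dtype = {'U1': 'UInt8', 'U2': 'UInt16', 'F4': 'float64'}

# Read only the header and the type-code row (file line 3).
dtype_df = pd.read_csv(io.StringIO(csv_text), nrows=1, skiprows=[1, 2])

# to_dict('records') gives a list of row dicts; [0] picks the single row.
codes = dtype_df.to_dict('records')[0]
dtype_dict = {col: code_to_dtype[code] for col, code in codes.items()}

# Read the data rows with the per-column dtypes applied.
df = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 5), dtype=dtype_dict)
print(dtype_dict)
print(df.dtypes)
```

Building the mapping with a dictionary comprehension instead of three df.replace calls is only a convenience here; the key point is that dtype= receives a dict rather than a list.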