I have an issue that I think I've partly solved, but I'd like to understand it better and learn about better solutions.
The problem: I have tab-separated files with ~600k lines (plus one comment line). One field (out of 8) contains a string of variable length, anywhere between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
df = pd.read_csv(tgfile,
                 sep='\t',
                 comment='#',
                 header=None,
                 names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif
df = pd.read_csv(tgfile,
                 sep='\t',
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)
This is suddenly much faster.
So my questions are:
- Why is reading this file so slow? Does it have to do with allocating memory?
- Why does the converter function help here? It has to execute an additional Python function for every row, yet it is still much faster...
- What else can be done?
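For comparison, here is a minimal, self-contained sketch of the two approaches side by side: truncating during parsing via `converters`, versus reading first and then truncating with a vectorized string operation. The tiny in-memory sample and the column names `id`/`motif` are stand-ins for the real ~600k-line file and its 8 fields; both approaches should produce the same result.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real tab-separated file.
sample = "# comment line\na\t" + "X" * 20 + "\nb\tYYY\n"
names = ["id", "motif"]

def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    return motif

# Option 1: truncate while parsing (the converter approach from the question).
df1 = pd.read_csv(io.StringIO(sample), sep='\t', comment='#',
                  header=None, names=names,
                  converters={'motif': truncate_motif})

# Option 2: read everything first, then truncate with vectorized .str methods.
df2 = pd.read_csv(io.StringIO(sample), sep='\t', comment='#',
                  header=None, names=names)
long_mask = df2['motif'].str.len() > 8
df2.loc[long_mask, 'motif'] = df2.loc[long_mask, 'motif'].str.slice(0, 8) + '~'

print(df1['motif'].tolist())  # ['XXXXXXXX~', 'YYY']
```

Option 2 may or may not help with the original slowness, since the long strings still have to be allocated before truncation; it is shown only as an alternative worth benchmarking against the converter approach.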