I have an issue that I think I've partly solved, but I'd like to understand it better and learn about better solutions.
The problem: I have tab-separated files with ~600k lines (plus one comment line). One field (out of 8) contains a string of variable length, anywhere between 1 and ~2000 characters.
Reading that file with the following function is terribly slow:
df = pd.read_csv(tgfile,
                 sep='\t',
                 comment='#',
                 header=None,
                 names=list_of_names)
However, perhaps I don't care so much about most of the string (field name of this string is 'motif') and I'm okay with truncating it if it's too long using:
def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif
df = pd.read_csv(tgfile,
                 sep='\t',
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)
This is suddenly much faster.
So my questions are:
- Why is reading this file so slow? Does it have to do with allocating memory?
- Why does the converter function help here? It has to execute an additional Python function for every row, yet it is still much faster...
- What else can be done?
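For comparison, here is a minimal, self-contained sketch of the two approaches side by side: truncating during parsing via `converters`, versus reading first and then truncating with a vectorized string operation. The tiny in-memory sample and the column names `id`/`motif` are stand-ins for the real ~600k-line file and its 8 fields; both approaches should produce the same result.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real tab-separated file.
sample = "# comment line\na\t" + "X" * 20 + "\nb\tYYY\n"
names = ["id", "motif"]

def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    return motif

# Option 1: truncate while parsing (the converter approach from the question).
df1 = pd.read_csv(io.StringIO(sample), sep='\t', comment='#',
                  header=None, names=names,
                  converters={'motif': truncate_motif})

# Option 2: read everything first, then truncate with vectorized .str methods.
df2 = pd.read_csv(io.StringIO(sample), sep='\t', comment='#',
                  header=None, names=names)
long_mask = df2['motif'].str.len() > 8
df2.loc[long_mask, 'motif'] = df2.loc[long_mask, 'motif'].str.slice(0, 8) + '~'

print(df1['motif'].tolist())  # ['XXXXXXXX~', 'YYY']
```

Option 2 may or may not help with the original slowness, since the long strings still have to be allocated before truncation; it is shown only as an alternative worth benchmarking against the converter approach.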