I have an issue which I think I have somewhat solved, but I would like to understand it better or learn about better solutions.

The problem: I have tab-separated files with ~600k lines (plus one comment line), in which one field (out of 8) contains a string of variable length, anywhere between 1 and ~2000 characters.

Reading that file with the following function is terribly slow:

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 names=list_of_names)

However, I don't really care about most of the string (the field's name is 'motif'), and I'm okay with truncating it if it's too long, using:

def truncate_motif(motif):
    if len(motif) > 8:
        return motif[:8] + '~'
    else:
        return motif

df = pd.read_csv(tgfile,
                 sep="\t",
                 comment='#',
                 header=None,
                 converters={'motif': truncate_motif},
                 names=list_of_names)

This is suddenly much faster.

So my questions are:

  1. Why is reading this file so slow? Does it have to do with allocating memory?
  2. Why does the converter function help here? It has to execute an additional function for every row, yet it is still much faster...
  3. What else can be done?
1 Answer

  1. You didn't mention what "slow" means to you, but if:
    • your file contains ca. 600k rows, and
    • each row contains 1-2000 characters (let's say 1000 on average, so each line is ca. 1000 B),

then the file's size is roughly: 600 000 * 1000 B ~ 570 MB. That is a lot, especially if you don't have much RAM.
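As a quick sanity check of that estimate (pure arithmetic; the 1000 B average is an assumption from the question):

```python
rows = 600_000          # ~600k lines
bytes_per_row = 1000    # assumed average line length in bytes

total_bytes = rows * bytes_per_row
print(total_bytes / 2**20)  # ≈ 572 MiB, i.e. roughly 570 MB
```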

  2. It helps because suddenly the average size of one line is not 1000 B, but ca. 8-9 B (the new maximum is 8 characters plus the '~' marker). The parser does not keep the whole strings; it only checks their length and truncates them if needed. Sounds logical to me!
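To get a feel for the per-string savings, here is a rough illustration of string memory cost in CPython (exact byte counts are implementation-dependent; the point is the order-of-magnitude difference):

```python
import sys

long_motif = "A" * 1000        # a long motif, near the worst case
truncated = long_motif[:8] + "~"

# CPython stores a fixed overhead plus roughly one byte per ASCII character,
# so the truncated string is far smaller than the full one
print(sys.getsizeof(long_motif))
print(sys.getsizeof(truncated))
```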

  3. In such cases, when you have to load lots of data, it's good to use chunks.

for chunk in pd.read_csv(tgfile, chunksize=10000):
    process(chunk)

The chunksize parameter specifies how many rows each chunk contains. It's worth checking whether it improves performance in your case!
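A self-contained sketch of the chunked read, using a small in-memory stand-in for the file (the column names and data here are illustrative, not from the question):

```python
import io
import pandas as pd

# stand-in for tgfile: one comment line plus 25 tab-separated data rows
tsv = "# comment line\n" + "\n".join(f"a\tb\t{i}" for i in range(25))
names = ["c1", "c2", "c3"]

total_rows = 0
# with chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once
for chunk in pd.read_csv(io.StringIO(tsv), sep="\t", comment="#",
                         header=None, names=names, chunksize=10):
    total_rows += len(chunk)  # process(chunk) would go here

print(total_rows)  # 25 rows seen, in chunks of at most 10
```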

1 Comment

Thanks for your reply, I'll play around with that. RAM is not an issue, as I have 500GB available on this server.
