
I have the following code snippet that reads a CSV into a dataframe and writes out key-value pairs to a file in a Redis protocol-compliant fashion, i.e. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure of its performance gains.

The CSV has about 6 million lines and is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file) and takes about 50 minutes to write. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it into Redis takes less than 90 seconds. The bottleneck really is in writing to the file. I will have several such files to write, and spending 50-60 minutes per file is really not ideal. This particular dataset has 6 million rows and 10 columns, mostly comprised of strings with a few float columns. The Redis keys are the strings and the float values are the Redis values in the key-value pair. Other datasets will be similarly sized, if not bigger (both with respect to rows and columns).

I was looking into loading all the strings I generate into a dataframe and then using the to_csv() function to dump it to a file, but I'm not sure how it will perform.
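
Roughly, what I mean is something like this (a sketch with placeholder data; the real commands would be generated from the dataframe):

import pandas as pd

lines = ['SET key1 1.0', 'SET key2 2.0']   # placeholder: the generated SET commands
pd.Series(lines).to_csv('output_file', index=False, header=False)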

from __future__ import print_function  # needed for print(..., file=f) on Python 2.7
import pandas as pd
from multiprocessing import Process

filepath = '/path/to/file.csv'

def df_to_file():
    df = pd.read_csv(filepath)
    f = open('output_file', 'w')
    for i in range(len(df.index)):
        if df['col1'].iloc[i] != '':
            key1 = str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]

            key1a = str(df['col1'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]

            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)

        if df['col2'].iloc[i] != '':
            key1 = str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]

            key1a = str(df['col2'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]

            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if df['col3'].iloc[i] != '':
            key1 = str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1 = df['col_n+1'].iloc[i]

            key1a = str(df['col3'].iloc[i])+str(df['col4'].iloc[i])+str(df['col5'].iloc[i])+...+str(df['col_n'].iloc[i])
            val1a = df['col_n+2'].iloc[i]

            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
    f.close()

p = Process(target = df_to_file)
p.start()
p.join() 
  • Yes, don't use a loop; at the very least, don't use iloc[i] in a loop to extract a single row, as this will absolutely kill performance. Unless you give a small representative example of your output data-frame, it's hard to say more. Commented Feb 10, 2018 at 1:19
  • btw, what version of Python are you using? Commented Feb 10, 2018 at 1:21
  • @juanpa.arrivillaga - there is no output data frame. The input is a data frame and the output is the file. The numbers mentioned here are real as well and not hypothesized or extrapolated. Using Python 2.7. Commented Feb 10, 2018 at 1:21
  • No, I meant give an example of the data-frame that you are trying to output to a file. I.e., what does df look like? Commented Feb 10, 2018 at 1:22
  • Can you show how you select key1 and val1 from DataFrame? Commented Feb 10, 2018 at 1:22

1 Answer


Using a construction like df['col1'].iloc[...] to loop over individual rows is going to be slow: the iloc- and loc-based selectors are meant for selecting across entire data-frames, and they do a lot of index-alignment work that carries high overhead when done once per row. Instead, simply using df.itertuples() to iterate over rows will be significantly faster.

def df_to_file():
    df = pd.read_csv(filepath)
    f = open('output_file', 'wb') # writing in binary mode should be faster, if it is possible without unicode problems
    for row in df.itertuples():
        if row.col1:
            key1, val1 = string1, string2      # placeholders: build the key/value from the row, as in the question
            key1a, val1a = string1a, string2a
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if row.col2:
            key1, val1 = string1, string2
            key1a, val1a = string1a, string2a
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
        if row.col3:
            key1, val1 = string1, string2
            key1a, val1a = string1a, string2a
            print('SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a), file = f)
    f.close()

This is perhaps the bare-minimum optimization you could make. If you described in more detail exactly what you are doing, perhaps a vectorized solution could be found.
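
For illustration, a fully vectorized version might look roughly like the sketch below. The column names (col1 through col5, val_col_1, val_col_2) are placeholders standing in for the real ones, which aren't shown in the question. The idea is to build whole columns of 'SET key value' strings with pandas string operations and write them out in one call instead of formatting row by row:

import pandas as pd

df = pd.read_csv(filepath, dtype=str).fillna('')   # read everything as strings
suffix = df['col4'] + df['col5']                   # concatenate the shared key columns once
chunks = []
for key_col in ('col1', 'col2', 'col3'):
    mask = df[key_col] != ''                       # skip rows where this key column is empty
    keys = df.loc[mask, key_col] + suffix[mask]
    chunks.append('SET ' + keys + ' ' + df.loc[mask, 'val_col_1'])
    chunks.append('SET ' + keys + ' ' + df.loc[mask, 'val_col_2'])
with open('output_file', 'w') as f:
    f.write('\n'.join(pd.concat(chunks)) + '\n')

Note that the output ordering differs from the row-by-row version (all commands derived from col1 come first), which shouldn't matter for a stream of SET commands.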

Also, don't use the above with multiprocessing: spawning a single extra Process just adds overhead, and multiple processes cannot safely write to the same file anyway.

Also, as written, 'SET {0} {1}\nSET {0} {1}'.format(key1, val1, key1a, val1a) prints the same SET command twice: only {0} and {1} are used, so key1a and val1a never appear (presumably 'SET {0} {1}\nSET {2} {3}' was intended). If those parameters really aren't changing, then simply do the string concatenation once outside the loop and re-use the whole string in the loop.

Edit: It seems you can't do that. However, given:

This particular dataset has 6 million rows and 10 columns, mostly comprised of strings with a few float columns. The Redis keys are the strings and the float values are the Redis values in the key-value pair.

Then simply key1 = ''.join((row.col1, row.col4, row.col5, ...)) (note that str.join takes a single iterable, hence the inner tuple). Don't use str and the + operator; this is horribly inefficient, doubly so since you imply those columns are already strings. If you must call str on all those columns, use map(str, ...).
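
For example (column names as in the question's pseudocode):

key1 = ''.join((row.col1, row.col4, row.col5))             # columns are already strings
key1 = ''.join(map(str, (row.col1, row.col4, row.col5)))   # if some columns still need converting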

Finally, if you really need to squeeze performance out, note that each row will be a namedtuple object, which is a tuple, and you can use integer-based indexing instead of attribute-based access, i.e. row[1] instead of row.col1 (note that row[0] will be the index, i.e. row.Index), which should be faster (and it will make a difference, since you are indexing into the tuple dozens of times per iteration and doing millions of iterations).
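
For instance, assuming col1, col4 and col5 happen to be the 1st, 4th and 5th columns of the frame (purely illustrative positions), the positional version would look like:

for row in df.itertuples():
    # row[0] is the index; data columns start at row[1]
    if row[1]:
        key1 = ''.join(map(str, (row[1], row[4], row[5])))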


5 Comments

I just edited the question. Please take a look and update your answer (if needed). Thanks!
@CodingInCircles please provide a concrete example. Anyway, yes, your string-concatenation is horribly inefficient. Use ''.join instead of str(x) + str(y) + ... + str(z)
Thank you! I will try the changes out and let you know how it goes.
Wow! Your suggestions helped and it sped up exponentially! 12 million records written in under 1 minute! Thank you so much!
I asked the same question on Code Review SE, and thought I'd get an answer faster there, but I got it here instead. If you write this answer there, I can mark it as the accepted answer. Here's the link: codereview.stackexchange.com/questions/187220/….
