I have the following code snippet that reads a CSV into a dataframe and writes out key-value pairs to a file in a Redis protocol-compliant fashion, e.g. SET key1 value1. The code is piecemeal and I have tried to use multiprocessing, though I am not sure it yields any performance gain.
The CSV has about 6 million lines and is read into a dataframe pretty quickly (under 2 minutes). The output file has 12 million lines (2 lines per line of the input file), and writing it takes about 50 minutes to complete. Can any part of my code be optimized/changed to make this run faster? Once the file is complete, loading it into Redis takes less than 90 seconds; the bottleneck really is in writing to the file. I will have several such files to write, and spending 50-60 minutes per file is really not ideal. This particular dataset has 6 million rows and 10 columns, mostly strings with a few float columns; the strings form the Redis keys and the floats the Redis values in each key-value pair. Other datasets will be similarly sized, if not bigger (in both rows and columns).
I was looking into loading all the strings I generate into a dataframe and then using the to_csv() function to dump it to a file, but I'm not sure how well that would perform.
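For reference, a rough sketch of that to_csv() idea (the column names are placeholders for my real ones, and it assumes the keys and values contain no commas, since to_csv would quote such lines and corrupt the command):

    import pandas as pd

    df = pd.read_csv('/path/to/file.csv')

    # Build every 'SET key value' command as one vectorized string column.
    key = df['col1'].astype(str) + df['col4'].astype(str) + df['col5'].astype(str)
    lines = 'SET ' + key + ' ' + df['col_n+1'].astype(str)

    # Dump the Series straight to disk, one command per line.
    lines.to_csv('output_file', index=False, header=False)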
    import pandas as pd
    from multiprocessing import Process

    filepath = '/path/to/file.csv'

    def df_to_file():
        df = pd.read_csv(filepath)
        f = open('output_file', 'w')
        # Walk the frame row by row; each populated col1/col2/col3 cell
        # yields two SET commands built from that cell plus the suffix columns.
        for i in range(len(df.index)):
            if df['col1'].iloc[i] != '':
                key1 = str(df['col1'].iloc[i]) + str(df['col4'].iloc[i]) + str(df['col5'].iloc[i]) + ... + str(df['col_n'].iloc[i])
                val1 = df['col_n+1'].iloc[i]
                key1a = str(df['col1'].iloc[i]) + str(df['col4'].iloc[i]) + str(df['col5'].iloc[i]) + ... + str(df['col_n'].iloc[i])
                val1a = df['col_n+2'].iloc[i]
                print('SET {0} {1}\nSET {2} {3}'.format(key1, val1, key1a, val1a), file=f)
            if df['col2'].iloc[i] != '':
                key1 = str(df['col2'].iloc[i]) + str(df['col4'].iloc[i]) + str(df['col5'].iloc[i]) + ... + str(df['col_n'].iloc[i])
                val1 = df['col_n+1'].iloc[i]
                key1a = str(df['col2'].iloc[i]) + str(df['col4'].iloc[i]) + str(df['col5'].iloc[i]) + ... + str(df['col_n'].iloc[i])
                val1a = df['col_n+2'].iloc[i]
                print('SET {0} {1}\nSET {2} {3}'.format(key1, val1, key1a, val1a), file=f)
            if df['col3'].iloc[i] != '':
                key1 = str(df['col3'].iloc[i]) + str(df['col4'].iloc[i]) + str(df['col5'].iloc[i]) + ... + str(df['col_n'].iloc[i])
                val1 = df['col_n+1'].iloc[i]
                key1a = str(df['col3'].iloc[i]) + str(df['col4'].iloc[i]) + str(df['col5'].iloc[i]) + ... + str(df['col_n'].iloc[i])
                val1a = df['col_n+2'].iloc[i]
                print('SET {0} {1}\nSET {2} {3}'.format(key1, val1, key1a, val1a), file=f)
        f.close()
    if __name__ == '__main__':
        p = Process(target=df_to_file)
        p.start()
        p.join()
Comments:

- Using iloc[i] in a loop to extract a single row will absolutely kill performance. Unless you give a small representative example of your output data-frame, it's hard to say more.
- What does df look like?
- How do you generate key1 and val1 from the DataFrame?
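To make the iloc point concrete, here is a minimal sketch of building the same keys with whole-column operations instead of per-row lookups (col1/col4/col5/col_n+1 are stand-ins for the real column names, and empty cells are assumed to read back as '' as in the loop above):

    import pandas as pd

    df = pd.read_csv('/path/to/file.csv')

    # The suffix shared by every key; extend with the remaining columns as needed.
    suffix = df['col4'].astype(str) + df['col5'].astype(str)

    # Vectorized version of the loop's guard: if df['col1'].iloc[i] != ''
    mask = df['col1'] != ''

    # One pass over whole columns instead of one .iloc call per row per column.
    keys = df.loc[mask, 'col1'].astype(str) + suffix[mask]
    vals = df.loc[mask, 'col_n+1'].astype(str)

    with open('output_file', 'w') as f:
        f.write('\n'.join('SET ' + keys + ' ' + vals) + '\n')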