I have a CSV file (around 750 MB in size). I have to split it into smaller CSV files, each no more than 30 MB.
c1,c2,c3,c4
1,a,1,4
2,a,1,4
3,b,1,4
4,b,1,4
5,b,1,4
6,c,1,4
The constraint is that the same c2 value cannot appear in different files
(e.g. you cannot have half of the b rows in one file and the other half in another file).
If the rows for a single c2 value are themselves more than 30 MB, then write the data for that c2 into its own file.
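To make the constraint concrete, a valid split of the sample above could look like this (the sample is of course far below the size limit):

splitter_1.csv
c1,c2,c3,c4
1,a,1,4
2,a,1,4

splitter_2.csv
c1,c2,c3,c4
3,b,1,4
4,b,1,4
5,b,1,4
6,c,1,4

The three b rows may go in either file, but they must all go in the same one.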
I used pandas to do this; here is my code:
import pandas as pd

max_size = 30 * 1000000  # ~30 MB per output file

df = pd.read_csv("data.csv", low_memory=False)
unique_ac_id = pd.unique(df["c2"])

counter = 1
df_arr = []     # groups accumulated for the current output file
total_size = 0  # estimated in-memory size of the accumulated groups

for ac_id in unique_ac_id:
    df_cur = df[df["c2"] == ac_id]
    # note: memory_usage approximates in-memory size, which is not
    # the same as the on-disk CSV size
    size = df_cur.memory_usage(index=False, deep=True).sum()
    if size > max_size:
        print(f'{ac_id} size is more than max size allowed')
    # flush the current batch before this group would push it past the limit
    if df_arr and total_size + size > max_size:
        pd.concat(df_arr).to_csv(f'out/splitter_{counter}.csv', index=False)
        counter += 1
        df_arr.clear()
        total_size = 0
    df_arr.append(df_cur)
    total_size += size

# write whatever is left over
if df_arr:
    pd.concat(df_arr).to_csv(f'out/splitter_{counter}.csv', index=False)
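For comparison, here is a sketch of the same logic written with groupby, which builds each c2 group in a single pass instead of rescanning the whole DataFrame for every unique value. It is only a sketch and still assumes the whole file fits in memory, like my code above:

import pandas as pd

max_size = 30 * 1000000  # ~30 MB per output file

df = pd.read_csv("data.csv", low_memory=False)

counter = 1
batch = []      # groups accumulated for the current output file
batch_size = 0  # estimated in-memory size of the batch

# sort=False keeps groups in their order of first appearance
for ac_id, group in df.groupby("c2", sort=False):
    size = group.memory_usage(index=False, deep=True).sum()
    # flush the batch before this group would push it past the limit
    if batch and batch_size + size > max_size:
        pd.concat(batch).to_csv(f'out/splitter_{counter}.csv', index=False)
        counter += 1
        batch.clear()
        batch_size = 0
    batch.append(group)
    batch_size += size

if batch:
    pd.concat(batch).to_csv(f'out/splitter_{counter}.csv', index=False)

An oversized c2 group naturally ends up alone in its own file here, because the batch is flushed both before it is added and again before the next group is added.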
Is there a better way of doing this?