
I have a CSV (around 750 MB in size). I have to split it into smaller CSVs, each no more than 30 MB.

c1,c2,c3,c4
1,a,1,4
2,a,1,4
3,b,1,4
4,b,1,4
5,b,1,4
6,c,1,4

The constraint is that the same c2 value cannot appear in different files (e.g. you cannot have half of the b rows in one file and the other half in another file). If the data for a single c2 value is itself more than 30 MB, then print the data associated with that c2 into its own file.

I used pandas to do this; here is my code:

import pandas as pd

max_size = 30 * 1000000  # 30 MB
df = pd.read_csv("data.csv", low_memory=False)
unique_ac_id = pd.unique(df.c2)

counter = 1
df_arr = []
total_size = 0

for ac_id in unique_ac_id:
    df_cur = df[df.c2 == ac_id]
    size = df_cur.memory_usage(index=False, deep=True).sum()
    if size > max_size:
        print(f'{ac_id} size is more than max size allowed')

    # flush the accumulated groups before this one would push the file past the limit
    if df_arr and total_size + size > max_size:
        pd.concat(df_arr).to_csv(f'out/splitter_{counter}.csv', index=False)
        counter += 1
        df_arr.clear()
        total_size = 0

    df_arr.append(df_cur)
    total_size += size

if len(df_arr) > 0:
    pd.concat(df_arr).to_csv(f'out/splitter_{counter}.csv', index=False)

Is there a better way to do this?

  • The constraint is infeasible. What if half of your c2 values are a? You wouldn't be able to split them then. Commented Aug 16, 2018 at 11:57
  • It's OK to pull all of those into one file; in that case the 30 MB limit need not be considered. In my data this case rarely happens. Commented Aug 16, 2018 at 13:07

2 Answers


You can easily split that CSV into equal-size chunks:

import pandas as pd

for i, chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=100)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
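A fixed chunksize of 100 rows won't necessarily land near the 30 MB target, though. As a rough sketch (reusing the question's data.csv and 30 MB limit; note this still ignores the c2 constraint), you could estimate the average on-disk row size first and derive the row count from it:

import os
import pandas as pd

path = 'data.csv'            # path from the question
target_bytes = 30 * 1000000  # 30 MB target from the question

# Estimate average bytes per row by counting lines once.
with open(path) as f:
    n_rows = sum(1 for _ in f) - 1  # subtract the header line
bytes_per_row = os.path.getsize(path) / n_rows
rows_per_chunk = max(1, int(target_bytes / bytes_per_row))

for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_chunk)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)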



I guess you could use the csv module...?

The syntax is pretty straightforward:

>>> import csv
>>> with open('eggs.csv', newline='') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print(', '.join(row))
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

Using this approach, I'd read about 30 MB at a time and spool the contents out to another CSV. Since you have each row's contents in row, you can work out the per-row size and determine how many rows make up roughly 30 MB, so hopefully this will get you started.

Also, given the constraint about c2, you might end up keeping several CSVs open so that each CSV contains its respective c2 grouping. Each row is a list of fields; in the example you gave, c2 is the second element. See the sketch below.
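For what it's worth, here is a minimal sketch of that idea, assuming (as in the sample data) that rows with the same c2 are contiguous and that no field needs quoting; data.csv, the out/ directory, and the 30 MB limit are taken from the question:

import csv

max_size = 30 * 1000000  # 30 MB limit from the question

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = ','.join(next(reader)) + '\n'

    counter = 1
    out = open('out/splitter_{}.csv'.format(counter), 'w')
    out.write(header)
    written = len(header)
    prev_c2 = None

    for row in reader:
        line = ','.join(row) + '\n'  # assumes no field needs quoting
        # Only roll over to a new file at a c2 boundary, so one c2 value
        # never straddles two files; an oversized c2 just keeps growing
        # its current file, matching the question's exception.
        if prev_c2 is not None and row[1] != prev_c2 and written + len(line) > max_size:
            out.close()
            counter += 1
            out = open('out/splitter_{}.csv'.format(counter), 'w')
            out.write(header)
            written = len(header)
        out.write(line)
        written += len(line)
        prev_c2 = row[1]

    out.close()

Because the groups are contiguous, this keeps only one output file open at a time; if they weren't, you'd need to sort the file first or keep one open file per pending c2.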

