
I have a CSV (around 750 MB in size). I have to split it into smaller CSVs, each no more than 30 MB.

c1,c2,c3,c4
1,a,1,4
2,a,1,4
3,b,1,4
4,b,1,4
5,b,1,4
6,c,1,4

The constraint is that the same c2 value cannot appear in different files (e.g. you cannot have half of the b rows in one file and the other half in another file). If the data for a single c2 value is itself more than 30 MB, then print the data associated with that c2 into its own file.

I used pandas to do this; here is my code:

import pandas as pd

max_size = 30 * 1000000  # 30 MB
df = pd.read_csv("data.csv", low_memory=False)
unique_ac_id = pd.unique(df.c2)

counter = 1
df_arr = []
total_size = 0

for ac_id in unique_ac_id:
    df_cur = df[df.c2 == ac_id]
    size = df_cur.memory_usage(index=False, deep=True).sum()
    if size > max_size:
        print(f'{ac_id} size is more than max size allowed')

    # flush the accumulated groups before this one would push the file past the limit
    if df_arr and total_size + size > max_size:
        pd.concat(df_arr).to_csv(f'out/splitter_{counter}.csv', index=False)
        counter += 1
        df_arr.clear()
        total_size = 0

    df_arr.append(df_cur)
    total_size += size

if len(df_arr) > 0:
    pd.concat(df_arr).to_csv(f'out/splitter_{counter}.csv', index=False)

Is there a better way to do this?

  • The constraint is infeasible. What if half of your c2 values are a? You wouldn't be able to split them then. Commented Aug 16, 2018 at 11:57
  • It's OK to pull all of those into one file; in that case the 30 MB limit need not be considered. In my data this case rarely happens. Commented Aug 16, 2018 at 13:07

2 Answers


You can easily split that CSV into equal-size chunks:

import pandas as pd

for i, chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=100)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)
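A fixed chunksize of 100 rows won't necessarily land near the 30 MB target, though. As a rough sketch (reusing the question's data.csv and 30 MB limit; note this still ignores the c2 constraint), you could estimate the average on-disk row size first and derive the row count from it:

import os
import pandas as pd

path = 'data.csv'            # path from the question
target_bytes = 30 * 1000000  # 30 MB target from the question

# Estimate average bytes per row by counting lines once.
with open(path) as f:
    n_rows = sum(1 for _ in f) - 1  # subtract the header line
bytes_per_row = os.path.getsize(path) / n_rows
rows_per_chunk = max(1, int(target_bytes / bytes_per_row))

for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_chunk)):
    chunk.to_csv('chunk{}.csv'.format(i), index=False)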



I guess you could use the csv module...?

The syntax is pretty straightforward:

>>> import csv
>>> with open('eggs.csv', newline='') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print(', '.join(row))
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam

Using this approach, I'd read about 30 MB at a time and spool the contents out to another CSV. Since you have each row's contents in row, you can work out the per-row size and determine how many rows make up roughly 30 MB, so hopefully this will get you started.

Also, given the constraint about c2, you might end up keeping several CSVs open so that each CSV contains its respective c2 grouping. Each row is a list of fields; in the example you gave, c2 is the second element. See the sketch below.
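For what it's worth, here is a minimal sketch of that idea, assuming (as in the sample data) that rows with the same c2 are contiguous and that no field needs quoting; data.csv, the out/ directory, and the 30 MB limit are taken from the question:

import csv

max_size = 30 * 1000000  # 30 MB limit from the question

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = ','.join(next(reader)) + '\n'

    counter = 1
    out = open('out/splitter_{}.csv'.format(counter), 'w')
    out.write(header)
    written = len(header)
    prev_c2 = None

    for row in reader:
        line = ','.join(row) + '\n'  # assumes no field needs quoting
        # Only roll over to a new file at a c2 boundary, so one c2 value
        # never straddles two files; an oversized c2 just keeps growing
        # its current file, matching the question's exception.
        if prev_c2 is not None and row[1] != prev_c2 and written + len(line) > max_size:
            out.close()
            counter += 1
            out = open('out/splitter_{}.csv'.format(counter), 'w')
            out.write(header)
            written = len(header)
        out.write(line)
        written += len(line)
        prev_c2 = row[1]

    out.close()

Because the groups are contiguous, this keeps only one output file open at a time; if they weren't, you'd need to sort the file first or keep one open file per pending c2.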

