
A file contains ~10,000 lines with one entry per line. I need to process the file in batches (small chunks).

with open("data.txt", "r") as file:
    data = file.readlines()

total_count = len(data)  # roughly 10000 or fewer
max_batch = 50  # process 'data' in batches of at most 50 entries

for i in range(total_count):
    batch = data[i:i+50]  # first 50 entries
    result = process_data(batch)  # some time-consuming processing on 50 entries
    if result == True:
        # add to DB that these 50 entries were processed successfully!
    else:
        return 0  # quit the operation
        # later start again from the point it failed,
        # say the 51st or 2560th or 9950th entry

What should I change so that the next iteration picks entries 51 to 100, and so on?

If the operation fails and breaks off in between, I need to restart the loop from the batch where it failed (based on the DB entry).

I'm not able to work out the proper logic. Should I keep two lists, or something else?
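(A minimal sketch of the resume idea, assuming two hypothetical DB helpers that are not in the original code: get_last_done_batch(), which returns -1 when nothing has been processed yet, and mark_batch_done(i):)

max_batch = 50
start = get_last_done_batch() + 1  # hypothetical DB lookup; -1 means start fresh

for i in range(start, (len(data) + max_batch - 1) // max_batch):
    batch = data[i * max_batch:(i + 1) * max_batch]
    if not process_data(batch):
        break  # stop; the next run resumes from batch i
    mark_batch_done(i)  # hypothetical: record batch i as finished in the DB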

2 Comments
  • range(0, total_count, 50) Commented Jan 26, 2017 at 7:52
  • Your entry range needs to be [i*50:(i+1)*50]. Also, why wait for a batch to complete? You could make process_data into a thread; take a look at tutorialspoint.com/python/python_multithreading.htm Commented Jan 26, 2017 at 8:04

5 Answers

l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
batch_size = 3

for i in range(0, len(l), batch_size):
    print(l[i:i + batch_size])
    # more logic here

This prints:

[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
[10]

I think this is the most straightforward and readable approach. If you need to retry a certain batch, you can retry inside the loop (serially), or you can open a thread per batch, depending on the application.
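For instance, a thread-per-batch sketch with concurrent.futures (process_data here is just a stand-in for the real, time-consuming processing):

from concurrent.futures import ThreadPoolExecutor

def process_data(batch):  # stand-in for the real processing
    return sum(batch)

l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
batch_size = 3
batches = [l[i:i + batch_size] for i in range(0, len(l), batch_size)]

with ThreadPoolExecutor() as pool:
    # one task per batch; map returns results in batch order
    results = list(pool.map(process_data, batches))

print(results)  # [6, 15, 24, 10]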




You are close.

chunks = (total_count - 1) // 50 + 1  # ceiling division: number of 50-entry batches
for i in range(chunks):
    batch = data[i*50:(i+1)*50]  # the i-th batch of up to 50 entries
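The first line is an integer ceiling division, i.e. the number of 50-entry batches needed; a quick check against math.ceil with illustrative counts:

import math

for total_count in (99, 100, 101):
    assert (total_count - 1) // 50 + 1 == math.ceil(total_count / 50)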

2 Comments

  • This seems incorrect if you do not want an empty batch when total_count % 50 == 0. The number of chunks should be (total_count - 1) // 50 + 1.
  • @AndersTornkvist: nice catch! Edited.

With Python 3.12 you can use itertools.batched (see the documentation):

import itertools

for batch in itertools.batched(data, 50):
    result = process_data(batch)  # some time-consuming processing on up to 50 entries
    if result:
        pass  # add to DB that these entries were processed successfully!
    else:
        break  # quit the operation
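On versions before 3.12, the equivalent recipe from the itertools documentation can be used instead (this sketch needs Python 3.8+ for the walrus operator):

from itertools import islice

def batched(iterable, n):
    # yield successive n-sized tuples; the last one may be shorter
    it = iter(iterable)
    while batch := tuple(islice(it, n)):
        yield batch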

1 Comment

This is the one to use for modern Python. Thank you!
def chunk_list(datas, chunksize):
    """Split a list into chunks.

    Params:
        datas     (list): the data to split into chunks
        chunksize (int) : maximum number of items in each chunk

    Yields:
        list: successive chunks of the input list
    """

    for i in range(0, len(datas), chunksize):
        yield datas[i:i + chunksize]

ref: https://www.codegrepper.com/code-examples/python/python+function+to+split+lists+into+batches
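Usage with illustrative values:

for chunk in chunk_list([1, 2, 3, 4, 5, 6, 7], 3):
    print(chunk)  # [1, 2, 3], then [4, 5, 6], then [7]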



I'm a big fan of funcy. Its chunks function will break your list into chunks for you: https://funcy.readthedocs.io/en/stable/seqs.html#chunks
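A usage sketch, assuming funcy is installed (pip install funcy); chunks(n, seq) yields the sequence in pieces of at most n items:

from funcy import chunks

for batch in chunks(3, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]):
    print(batch)
# [1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
# [10]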

