I have a file that uses \x01 as its line terminator. That is, the line terminator is NOT a newline but the byte value 001, whose ASCII representation is ^A.

I want to split the file into pieces of 10 MB each. Here is what I came up with:

size = 10 * 1000 * 1000  # 10 MB
i = 0
with open("in-file", "rb") as ifile:
    data = ifile.read(size)
    while data:
        # write each fixed-size block to its own numbered file
        with open("output%d.txt" % i, "wb") as ofile:
            ofile.write(data)
        data = ifile.read(size)
        i += 1

However, this produces files that are broken at arbitrary places. I want each file to end only at the byte value 001, with the next read resuming from the following byte.

  • is the byte just \x01? Commented Aug 25, 2017 at 19:26
  • @JoranBeasley yes Commented Aug 25, 2017 at 19:32

1 Answer

If the terminator is just a single byte, you can do something like this:

from itertools import takewhile

def read_line(f_object, terminal_byte):  # stops at the terminator (consumed, not returned) or at EOF
    return b"".join(takewhile(lambda b: b != terminal_byte, iter(lambda: f_object.read(1), b"")))
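The two-argument form of iter used above calls the function repeatedly until it returns the sentinel value. A small sketch of that behaviour, using an in-memory io.BytesIO as a stand-in for a file opened in "rb" mode (the sample data is made up for illustration):

```python
import io

# In-memory stand-in for a file opened in "rb" mode (illustrative data only).
f = io.BytesIO(b"abc\x01de\x01")

# iter(callable, sentinel) keeps calling the callable until it returns the sentinel.
first = b"".join(iter(lambda: f.read(1), b"\x01"))
second = b"".join(iter(lambda: f.read(1), b"\x01"))

print(first)      # b'abc'
print(second)     # b'de'
print(f.read(1))  # b'' (end of file)
```

Note that at end of file read(1) returns an empty bytes object, which never equals the terminator, so any loop built this way needs some guard against EOF.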

Then make a helper function that reads all the lines in a file:

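def read_lines(f_object,terminal_byte):
    # read_line returns an empty result at EOF, which ends the loop
    # (note that an empty record between two terminators also ends it)
    tmp = read_line(f_object,terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object,terminal_byte)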

Then make a function that groups the lines into chunks:

def make_chunks(f_object,terminal_byte,max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object,terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        # a chunk is emitted once it exceeds max_size, so it can overshoot by up to one line
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    # emit whatever is left after the last full chunk
    if current_chunk:
        yield b"".join(current_chunk)

Then just do something like this:

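with open("my_binary.dat","rb") as f_in:
    # the file is opened in binary mode, so the terminator is given as bytes; the size is roughly 10 MB
    for i,chunk in enumerate(make_chunks(f_in,b"\x01",1024*1000*10)):
        with open("out%d.dat"%i,"wb") as f_out:
            f_out.write(chunk)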

There might be a way to do this with a library (or even an awesome built-in), but I'm not aware of any offhand.
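One option that avoids reading a byte at a time is to read larger blocks and cut at the first terminator once the current output file is big enough, which also keeps the terminator at the end of each file. This is only a minimal sketch, assuming a single-byte terminator; the function name split_on_terminator and the block_size parameter are my own choices for illustration, not part of any library:

```python
def split_on_terminator(in_path, out_pattern, terminator=b"\x01",
                        max_size=10 * 1000 * 1000, block_size=64 * 1024):
    """Split in_path into numbered files that each end on a terminator byte.

    Hypothetical helper: an output file is closed at the first terminator
    that brings its size to at least max_size. Returns the number of files.
    Assumes the terminator is a single byte.
    """
    index = 0       # number of output files opened so far
    written = 0     # bytes written to the current output file
    out = None      # current output file, opened lazily
    with open(in_path, "rb") as f_in:
        # read() returns b"" at EOF, which stops the iteration
        for block in iter(lambda: f_in.read(block_size), b""):
            while block:
                if out is None:
                    out = open(out_pattern % index, "wb")
                    index += 1
                    written = 0
                if written + len(block) < max_size:
                    # the file cannot reach max_size within this block
                    out.write(block)
                    written += len(block)
                    break
                # first position whose terminator would make the file >= max_size
                start = max(max_size - written - 1, 0)
                cut = block.find(terminator, start)
                if cut == -1:
                    # no suitable terminator yet, keep filling this file
                    out.write(block)
                    written += len(block)
                    break
                # write up to and including the terminator, then start a new file
                out.write(block[:cut + 1])
                out.close()
                out = None
                block = block[cut + 1:]
    if out is not None:
        out.close()
    return index
```

A call such as split_on_terminator("my_binary.dat", "out%d.dat") would then write out0.dat, out1.dat, and so on. The last file ends on a terminator only if the input itself does.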

6 Comments

It doesn't seem to split on the terminal_byte. The terminal byte I used is bytes(chr(1)).
I just noticed that the terminal byte is not written to the output file. I want to join on "\x01".
I changed "".join(iter(lambda:f_object.read(1),terminal_byte)) to "\x01".join(iter(lambda:f_object.read(1),terminal_byte)) and the yield to "\x01".join(current_chunk), but that is not working.
What does "that is not working" mean? You probably just want "\x01".join(current_chunk)+"\x01".
That did not work. However, this worked: def read_line(f_object,terminal_byte): return ''.join(iter(lambda:f_object.read(1),terminal_byte)) + "\x01"