
Here is the code I have to extract blocks of text from a file. Each block starts at a line beginning with "Start Text" and runs until the next "Start Text" line.

 with open('temp.txt', "r") as f:
     buff = []
     i = 1
     for line in f:
         if line.strip():  # skips the empty lines
             buff.append(line)
         if line.startswith("Start Text"):
             output = open('file' + '%d.txt' % i, 'w')
             output.write(''.join(buff))
             output.close()
             i += 1
             buff = []  # buffer reset

INPUT: "temp.txt" has the following structure:

Start Text - ABCD  
line1  
line2  
line3  
Start Text - EFG  
line4  
Start Text - P3456  
line5  
line6  

DESIRED OUTPUT: I am trying to create the text files below, each containing one extracted block of text.

file1.txt

Start Text - ABCD  
line1  
line2  
line3 

file2.txt

Start Text - EFG  
line4 

file3.txt

Start Text - P3456  
line5  
line6

UNDESIRED OUTPUT (What the code produces)

file1.txt

Start Text - ABCD   

file2.txt

line1  
line2  
line3  
Start Text - EFG

file3.txt

line4 
Start Text - P3456  

Here is the issue I am facing. The code creates three files, but the “Start Text” lines do not end up at the top of their respective text blocks. I am not sure what I am missing. I would appreciate any pointers.

  • For one thing, the code is looking for "Sample ID", but the actual file has "Start Text". Commented Feb 4, 2022 at 16:04
  • Thanks for the correction Commented Feb 4, 2022 at 16:08
  • It would help if you updated the question to post the exact results you're seeing in the output files, and explain why those results aren't what you wanted. Commented Feb 4, 2022 at 16:14
  • It will create the first file with just the header and then the header will be at the bottom. Commented Feb 4, 2022 at 16:15

4 Answers


When the code sees "Start Text" in a line, it writes that line and all the previous lines to the output file.

This explains why the first output file contains only the header -- that is the first line in the input file, so obviously there aren't any previous lines.
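To make this concrete, here is a minimal in-memory sketch of the loop from the question. The function name `trace_original` is hypothetical, and it collects the would-be file contents in a list instead of writing files; the append-then-flush logic is otherwise the same as in the question.

```python
def trace_original(lines):
    """Mimic the question's loop; return what each output file would contain."""
    files = []
    buff = []
    for line in lines:
        if line.strip():
            buff.append(line)  # the header is appended first...
        if line.startswith("Start Text"):
            # ...and then flushed together with the lines that came before it
            files.append("".join(buff))
            buff = []
    return files, buff  # buff holds the lines that never reach any file


sample = [
    "Start Text - ABCD\n", "line1\n", "line2\n", "line3\n",
    "Start Text - EFG\n", "line4\n",
    "Start Text - P3456\n", "line5\n", "line6\n",
]
files, leftover = trace_original(sample)
for n, content in enumerate(files, start=1):
    print(f"file{n}.txt: {content!r}")
print(f"never written: {leftover!r}")
```

Each header lands at the bottom of the preceding block, and the lines after the last header stay in the buffer without being written anywhere.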

It seems like what you really want is for the header and the following lines to be written.

I've updated your code to not write a file after seeing the very first header, and also to write a file after the input file is exhausted.

buff = []
i = 1

with open('temp.txt', "r") as f:
    for line in f:
        if line.startswith("Start Text"):
            # Write a file only if buff isn't empty. (If it is
            # empty, this must be the very first header, so we
            # don't need to write an output file yet.)
            if buff:
                with open('file%d.txt' % i, 'w') as output:
                    output.write(''.join(buff))
                i += 1
                buff = []  # buffer reset
        if line.strip():  # skip empty lines
            buff.append(line)

# write the final section
if buff:
    with open('file%d.txt' % i, 'w') as output:
        output.write(''.join(buff))

3 Comments

I edited the question with the output I am getting. You are correct: I want the header and the following lines until I encounter another header.
@DaveC See my updated answer with code.
This works. Thank you for your suggestions. I will upvote and accept the answer once I have the reputation.

You're almost there. Notice what happens when the startswith() check matches: you write out the buffer and then reset it. When control returns to the loop, the line that triggered the if statement has not been stored in the new buffer, so it is lost. Try adding it to the new buffer for the next round of lines.

             ...
             buff = []  # buffer reset
             buff.append(line) # add 'Start Text' line to next buffer

Note that your code currently never writes out the final block of text. Consider writing out the last buffer as well (i.e., once no lines are left).

[EDIT after question edit] As the other answer explains, the startswith() check triggers a write to file as soon as the header line is found, but by then that line has already been added to the buffer. Try switching the statements: first detect the header with startswith(), then write out the buffer if it matched (and only if the buffer is not empty!), and only then add the line to the buffer. (The note above still stands.)
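The reordering described here could look roughly like the sketch below. The function name `split_into_blocks` and its `marker` parameter are hypothetical; it works on any iterable of lines and returns the blocks as strings, so the file writing is left to the caller.

```python
def split_into_blocks(lines, marker="Start Text"):
    """Group non-empty lines into blocks, each beginning at a marker line."""
    blocks = []
    buff = []
    for line in lines:
        # 1. Detect the header first, before the line enters the buffer.
        if line.startswith(marker) and buff:
            blocks.append("".join(buff))
            buff = []
        # 2. Only now store the line, so a header opens its own block.
        if line.strip():
            buff.append(line)
    # 3. Flush the last block once no lines are left.
    if buff:
        blocks.append("".join(buff))
    return blocks


sample = [
    "Start Text - ABCD\n", "line1\n", "line2\n", "line3\n",
    "Start Text - EFG\n", "line4\n",
    "Start Text - P3456\n", "line5\n", "line6\n",
]
for n, block in enumerate(split_into_blocks(sample), start=1):
    print(f"file{n}.txt: {block!r}")
```

Passing an open file object instead of `sample` and writing each returned block to `file{n}.txt` would give the three files from the question. Any lines that precede the first header end up in the first block.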

1 Comment

Let me give this a try. Adding buff.append(line) made it a little better, but it is still not what I needed.
A generator that yields one block per "Start Text" header also works:

def parse_file(fname):
    with open(fname, "r") as f:
        buff = []
        for line in f:
            if line.strip():   # skips the empty lines
                if line.startswith("Start Text") and buff:
                    yield ''.join(buff)
                    buff = []
                buff.append(line)
        # yield the final block once the input is exhausted
        if buff:
            yield ''.join(buff)


for idx, data in enumerate(parse_file('temp.txt'), start=1):
    with open(f'file{idx}.txt', 'w') as f:
        f.write(data)

1 Comment

This works as well. I will upvote once I have enough reputation to cast a vote.

I don't think you need to build a buffer. You can just process line by line as you iterate over the input file.

class MyTempFile():
    def __init__(self):
        self.fd = None
        self.newfile = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.closefd()
        self.newfile = None

    def closefd(self):
        if self.fd is not None:
            self.fd.close()
            self.fd = None

    def newfile_impl(self):
        i = 0
        while True:
            self.closefd()
            i += 1
            self.fd = open(f'temp{i}.txt', 'w')
            yield

    def write(self, data):
        if self.fd is not None and data.strip():
            self.fd.write(data)

    def next_file(self):
        if self.newfile is None:
            self.newfile = self.newfile_impl()
        next(self.newfile)


with MyTempFile() as mtf:
    with open('temp.txt') as infile:
        for line in infile:
            if line.startswith('Start Text'):
                mtf.next_file()
            mtf.write(line)
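For comparison, the same buffer-free idea can be sketched as a single function without a class. This is a hypothetical variant rather than the code above: the name `split_file` and the `out_dir`, `marker`, and `prefix` parameters are assumptions, and it names its outputs `file1.txt`, `file2.txt`, and so on, to match the question.

```python
from pathlib import Path


def split_file(src, out_dir=".", marker="Start Text", prefix="file"):
    """Stream src line by line, starting a new output file at each marker line."""
    out = None   # currently open output file, if any
    paths = []   # paths of the files created so far
    try:
        with open(src) as infile:
            for line in infile:
                if line.startswith(marker):
                    # Close the previous block's file and open the next one.
                    if out is not None:
                        out.close()
                    path = Path(out_dir) / f"{prefix}{len(paths) + 1}.txt"
                    out = open(path, "w")
                    paths.append(path)
                # Skip blank lines and anything before the first marker.
                if out is not None and line.strip():
                    out.write(line)
    finally:
        # Make sure the last file is closed even if an error occurs.
        if out is not None:
            out.close()
    return paths
```

Calling `split_file('temp.txt')` would write one file per header into the current directory and return their paths. As with the class above, lines that appear before the first header are dropped rather than written anywhere.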

3 Comments

This works, but it creates only two files.
I have created a file with the data exactly as posted in your question and I get three files (as expected). Bear in mind that the test is case sensitive, and because I'm using startswith(), there mustn't be any whitespace preceding 'Start Text'.
Now it works. I tend to avoid classes when I can, but this code looks beautiful. I will upvote upon building my reputation.
