Extract blocks of text that start with "Start Text" and run until the next "Start Text"

Question

Here is the code I have for extracting blocks of text from a file. Each block starts with a "Start Text" line and runs until the next "Start Text" line.

 with open('temp.txt', "r") as f:
     buff = []
     i = 1
     for line in f:
         if line.strip():  # skips the empty lines
             buff.append(line)
         if line.startswith("Start Text"):
             output = open('file' + '%d.txt' % i, 'w')
             output.write(''.join(buff))
             output.close()
             i += 1
             buff = []  # buffer reset

INPUT: "temp.txt" has the following structure:

Start Text - ABCD  
line1  
line2  
line3  
Start Text - EFG  
line4  
Start Text - P3456  
line5  
line6

DESIRED OUTPUT: I am trying to create the text files below, each holding one extracted block of text.

file1.txt

Start Text - ABCD  
line1  
line2  
line3

file2.txt

Start Text - EFG  
line4

file3.txt

Start Text - P3456  
line5  
line6

UNDESIRED OUTPUT (What the code produces)

file1.txt

Start Text - ABCD

file2.txt

Start Text - EFG  
line1 
line2 
line3

file3.txt

line4 
Start Text - P3456

Here is the issue I am facing. The code creates three files but does not write “Start Text” lines into their respective text blocks. I am not sure what I am missing. I will appreciate any pointers.

For one thing, the code is looking for "Sample ID", but the actual file has "Start Text".
– John Gordon, Commented Feb 4, 2022 at 16:04
It would help if you updated the question to post the exact results you're seeing in the output files, and explain why those results aren't what you wanted.
– John Gordon, Commented Feb 4, 2022 at 16:14
It will create the first file with just the header and then the header will be at the bottom.
– buran, Commented Feb 4, 2022 at 16:15

John Gordon · Accepted Answer · 2022-02-04 16:55:58Z

0

When the code sees "Start Text" in a line, it writes that line and all the previous lines to the output file.

This explains why the first output file contains only the header -- that is the first line in the input file, so obviously there aren't any previous lines.

It seems like what you really want is for the header and the following lines to be written.

I've updated your code to not write a file after seeing the very first header, and also to write a file after the input file is exhausted.

buff = []
i = 1

with open('temp.txt', "r") as f:
    for line in f:
        if line.startswith("Start Text"):
            # write a file only if buff isn't empty.  (if it is 
            # empty, this must be the very first header, so we
            # don't need to write an output file yet)
            if buff:
                output = open('file' + '%d.txt' % i, 'w')
                output.write(''.join(buff))
                output.close()
                i += 1
                buff = []  # buffer reset
        if line.strip():
            buff.append(line)

# write the final section
if buff:
    output = open('file' + '%d.txt' % i, 'w')
    output.write(''.join(buff))
    output.close()

edited Feb 4, 2022 at 16:55

answered Feb 4, 2022 at 16:20

John Gordon

33.8k reputation, 9 gold badges, 48 silver badges, 72 bronze badges


3 Comments

DaveC Over a year ago

I edited the question with the output I am getting. You are correct- I want the header and following lines until I encounter another header.

John Gordon Over a year ago

@DaveC See my updated answer with code.

DaveC Over a year ago

This works. Thank you for your suggestions. I will upvote again and accept the answer once I have the reputation.

davidverweij · Accepted Answer · 2022-02-04 16:19:42Z

0

You're almost there. Look at what happens when the startswith() check succeeds: you write out the buffer and then clear it. When control returns to the loop, the line that triggered this if statement has not been stored anywhere, so it is lost. Try adding it to the new buffer for the next round of lines.

             ...
             buff = []  # buffer reset
             buff.append(line) # add 'Start Text' line to next buffer

Note that your code currently never writes out the final block of text. Consider writing out the last buffer as well (i.e., when no lines are left).

[EDIT after question edit] As the other answer replies, the check for startswith() causes a write to file after the line is found. However, the line has already been added to the buffer. Try switching the statements, to first detect the startswith, then write everything out if it was the case (if the buffer is not empty!), then continue by adding the line to the buffer. (the note still stands)
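The reordering described above can be sketched as follows. This is only an illustration under the question's assumptions (input in temp.txt, outputs named file1.txt, file2.txt, ...); the function name `split_blocks` and its parameters are hypothetical and not part of the original code. Note that any non-empty lines appearing before the first "Start Text" line would end up in a file of their own.

```python
def split_blocks(src="temp.txt", marker="Start Text", prefix="file"):
    """Write each marker-delimited block of src to prefix1.txt, prefix2.txt, ..."""
    written = []  # names of the files created so far, in order
    buff = []     # non-empty lines of the block currently being collected

    def flush():
        # Write the buffered block (if any) to the next numbered file.
        if buff:
            name = f"{prefix}{len(written) + 1}.txt"
            with open(name, "w") as out:
                out.write("".join(buff))
            written.append(name)
            buff.clear()

    with open(src) as f:
        for line in f:
            if line.startswith(marker):
                flush()            # 1. a new header closes the previous block
            if line.strip():
                buff.append(line)  # 2. then the line joins the current block
    flush()                        # 3. the last block has no header after it
    return written
```

Calling `split_blocks()` with no arguments uses the file names from the question and returns the list of files it created.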

answered Feb 4, 2022 at 16:19

davidverweij

338 reputation, 2 silver badges, 15 bronze badges

1 Comment

DaveC Over a year ago

Let me give this a try. Adding buff.append(line) made it a little better, but it is still not what I needed.

buran · Accepted Answer · 2022-02-04 16:53:10Z

0

def parse_file(fname):
    with open(fname, "r") as f:
        buff = []
        for line in f:
            if line.strip():   # skips the empty lines
                if line.startswith("Start Text") and buff:
                    yield ''.join(buff)
                    buff = []
                buff.append(line)
        if buff:   # emit the final block, which has no header after it
            yield ''.join(buff)


for idx, data in enumerate(parse_file('sample.txt'), start=1):
    with open(f'file{idx}.txt', 'w') as f:
        f.write(data)

answered Feb 4, 2022 at 16:53

buran

14.4k13 gold badges45 silver badges76 bronze badges

1 Comment

DaveC Over a year ago

This works as well. I will upvote once I have enough reputation to cast a vote.

jackal · Accepted Answer · 2022-02-04 17:41:57Z

0

I don't think you need to build a buffer. You can just process line by line as you iterate over the input file.

class MyTempFile():
    def __init__(self):
        self.fd = None
        self.newfile = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.closefd()
        self.newfile = None

    def closefd(self):
        if self.fd is not None:
            self.fd.close()
            self.fd = None

    def newfile_impl(self):
        i = 0
        while True:
            self.closefd()
            i += 1
            self.fd = open(f'temp{i}.txt', 'w')
            yield

    def write(self, data):
        if self.fd is not None and data.strip():
            self.fd.write(data)

    def next_file(self):
        if self.newfile is None:
            self.newfile = self.newfile_impl()
        next(self.newfile)


with MyTempFile() as mtf:
    with open('temp.txt') as infile:
        for line in infile:
            if line.startswith('Start Text'):
                mtf.next_file()
            mtf.write(line)
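For readers who prefer to avoid a class, the same buffer-free idea can be sketched with a plain function. This is an illustration and not part of the original answer; `split_streaming` and its parameters are hypothetical names, and the default output prefix mirrors the temp{i}.txt naming used above.

```python
def split_streaming(src="temp.txt", marker="Start Text", prefix="temp"):
    """Stream src line by line, switching output files at each marker line."""
    out = None  # currently open output file, or None before the first marker
    count = 0   # number of output files opened so far
    try:
        with open(src) as infile:
            for line in infile:
                if line.startswith(marker):
                    if out is not None:
                        out.close()
                    count += 1
                    out = open(f"{prefix}{count}.txt", "w")
                # As in the class version, blank lines and anything that
                # appears before the first marker are dropped.
                if out is not None and line.strip():
                    out.write(line)
    finally:
        # Close the last output file even if reading fails part-way.
        if out is not None:
            out.close()
    return count
```

The function returns the number of files written, which makes it easy to check that every header in the input produced a file.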

edited Feb 4, 2022 at 17:41

answered Feb 4, 2022 at 16:51

jackal

29.1k reputation, 3 gold badges, 9 silver badges, 28 bronze badges

3 Comments

DaveC Over a year ago

This works, but it creates only two files.

jackal Over a year ago

I have created a file with the data exactly as posted in your question and I get three files (as expected). Bear in mind that the test is case sensitive and I'm using startswith() so there mustn't be any whitespace preceding 'Start Text'

DaveC Over a year ago

Now it works. I tend to avoid classes when I can, but this code looks beautiful. I will upvote upon building my reputation.
