parse blocks of text from text file using Python

Question

I am trying to parse some text files and need to extract blocks of text. Specifically, the lines that start with "1:" and 19 lines after the text. The "1:" does not start on the same row in each file and there is only one instance of "1:". I would prefer to save the block of text and export it to a separate file. In addition, I need to preserve the formatting of the text in the original file.

Needless to say I am new to Python. I generally work with R but these files are not really compatible with R and I have about 100 to process. Any information would be appreciated.

The code that I have so far is:

tmp = open(files[0],"r") 
lines = tmp.readlines()
tmp.close()

num = 0
a=0

for line in lines:
    num += 1    
    if "1:" in line:
      a = num 
      break

Here a = num is the line number where the block of text I want starts. I then want to save the next 19 lines of text to another file, but can't figure out how to do this. Any help would be appreciated.

If all you need to do is extract the lines, you can do that without writing a whole new program: egrep -A 19 "^1:" myfile.txt
– Robᵩ, Commented Jul 30, 2014 at 21:50
This will work, but I would have to write a batch file to process all the files, right?
– user44796, Commented Jul 31, 2014 at 13:45
Maybe, maybe not. I don't have enough knowledge of your situation to say.
– Robᵩ, Commented Jul 31, 2014 at 13:49
seconding @Robᵩ, no reason to use python for this. you can write a wrapper for grep in python if you want, but this is a problem for grep.
– acushner, Commented Jul 31, 2014 at 14:40

user3885927 · Accepted Answer · 2014-07-30 23:26:38Z

3

Here is one option. Read all the lines from your file, iterate until you find your line, and return the next 19 lines. You would need to handle the situation where your file doesn't contain 19 more lines.

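    def extract_block(path):
        # Read every line; readlines() keeps the original line endings.
        with open(path, 'r') as fh:
            all_lines = fh.readlines()
        for count, line in enumerate(all_lines):
            if "1:" in line:
                # The 19 lines after the match (use count:count+20 to include the match itself).
                return all_lines[count+1:count+20]
        return []  # no line containing "1:" was found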
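The caveat about files with fewer than 19 remaining lines can be handled by checking the length of the slice, since slicing past the end of a list returns fewer items instead of raising an error. The following is a minimal sketch under a few assumptions: the function name `block_with_marker` and its parameters are made up for illustration, the marker line itself is kept in the result (as the question asks), and the match uses `startswith` rather than `in`.

```python
def block_with_marker(lines, marker="1:", n_after=19):
    """Return the first line starting with `marker` plus up to `n_after` lines after it."""
    for i, line in enumerate(lines):
        if line.startswith(marker):
            # Slicing never raises; near the end of the file it just returns fewer lines.
            return lines[i:i + 1 + n_after]
    return []  # marker not found


# Hypothetical sample data standing in for the result of fh.readlines().
sample = ["header\n", "1: first\n", "a\n", "b\n"]
block = block_with_marker(sample)
if len(block) < 20:
    print("short block: %d lines" % len(block))  # prints "short block: 3 lines"
```

Because the lines are returned untouched, writing them back with `writelines` should reproduce the original formatting.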

edited Jul 30, 2014 at 23:26

answered Jul 30, 2014 at 22:17

user3885927


4 Comments

Taryn East Over a year ago

Hiya, this may well solve the problem... but it'd be good if you could provide a little explanation about how and why it works :) Don't forget - there are heaps of newbies on Stack overflow, and they could learn a thing or two from your expertise - what's obvious to you might not be so to them.

user3885927 Over a year ago

Added some text. This is my first post here and seems there are some enthusiastic down voters!

Taryn East Over a year ago

Yeah, it's encouraged to downvote in order for people to learn what a good (or not) answer is like :) It's not personal... though people really should leave a comment as to why, and code-only answers are often downvoted (I didn't downvote btw) ;)

user44796 Over a year ago

I like this solution as well, but I can't get return to work here for some reason. It works when I replace the return with print.

thebjorn · Accepted Answer · 2014-07-31 18:06:11Z

I always prefer to read the file into memory first, but sometimes that's not possible. If you want to use iteration then this will work:

def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return    # no '1:' in file

        yield line    # yield line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line


if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)

It uses the else: clause on a for-loop, which you don't see very often anymore but which was created for just this purpose (the else clause is executed if the for-loop doesn't break).
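To see the for/else behaviour in isolation, here is a small sketch. The helper name `first_match` is hypothetical and not part of the answer; it only shows that the else branch runs when the loop finishes without hitting break, including when the input is empty.

```python
def first_match(items, prefix):
    """Return the first item starting with `prefix`, or None if there is none."""
    for item in items:
        if item.startswith(prefix):
            break          # found it: the else branch below is skipped
    else:
        return None        # loop ran to completion without break
    return item


print(first_match(["header", "1: data", "more"], "1:"))  # prints "1: data"
print(first_match(["header", "more"], "1:"))             # prints "None"
```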

thebjorn · Accepted Answer · 2014-07-31 17:41:38Z

0

Could be done in a one-liner...

open(files[0]).read().split('1:', 1)[1].split('\n')[:19]

or more readable

txt = open(files[0]).read()           # read the file into a big string
before, after = txt.split('1:', 1)    # split the file on the first "1:"
after_lines = after.split('\n')       # create lines from the after text
lines_to_save = after_lines[:19]      # rest of the "1:" line plus the next 18 lines; use [:20] for 19 lines after it

then join the lines with a newline (and add a newline to the end) before writing it to a new file:

out_text = "1:"                       # add back "1:"
out_text += "\n".join(lines_to_save)  # add all 19 lines with newlines between them
out_text += "\n"                      # add a newline at the end

open("outputfile.txt", "w").write(out_text)

to comply with best practice for reading and writing files you should also be using the with statement to ensure that the file handles are closed as soon as possible. You can create convenience functions for it:

def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()

def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)

then you can replace the first line above with:

txt = read_file(files[0])

and the last line with:

write_file("outputfile.txt", out_text)
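Since the question mentions about 100 files, the extraction could also be wrapped in a loop over a folder. This is only a sketch under stated assumptions: the names `read_block` and `process_folder`, the `*.txt` pattern, and the `_block.txt` output suffix are all invented for illustration, the files are assumed to be readable with the default text encoding, and the marker is assumed to sit at the very start of its line. Opening the files with `newline=""` is meant to pass line endings through unchanged so the original formatting is kept.

```python
from itertools import islice
from pathlib import Path


def read_block(src, marker="1:", n_after=19):
    """Return the marker line plus up to `n_after` following lines, unchanged."""
    with open(src, newline="") as fp:        # newline="" keeps original line endings
        for line in fp:
            if line.startswith(marker):
                # islice continues reading from where the for-loop stopped.
                return [line] + list(islice(fp, n_after))
    return []                                # no marker line in this file


def process_folder(in_dir, out_dir, pattern="*.txt"):
    """Write one '<name>_block.txt' per input file that contains the marker."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(in_dir).glob(pattern)):
        block = read_block(src)
        if not block:
            continue                         # skip files without a "1:" line
        dst = out_dir / (src.stem + "_block.txt")
        with open(dst, "w", newline="") as fp:
            fp.writelines(block)
        written.append(dst)
    return written
```

A call such as `process_folder("input_dir", "blocks")` (both folder names are placeholders) would then return the list of files it wrote.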

edited Jul 31, 2014 at 17:41

answered Jul 30, 2014 at 21:45

thebjorn


4 Comments

user44796 Over a year ago

thebjorn, I like this solution because I can understand the code, but how do I preserve the "1:" in the output file? Also, the formatting from the original file was lost and so I think it may be best to read in line by line.

thebjorn Over a year ago

You'll have to add the 1: back manually, since the split removes it. You can print after, after_lines and lines_to_save to see what's going on.

thebjorn Over a year ago

ps: I'm not sure I understand what formatting that is getting lost -- split('\n') and "\n".join(..) should preserve the file as is..?

user44796 Over a year ago

For what I was doing this worked perfectly because it preserved the data in the format of the original text file.
