1

I am trying to parse some text files and need to extract blocks of text: specifically, the line that starts with "1:" and the 19 lines after it. The "1:" line does not fall on the same row in each file, and there is only one instance of "1:" per file. I would like to save each block of text to a separate file. I also need to preserve the formatting of the text in the original file.

Needless to say I am new to Python. I generally work with R but these files are not really compatible with R and I have about 100 to process. Any information would be appreciated.

The code that I have so far is:

with open(files[0], "r") as tmp:
    lines = tmp.readlines()

num = 0
a = 0

for line in lines:
    num += 1
    if "1:" in line:
        a = num  # 1-based line number of the "1:" line
        break

a = num is the line number where the block of text I want begins. I then want to save that line and the next 19 lines to another file, but I can't figure out how to do this. Any help would be appreciated.

6
  • If all you need to do is extract the lines, you can do that without writing a whole new program: egrep -A 19 "^1:" myfile.txt Commented Jul 30, 2014 at 21:50
  • This will work, but I would have to write a batch file to process all the files, right? Commented Jul 31, 2014 at 13:45
  • Maybe, maybe not. I don't have enough knowledge of your situation to say. Commented Jul 31, 2014 at 13:49
  • How big are the input files, typically and maximum? Commented Jul 31, 2014 at 13:52
  • seconding @Robᵩ, no reason to use python for this. you can write a wrapper for grep in python if you want, but this is a problem for grep. Commented Jul 31, 2014 at 14:40

3 Answers

3

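Here is one option. Read all the lines from your file, iterate until you find the line containing "1:", and return the next 19 lines. If the file has fewer than 19 lines after the match, the slice simply returns the lines that are there, so decide whether a short block is acceptable for your data.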

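    def lines_after_marker(fname):
        with open(fname, 'r') as fh:
            all_lines = fh.readlines()
        for count, line in enumerate(all_lines):
            if "1:" in line:
                # `return` is only valid inside a function, hence the wrapper.
                # This slice is the 19 lines after the match; start at `count` to include the match.
                return all_lines[count+1:count+20]
        return []  # no "1:" found

    next_lines = lines_after_marker('yourfile.txt')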
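
For comparison, here is a minimal sketch that also keeps the "1:" line itself, as the question asks. The names `extract_block`, `marker`, and `n_after` are made up for illustration, and it assumes the marker sits at the very start of the line (hence `startswith` rather than `in`). It also shows that slicing past the end of a list does not raise, so a short file just produces a shorter block.

```python
def extract_block(lines, marker="1:", n_after=19):
    """Return the first line starting with `marker` plus up to `n_after` lines after it."""
    for index, line in enumerate(lines):
        if line.startswith(marker):
            # Slicing past the end of a list is safe; it just returns fewer items.
            return lines[index:index + 1 + n_after]
    return []  # marker not found


sample = ["header\n", "1: first\n", "a\n", "b\n"]
block = extract_block(sample, n_after=2)    # marker line plus 2 lines
short = extract_block(sample, n_after=19)   # fewer than 19 lines available
missing = extract_block(["no marker here\n"])
```

Because the lines are never stripped or rejoined, whatever whitespace and line endings they had when read are passed through unchanged.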

4 Comments

Hiya, this may well solve the problem... but it'd be good if you could provide a little explanation about how and why it works :) Don't forget - there are heaps of newbies on Stack overflow, and they could learn a thing or two from your expertise - what's obvious to you might not be so to them.
Added some text. This is my first post here and seems there are some enthusiastic down voters!
Yeah, it's encouraged to downvote in order for people to learn what a good (or not) answer is like :) It's not personal... though people really should leave a comment as to why, and code-only answers are often downvoted (I didn't downvote btw) ;)
I like this solution as well, but I can't get return to work here for some reason. It works when I replace the return with print.
1

I always prefer to read the file into memory first, but sometimes that's not possible. If you want to use iteration then this will work:

def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return    # no '1:' in file

        yield line    # yield line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line


if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)

It uses the else: clause on a for-loop, which you don't see very often anymore but which was created for just this purpose (the else clause is executed only if the for-loop finishes without hitting a break).
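
A variation on the same idea, offered only as a sketch: `itertools.islice` can stand in for the manual counter in the second loop. The function name `block_after_marker` and its parameters are hypothetical, and `io.StringIO` is used here in place of a real file so the for/else behaviour can be tried without any input files.

```python
import io
from itertools import islice


def block_after_marker(fp, marker="1:", n_after=19):
    """Yield the first line starting with `marker`, then up to `n_after` more lines."""
    for line in fp:
        if line.startswith(marker):
            break
    else:
        # Runs only when the loop finished without hitting `break`.
        return
    yield line
    # islice continues from where the loop above stopped reading.
    yield from islice(fp, n_after)


text = "intro\n1: start\nx\ny\nz\n"
found = list(block_after_marker(io.StringIO(text), n_after=2))
not_found = list(block_after_marker(io.StringIO("intro\nnothing\n")))
```

With a real file, pass the object returned by `open(fname)` instead of the `StringIO`; it is consumed line by line in the same way.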


0

Could be done in a one-liner...

open(files[0]).read().split('1:', 1)[1].split('\n')[:19]

or more readable

txt = open(files[0]).read()           # read the file into a big string
before, after = txt.split('1:', 1)    # split the file on the first "1:"
after_lines = after.split('\n')       # create lines from the after text
lines_to_save = after_lines[:19]      # rest of the "1:" line plus the next 18 lines ([:20] gives 19 lines after it)

then join the lines with a newline (and add a newline to the end) before writing it to a new file:

out_text = "1:"                       # add back "1:"
out_text += "\n".join(lines_to_save)  # add all 19 lines with newlines between them
out_text += "\n"                      # add a newline at the end

open("outputfile.txt", "w").write(out_text)

To follow best practice for reading and writing files, you should also use the with statement, which ensures that the file handles are closed as soon as possible. You can create convenience functions for it:

def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()

def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)

then you can replace the first line above with:

txt = read_file(files[0])

and the last line with:

write_file("outputfile.txt", out_text)
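
Since the question mentions about 100 files, here is one possible sketch of running the extraction over a whole directory. Everything named here (`extract_block_text`, `extract_all`, the `*.txt` pattern, the directory arguments) is an assumption for illustration, not part of the answer above. It assumes "1:" sits at the start of its line, and it opens files with `newline=""` so that Python does not translate line endings on the way in or out.

```python
from pathlib import Path


def extract_block_text(text, marker="1:", n_after=19):
    """Return the marker line plus up to `n_after` following lines as one string."""
    lines = text.splitlines(keepends=True)  # keepends=True keeps each line's own ending
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return "".join(lines[index:index + 1 + n_after])
    return ""  # marker not found


def extract_all(src_dir, dst_dir, pattern="*.txt"):
    """Write one extracted block per input file; return the output paths."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(src_dir).glob(pattern)):
        # newline="" turns off newline translation so line endings are kept as-is
        with open(src, newline="") as fp:
            block = extract_block_text(fp.read())
        if block:
            out_path = dst / src.name
            with open(out_path, "w", newline="") as fp:
                fp.write(block)
            written.append(out_path)
    return written
```

A call such as `extract_all("input_dir", "output_dir")` (both directory names are placeholders) would write each block to a file of the same name in the output directory and skip files that contain no "1:" line.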

4 Comments

thebjorn, I like this solution because I can understand the code, but how do I preserve the "1:" in the output file? Also, the formatting from the original file was lost and so I think it may be best to read in line by line.
You'll have to add the 1: back manually, since the split removes it. You can print after, after_lines and lines_to_save to see what's going on.
ps: I'm not sure I understand what formatting that is getting lost -- split('\n') and "\n".join(..) should preserve the file as is..?
For what I was doing this worked perfectly because it preserved the data in the format of the original text file.
