Hi everyone, I have a big file in the format given below. The data is in a "block" format: each block contains three rows, the time T, the user U, and the content W. For example, this is a block:

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

Since I will only be using the blocks that contain a specific keyword, I want to slice the data out of the original massive file block by block, rather than loading the whole file into memory. Each time, I read in one block, and if its content row contains the word "bike", I write that block to disk.

You can use the following two blocks to test your script.

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

T   2009-06-11 21:57:23
U   charilie
W   i want a bike

I have tried to do the work line by line:

data = open("OWS.txt", 'r')
output = open("result.txt", 'w')

for line in data:
    if line.find("bike") != -1:
        output.write(line)

5 Comments
  • Thank you. I have tried using for line in data: if line.find("bike") != -1: output.write(line) Commented May 5, 2012 at 7:53
  • So I can solve the problem line by line, but I don't know how to do it block by block. You don't need to give all the code, just the key part. Commented May 5, 2012 at 7:55
  • Do the lines in each block actually start with T, U and W? Commented May 5, 2012 at 8:03
  • Yes. For this specific data, it's formatted in this way. Commented May 5, 2012 at 8:08
  • I have tried the script in the hyperlink before; however, it's not concise. But you can also refer to github.com/chengjun/python/blob/master/workshop%20of%20python/… Commented May 5, 2012 at 8:09

2 Answers

1

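As the format of your blocks is constant, you can use a list to hold a block, then check whether the content line of that block contains bike:

with open("OWS.txt", 'r') as data, open("result.txt", 'w') as output:
    chunk = []  # lines of the block currently being read
    for line in data:
        chunk.append(line)
        if line.startswith('W'):  # the W line closes a block
            # Test only the content line, so a user name containing
            # "bike" does not count as a match
            if 'bike' in line:
                output.writelines(chunk)
            chunk = []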
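
The same idea can be split into a small generator, so the block-reading logic is separate from the filtering. This is only a sketch: the names `iter_blocks` and `filter_blocks` are made up for illustration, and it assumes every block ends with a line starting with W, as confirmed in the comments on the question. Unlike the loop above, it skips the blank separator lines.

```python
import io


def iter_blocks(lines):
    """Yield each block as a list of its lines, ending at the W line."""
    block = []
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines between blocks
        block.append(line)
        if line.startswith("W"):
            yield block
            block = []


def filter_blocks(lines, keyword="bike"):
    """Yield only the blocks whose W (content) line contains keyword."""
    for block in iter_blocks(lines):
        if keyword in block[-1]:
            yield block


# The two test blocks from the question, read from memory instead of a file.
sample = io.StringIO(
    "T   2009-06-11 21:57:23\n"
    "U   tracygazzard\n"
    "W   David Letterman is good man\n"
    "\n"
    "T   2009-06-11 21:57:23\n"
    "U   charilie\n"
    "W   i want a bike\n"
)
matches = list(filter_blocks(sample))
```

Because a file object is also an iterable of lines, passing an open file to `filter_blocks` instead of `sample` should keep only one block in memory at a time.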

1

You can use regular expressions:

import re

with open("OWS.txt", 'r') as f:
    data = f.read()   # Read the entire file into a string

with open("result.txt", 'w') as output:
    for match in re.finditer(
        r"""(?mx)          # Multiline, verbose regex: ^ matches start of line
        ^T\s+(?P<T>.*)\s*  # Match first line
        ^U\s+(?P<U>.*)\s*  # Match second line
        ^W\s+(?P<W>.*)\s*  # Match third line""",
        data):
        if "bike" in match.group("W"):
            output.write(match.group())  # outputs entire match

3 Comments

Have you considered the issue of memory? # Read the entire file into a string
@FrankWANG: Well, how big are your files?
It's 26 GB, although I can split it into smaller ones.
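
For a file of that size, one possible way to keep the regex approach without reading everything into a string is to memory-map the file and run a bytes pattern over the map. The following is only a sketch under assumptions: the function name `filter_with_mmap` is hypothetical, it needs a 64-bit Python to map a file this large, and it has not been tried on a 26 GB input.

```python
import mmap
import re

# Bytes version of the pattern above; re.MULTILINE makes ^ match at line starts.
BLOCK_RE = re.compile(
    rb"^T\s+(?P<T>.*)\s*^U\s+(?P<U>.*)\s*^W\s+(?P<W>.*)\s*",
    re.MULTILINE,
)


def filter_with_mmap(src_path, dst_path, keyword=b"bike"):
    """Copy every block whose W line contains keyword from src to dst."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Map the file instead of reading it; mmap raises ValueError
        # for an empty file, so this assumes the input is not empty.
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in BLOCK_RE.finditer(mm):
                if keyword in match.group("W"):
                    dst.write(match.group())  # write the entire block
```

Calling `filter_with_mmap("OWS.txt", "result.txt")` would then mirror the script above, with the operating system paging the file in as the regex scans it.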
