Slicing data block by block using Python

Question

Hi everyone, I have a big file in the format shown below. The data is organized in "blocks". Each block contains three rows: the time T, the user U, and the content W. For example, this is a block:

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

Since I will only use the blocks containing a specific keyword, I want to slice the data out of the original massive file block by block, rather than load the whole file into memory. Each time, I read in one block, and if the content row contains the word "bike", I write that block to disk.

You can use the following two blocks to test your script.

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

T   2009-06-11 21:57:23
U   charilie
W   i want a bike

I have tried to do the work line by line:

# Copy every single line that contains "bike" (works per line, not per block)
with open("OWS.txt", 'r') as data, open("result.txt", 'w') as output:
    for line in data:
        if line.find("bike") != -1:
            output.write(line)

Thank you, I have tried using for line in data: if line.find("bike") != -1: output.write(line)
– Frank Wang, Commented May 5, 2012 at 7:53
So I can solve the problem line by line, but I don't know how to do it block by block. You don't need to give all the code, just the key part.
– Frank Wang, Commented May 5, 2012 at 7:55
Do the lines in each block actually start with T, U and W?
– Tim Pietzcker, Commented May 5, 2012 at 8:03
I have tried the script in the hyperlink before; however, it is not concise. You can also refer to github.com/chengjun/python/blob/master/workshop%20of%20python/…
– Frank Wang, Commented May 5, 2012 at 8:09

fraxel · Accepted Answer · 2012-05-05 10:20:13Z

1

As the format of your blocks is constant, you can use a list to hold a block, then see if bike is in that block:

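with open("OWS.txt", 'r') as data, open("result.txt", 'w') as output:
    chunk = []                       # lines of the block being read
    for line in data:
        chunk.append(line)
        if line.startswith('W'):     # the W line closes a block
            # Test only the content line, so that a user name or
            # timestamp containing "bike" does not trigger a match.
            if 'bike' in line:
                output.writelines(chunk)
            chunk = []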
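The same idea can be split into two small generators, which makes it easy to try on the two test blocks from the question without touching the disk. This is only a sketch, not part of the original answer: `iter_blocks` and `blocks_with_keyword` are hypothetical helper names, and it assumes every block ends with a line starting with W, as in the sample.

```python
def iter_blocks(lines):
    """Yield each T/U/W block as a list of its lines."""
    block = []
    for line in lines:
        if not line.strip():
            continue  # skip blank separator lines between blocks
        block.append(line)
        if line.startswith("W"):  # assumption: the W line ends a block
            yield block
            block = []


def blocks_with_keyword(lines, keyword):
    """Yield only the blocks whose W (content) line contains keyword."""
    for block in iter_blocks(lines):
        content = block[-1][1:]  # text after the leading "W" marker
        if keyword in content:
            yield block


# The two test blocks from the question, as a list of lines.
sample = [
    "T   2009-06-11 21:57:23\n",
    "U   tracygazzard\n",
    "W   David Letterman is good man\n",
    "\n",
    "T   2009-06-11 21:57:23\n",
    "U   charilie\n",
    "W   i want a bike\n",
]

matches = list(blocks_with_keyword(sample, "bike"))
```

Because both functions accept any iterable of lines, an open file object can be passed in place of `sample`, and only one block is held in memory at a time.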

answered May 5, 2012 at 10:20


Tim Pietzcker · Accepted Answer · 2012-05-05 08:08:52Z

1

You can use regular expressions:

import re

# Read the entire file into a string (needs enough memory for the whole file)
with open("OWS.txt", 'r') as f:
    data = f.read()

block_re = re.compile(
    r"""(?mx)          # Verbose regex, ^ matches start of line
    ^T\s+(?P<T>.*)\s*  # Match first line
    ^U\s+(?P<U>.*)\s*  # Match second line
    ^W\s+(?P<W>.*)\s*  # Match third line""")

with open("result.txt", 'w') as output:
    for match in block_re.finditer(data):
        if "bike" in match.group("W"):
            output.write(match.group())  # outputs entire match
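To see what the named groups capture, the same pattern can be tried on the two test blocks from the question held in a string. This is a small sketch rather than part of the answer; `BLOCK_RE`, `text` and `hits` are names chosen here for illustration.

```python
import re

# Same pattern as above: (?m) lets ^ match at each line start,
# (?x) allows whitespace and comments inside the pattern.
BLOCK_RE = re.compile(
    r"""(?mx)
    ^T\s+(?P<T>.*)\s*  # time line
    ^U\s+(?P<U>.*)\s*  # user line
    ^W\s+(?P<W>.*)\s*  # content line
    """
)

text = (
    "T   2009-06-11 21:57:23\n"
    "U   tracygazzard\n"
    "W   David Letterman is good man\n"
    "\n"
    "T   2009-06-11 21:57:23\n"
    "U   charilie\n"
    "W   i want a bike\n"
)

# Keep the captured fields of every block whose content mentions "bike".
hits = [
    m.groupdict()
    for m in BLOCK_RE.finditer(text)
    if "bike" in m.group("W")
]
```

Each entry of `hits` is a dict with the keys T, U and W, so the individual fields are available without further string splitting.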

answered May 5, 2012 at 8:08


3 Comments

Frank Wang Over a year ago

Have you considered the issue of memory? The code says "# Read the entire file into a string".

Tim Pietzcker Over a year ago

@FrankWANG: Well, how big are your files?

Frank Wang Over a year ago

It's 26 GB, although I can split it into smaller ones.
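For a file of that size, one option that keeps memory use flat is to stream the input and group lines into blocks on the fly. The sketch below uses `itertools.groupby` for the grouping; it assumes blocks are separated by blank lines, as in the sample from the question, and `filter_file` with its parameters is a hypothetical name, not something from the answers above.

```python
from itertools import groupby


def filter_file(src_path, dst_path, keyword):
    """Copy blocks whose W line contains keyword; return how many were copied."""
    kept = 0
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        # groupby splits the line stream into alternating runs of
        # blank lines (key True) and non-blank lines (key False).
        for is_blank, group in groupby(src, key=lambda line: not line.strip()):
            if is_blank:
                continue
            block = list(group)  # one block: only a few lines in memory
            # Look for the keyword in the content (W) line only.
            if any(l.startswith("W") and keyword in l[1:] for l in block):
                dst.writelines(block)
                dst.write("\n")  # keep a blank line between output blocks
                kept += 1
    return kept
```

A call such as `filter_file("OWS.txt", "result.txt", "bike")` would then process the file in a single pass, reading it line by line instead of loading it whole.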
