How to only read lines in a text file after a certain string?

Question

I'd like to read into a dictionary all of the lines in a text file that come after a particular string, and I'd like to do this for thousands of text files.

I'm able to identify and print out the particular string ('Abstract') using the following code (gotten from this answer):

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                print(line)

But how do I tell Python to start reading the lines that only come after the string?

Krister Janmore · Accepted Answer · 2020-10-24 12:36:13Z

31

Just start another loop when you reach the line you want to start from:

for files in filepath:
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                for line in f:  # now you are at the lines you want
                    print(line)  # do work

A file object is its own iterator, so when we reach the line with 'Abstract' in it we continue our iteration from that line until we have consumed the iterator.

A simple example:

gen = (n for n in range(8))

for x in gen:
    if x == 3:
        print('Starting second loop')
        for x in gen:
            print('In second loop', x)
    else:
        print('In first loop', x)

Produces:

In first loop 0
In first loop 1
In first loop 2
Starting second loop
In second loop 4
In second loop 5
In second loop 6
In second loop 7
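
The resuming behaviour depends on the inner loop iterating over the same iterator object. As a small sketch of the contrast (the helper name `nested_loop` is made up for illustration), the same pattern over a list starts again from the first element, because each `for` loop over a list creates a fresh iterator:

```python
def nested_loop(iterable):
    """Run the nested-loop pattern and return the items seen by the inner loop."""
    seen = []
    for x in iterable:
        if x == 3:
            for y in iterable:  # resumes only if `iterable` is an iterator
                seen.append(y)
            break
    return seen


print(nested_loop(iter(range(8))))  # iterator: [4, 5, 6, 7]
print(nested_loop(list(range(8))))  # list: [0, 1, 2, 3, 4, 5, 6, 7]
```

So if the lines were first loaded into a list (for example with `f.readlines()`), the inner loop would not continue from the matching line.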

You can also use itertools.dropwhile to consume the lines up to the point you want:

from itertools import dropwhile

for files in filepath:
    with open(files, 'r') as f:
        dropped = dropwhile(lambda _line: 'Abstract' not in _line, f)
        next(dropped, '')
        for line in dropped:
            print(line)
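
To try the dropwhile approach without touching real files, here is a minimal sketch that uses `io.StringIO` as a stand-in for an open file. The function name `lines_after` and the sample text are assumptions for illustration only:

```python
import io
from itertools import dropwhile


def lines_after(f, marker):
    """Return the lines of `f` that follow the first line containing `marker`."""
    dropped = dropwhile(lambda line: marker not in line, f)
    next(dropped, None)  # discard the marker line itself, if there is one
    return list(dropped)


sample = io.StringIO("Title\nAbstract\nfirst\nsecond\n")
print(lines_after(sample, "Abstract"))  # ['first\n', 'second\n']
```

Passing a default to `next` means a file without the marker simply yields an empty list instead of raising `StopIteration`.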

edited Oct 24, 2020 at 12:36

Krister Janmore

answered Jan 6, 2015 at 19:42

Padraic Cunningham


6 Comments

Kroltan Over a year ago

It works, but it's kinda strange, don't you think? Anyone who doesn't understand how generators work will scratch their head as to why it produces the correct output.

Padraic Cunningham Over a year ago

@Kroltan, well I presume people looking at python know how python code works. This is pretty basic python

Kroltan Over a year ago

Well but I wouldn't be so sure the OP is aware of that.

Kyle Burkett Over a year ago

This doesn't work for me... it starts at the beginning every time. The loop is embedded and it still starts at the beginning.

Padraic Cunningham Over a year ago

@KyleBurkett, that is simply not possible: whatever you consume from an iterator is gone. If it does not work, then something is wrong in your code rather than in the answer, so debugging your code instead of downvoting might be a better option.

Kroltan · Accepted Answer · 2015-01-06 19:41:01Z

9

Use a boolean to ignore lines up to that point:

for files in filepath:
    found_abstract = False  # reset the flag for every file
    with open(files, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                found_abstract = True
            if found_abstract:
                # do whatever you want (this also sees the 'Abstract' line itself)
                print(line)

answered Jan 6, 2015 at 19:41

Kroltan

Jon Clements · Accepted Answer · 2015-01-06 19:52:43Z

8

You can use itertools.dropwhile and itertools.islice here, a pseudo-example:

from itertools import dropwhile, islice

for fname in filepaths:
    with open(fname) as fin:
        start_at = dropwhile(lambda L: 'Abstract' not in L.split(), fin)
        for line in islice(start_at, 1, None):  # skip the line that contains 'Abstract'
            print(line)

edited Jan 6, 2015 at 19:52

user2555451

answered Jan 6, 2015 at 19:47

Jon Clements

eguaio · Accepted Answer · 2017-09-28 09:47:43Z

8

To me, the following code is easier to understand.

with open(file_name, 'r') as f:
    while 'Abstract' not in next(f):  # raises StopIteration if 'Abstract' never appears
        pass
    for line in f:
        # line is now the next line after the one that contains 'Abstract'
        print(line)

edited Sep 28, 2017 at 9:47

answered Oct 31, 2016 at 18:21

eguaio

4 Comments

yehudahs Over a year ago

I am getting AttributeError: '_io.TextIOWrapper' object has no attribute 'next'

eguaio Over a year ago

You are probably using Python 3. Try next(f) instead of f.next() and let me know if it works.

Erdss4 Over a year ago

When I use a str variable instead of a hard coded value, I get a stop iteration error :(

eguaio Over a year ago

I don't think that error can be caused by using a string variable. It is probably because the string is not present in the file. If it is, perhaps the encoding of the file is introducing some problems.
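
If the marker might be missing from some files, one way to avoid the StopIteration is to give `next` a default value and stop when it comes back. This is only a sketch: the helper name `skip_past` and the `io.StringIO` stand-ins for files are assumptions, not part of the answer above.

```python
import io


def skip_past(f, marker):
    """Advance `f` past the first line containing `marker`.

    Returns True if the marker was found, False if the file ended first.
    """
    while True:
        line = next(f, None)  # None signals end of file instead of StopIteration
        if line is None:
            return False
        if marker in line:
            return True


present = io.StringIO("Intro\nAbstract\nbody\n")
absent = io.StringIO("Intro\nbody\n")
print(skip_past(present, "Abstract"), list(present))  # True ['body\n']
print(skip_past(absent, "Abstract"), list(absent))    # False []
```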

Henry Keiter · Accepted Answer · 2015-01-06 19:52:03Z

Just to clarify, your code already "reads" all the lines. To start "paying attention" to lines after a certain point, you can just set a boolean flag to indicate whether or not lines should be ignored, and check it at each line.

pay_attention = False
for line in f:
    if pay_attention:
        print(line)
    else:  # We haven't found our trigger yet; see if it's in this line
        if 'Abstract' in line:
            pay_attention = True

If you don't mind a little more rearranging of your code, you can also use two partial loops instead: one loop that terminates once you've found your trigger phrase ('Abstract'), and one that reads all following lines. This approach is a little cleaner (and a very tiny bit faster).

for skippable_line in f:  # First skim over all lines until we find 'Abstract'.
    if 'Abstract' in skippable_line:
        break
for line in f:  # The file's iterator starts up again right where we left it.
    print(line)

The reason this works is that the file object returned by open behaves like a generator, rather than, say, a list: it only produces values as they are requested. So when the first loop stops, the file is left with its internal position set at the beginning of the first "unread" line. This means that when you enter the second loop, the first line you see is the first line after the one that triggered the break.
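
The point about the file keeping its position can be checked directly. The sketch below uses `io.StringIO` in place of a real file (an assumption for the sake of a self-contained example); a file object is its own iterator, so both loops advance the same underlying position:

```python
import io

f = io.StringIO("header\nAbstract\nline one\nline two\n")

# A file-like object returns itself from iter(), so every loop shares one position.
print(iter(f) is f)  # True

for skippable_line in f:
    if 'Abstract' in skippable_line:
        break

# The second pass picks up at the first line that has not been read yet.
remaining = [line.rstrip('\n') for line in f]
print(remaining)  # ['line one', 'line two']
```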

Steve Jessop · Accepted Answer · 2015-01-06 20:03:26Z

1

Making a guess as to how the dictionary is involved, I'd write it this way:

lines = dict()
for filename in filepath:
    with open(filename, 'r') as f:
        for line in f:
            if 'Abstract' in line:
                break
        lines[filename] = tuple(f)

So for each file, your dictionary contains a tuple of lines.

This works because the loop reads up to and including the line you identify, leaving the remaining lines in the file ready to be read from f.
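
As a quick way to try this end to end, here is a hedged sketch that wraps the same loop in a function and runs it against a temporary file. The function name `lines_after_marker`, the file name `paper.txt`, and the sample contents are all made up for the example:

```python
import os
import tempfile


def lines_after_marker(paths, marker='Abstract'):
    """Map each path to a tuple of the lines that follow the marker line."""
    result = {}
    for path in paths:
        with open(path, 'r') as f:
            for line in f:
                if marker in line:
                    break
            # Whatever the loop did not consume is still waiting in `f`.
            result[path] = tuple(f)
    return result


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'paper.txt')
    with open(path, 'w') as out:
        out.write('Title\nAbstract\nFirst line\nSecond line\n')
    collected = lines_after_marker([path])
    print(collected[path])  # ('First line\n', 'Second line\n')
```

A file that never contains the marker ends up with an empty tuple, since the first loop consumes every line.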

answered Jan 6, 2015 at 20:03

Steve Jessop
