
I'm new to Python and trying to do a nested loop. I have a very large file (1.1 million rows), and I'd like to use it to create a file that has each line along with the next N lines, for example with the next 3 lines:

1    2
1    3
1    4
2    3
2    4
2    5

Right now I'm just trying to get the loops working with row numbers instead of the strings, since that's easier to visualize. I came up with this code, but it's not behaving how I want:

with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line_a = i
        lower_bound = i + 1
        upper_bound = i + 4
        with open('C:/working_file.txt', mode='r', encoding='utf8') as g:
            for j, line in enumerate(g):
                while j >= lower_bound and j <= upper_bound:
                    line_b = j
                    j = j + 1
                    print(line_a, line_b)

Instead of the output I want (like the above), it's giving me this:

990     991
990     992
990     993
990     994
990     992
990     993
990     994
990     993
990     994
990     994

As you can see the inner loop is iterating multiple times for each line in the outer loop. It seems like there should only be one iteration per line in the outer loop. What am I missing?

EDIT: My question was answered below, here is the exact code I ended up using:

from collections import deque
from itertools import cycle
log = open('C:/example.txt', mode='w', encoding = 'utf8') 
try:
    xrange 
except NameError: # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in range(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

for l in window(open('c:/working_file.txt', mode='r', encoding='utf8'),100):
    for a, b in l:
        print(a.strip() + '\t' + b.strip(), file=log)
    for j, line in enumerate(g) and j = j+1 should never ever go together... Commented Dec 10, 2013 at 0:08
  • I don't see how else it can work - you are having a loop within a loop. Of course line_a stays the same for all your iterations through file g. Commented Dec 10, 2013 at 0:10
  • @sashkello Why should that not ever be done? What is the alternative? I just started learning python. Commented Dec 10, 2013 at 0:20
  • for i in mylist iterates over all objects within mylist. Modifying i at the same time makes the program confusing because i is not necessarily within the list any more. In your case you can do for n in range(lower_bound, upper_bound+1). Commented Dec 10, 2013 at 0:25
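A minimal sketch of the fix the comments point at, testing j against the bounds instead of mutating it, using row numbers as in the question (the function name is illustrative):

```python
# Sketch of the fix hinted at in the comments: replace the inner while-loop
# (which mutates the loop variable) with a plain bounds test.
# Row numbers stand in for file lines, as in the question.
def pairs_by_index(num_lines, n=3):
    out = []
    for i in range(num_lines):        # outer file, line i
        lower_bound, upper_bound = i + 1, i + n
        for j in range(num_lines):    # inner file, line j
            if lower_bound <= j <= upper_bound:
                out.append((i, j))
    return out

print(pairs_by_index(5))
# → [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

This still re-scans the inner range once per outer line, so it keeps the original code's quadratic behavior; the answers below avoid that.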

5 Answers


Based on window example from old docs you can use something like:

from collections import deque
from itertools import cycle

try:
    xrange 
except NameError: # python3
    xrange = range

def pack(d):
    tup = tuple(d)
    return zip(cycle(tup[0:1]), tup[1:])

def window(seq, n=2):
    it = iter(seq)
    d = deque((next(it, None) for _ in xrange(n)), maxlen=n)
    yield pack(d)
    for e in it:
        d.append(e)
        yield pack(d)

Demo:

>>> for l in window([1,2,3,4,5], 4):
...     for l1, l2 in l:
...     print(l1, l2)
...
1 2
1 3
1 4
2 3
2 4
2 5

So, basically, you can pass your file to window to get the desired result:

window(open('C:/working_file.txt', mode='r', encoding='utf8'), 4)

3 Comments

  • +1 for itertools. This is much, much better than my solution with readlines, because it doesn't read the entire file into memory. But note that the OP seems to be using Python 3, so some of the code needs adjusting - xrange -> range jumps out, for example.
  • +1 for using recipes I've seen before instead of reinventing the wheel.
  • This worked perfectly for me with a little tweaking - thank you! I updated my question with the exact code I used.

You can do this with slices. This is easiest if you read the whole file into a list first:

with open('C:/working_file.txt', mode='r', encoding = 'utf8') as f: 
    data = f.readlines()

for i, line_a in enumerate(data):
    for j, line_b in enumerate(data[i+1:i+5], start=i+1):
        print(i, j)

When you change it to printing the lines instead of the line numbers, you can drop the second enumerate and just do for line_b in data[i+1:i+5]. Note that the slice includes the item at the start index, but not the item at the end index, so that needs to be one higher than your current upper bound.
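A sketch of that line-printing variant (pair_lines is an illustrative name; a small list stands in for the file):

```python
# Pair each line with its next n lines using list slices.
# The slice end is exclusive, hence i + 1 + n.
def pair_lines(data, n=3):
    pairs = []
    for i, line_a in enumerate(data):
        for line_b in data[i + 1:i + 1 + n]:
            pairs.append((line_a.strip(), line_b.strip()))
    return pairs

print(pair_lines(["1\n", "2\n", "3\n", "4\n", "5\n"]))
# → [('1', '2'), ('1', '3'), ('1', '4'), ('2', '3'), ('2', '4'),
#    ('2', '5'), ('3', '4'), ('3', '5'), ('4', '5')]
```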



Based on alko's answer, I would suggest using the window recipe unmodified:

from itertools import islice

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result    
    for elem in it:
        result = result[1:] + (elem,)
        yield result

for l in window([1, 2, 3, 4, 5], 4):
    for item in l[1:]:
        print(l[0], item)



I think the easiest way to solve this problem would be to read your file into a dictionary...

my_data = {}
with open('C:/working_file.txt', mode='r', encoding='utf8') as f:
    for i, line in enumerate(f):
        my_data[i] = line

After that is done you can do

for x in my_data:
    for y in range(1, 4):
        print(my_data[x], my_data[x + y])

As written, you are re-reading your million-line file a million times - once for each line...
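One caveat: near the end of the file, my_data[x + y] will raise a KeyError, because those keys don't exist. A guarded sketch of the same dict-based idea (illustrative names; a list of integers stands in for the file):

```python
# Same dict-based idea, but guarded against running past the last line.
def dict_pairs(lines, n=3):
    my_data = dict(enumerate(lines))
    out = []
    for x in my_data:
        for y in range(1, n + 1):
            if x + y in my_data:  # skip pairs past the end of the file
                out.append((my_data[x], my_data[x + y]))
    return out

print(dict_pairs([1, 2, 3, 4, 5]))
# → [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
```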

5 Comments

  • Thanks Paul - so am I correct that you're suggesting this? I get an error: f = open('C:/working_file.txt', mode='r', encoding = 'utf8') my_data = {} for i, line in f: my_data[i] = line for x in my_data: for y in range(1, 4): out.write(my_data[x] + " " + my_data[x + y]
  • What error did you get? I just re-read your code and realized you were using a print statement for output - I changed out.write to print.
  • Here's the error: Traceback (most recent call last): File "loop_test.py", line 20, in <module> for i, line in f: ValueError: too many values to unpack (expected 2)
  • I forgot the enumerate in the first loop. My apologies.
  • Converting to a dict is unnecessary overhead, as you have to construct it, at least evaluating hashes for all the keys (i.e. for all the lines in the file), while you could operate over a list instead; moreover, loading the whole file is also unnecessary overhead, and this approach combines both.

Since this is quite a big file, you might not want to load it all into memory at once. So, to avoid reading a line more than once, this is what you do:

  • Make a list with N elements, where N is the number of next lines to read.

    • When you read the first line, add it to the first item in the list.
    • Add the next line to the first and second items.
    • And so on for each line.
  • When an item in that list reaches length N, take it out and append it to the output file, and add an empty item at the end so you still have a list of N items.

This way you only need to read each line once, and you won't have to load the whole file into memory. You only need to hold, at most, N! lines in memory.
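The steps above could be sketched roughly as follows (a streaming generator; stream_pairs and all names are illustrative, and plain integers stand in for file lines):

```python
# Streaming version of the steps described above: keep a list of
# partially-filled groups, one per recent line, each collecting its
# next n lines; emit a group once it is full.
def stream_pairs(lines, n=3):
    pending = []  # each entry: (first_line, [following lines so far])
    for line in lines:
        for group in pending:       # add this line to every open group
            group[1].append(line)
        pending.append((line, []))  # open a new group for this line
        if len(pending[0][1]) == n:  # oldest group is complete
            first, rest = pending.pop(0)
            for other in rest:
                yield (first, other)
    for first, rest in pending:      # flush incomplete groups at EOF
        for other in rest:
            yield (first, other)

print(list(stream_pairs([1, 2, 3, 4, 5])))
# → [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
```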

1 Comment

  • This is roughly what alko's itertools solution does, except it has O(N) memory usage instead of O(N!).
