Implement sliding window on file lines in Python

Question

I'm trying to implement a sliding/moving window approach on the lines of a CSV file using Python. Each line has a column with a binary value, yes or no. Basically, I want to remove rare yes noise. That is, if there are 3 yes lines in a window of 5 (at most 5 lines), keep them. But if there are only 1 or 2, change them to no. How can I do that?

For instance, both yes values in the following should become no.

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.20
...

But in the following, we keep the lines as they are (there is a window of 5 in which 3 lines are yes):

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,yes,0.20
...

I attempted writing something, having a window of 5, but got stuck (it is not complete):

        window_size = 5 
        filename='C:\\Users\\username\\v3\\And-'+v3file.split("\\")[5]
        with open(filename) as fin:
            with open('C:\\Users\\username\\v4\\And2-'+v3file.split("\\")[5],'w') as finalout:
                line= fin.readline()
                index = 0
                sequence= []
                accs=[]
                while line:
                    print(line)
                    for i in range(window_size):
                        line = fin.readline()
                        sequence.append(line)
                    index = index + 1
                    fin.seek(index)

Are you trying to solve this by keeping the most recent three rows in a variable/window?
– wwii, Commented Dec 17, 2019 at 19:17
@wwii Actually, let's say a maximum out of a window of 5 (the 3 yes lines do not necessarily need to be consecutive). Updated the question a bit.
– angel_30, Commented Dec 17, 2019 at 19:22
Is the file very large? Is it important to read one line of the file at a time? If you read the entire file into memory, your problem becomes easier and the code becomes cleaner, and you don't have to do things like fin.seek.
– vasia, Commented Dec 17, 2019 at 19:22
Can you provide a more complete sample, and what the subsequent output should look like?
– PMende, Commented Dec 17, 2019 at 19:22
@vasia The file can be up to 10MB. But if you think it fits in memory, then fine.
– angel_30, Commented Dec 17, 2019 at 19:24

blhsing · Accepted Answer · 2019-12-17 21:57:05Z

4

You can use collections.deque with the maxlen argument set to the desired window size to implement a sliding window that keeps track of the yes/no flags of the most recent 5 rows. To be more efficient, keep a running count of yeses instead of summing the yeses in the sliding window on every iteration. When you have a full-size sliding window and the count of yeses is greater than 2, add the line indices of these yeses to a set of lines whose yeses should be kept as-is. Then, in a second pass after resetting the file pointer of the input, change the yeses to noes on every line whose index is not in the set:

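from collections import deque

window_size = 5
# filename and output_filename are the paths of the input and output files
with open(filename) as fin, open(output_filename, 'w') as finalout:
    yeses = 0                           # number of yeses currently in the window
    window = deque(maxlen=window_size)  # yes/no flags of the most recent rows
    preserved = set()                   # indices of lines whose yes is kept
    # First pass: collect the yeses that sit in a full window with more than 2 yeses
    for index, line in enumerate(fin):
        window.append('yes' in line)
        if window[-1]:
            yeses += 1
        if len(window) == window_size:
            if yeses > 2:
                preserved.update(i for i, f in enumerate(window, index - window_size + 1) if f)
            # the leftmost flag is dropped by the next append, so uncount it now
            if window[0]:
                yeses -= 1
    # Second pass: rewind the input and rewrite every yes that was not preserved
    fin.seek(0)
    for index, line in enumerate(fin):
        if index not in preserved:
            line = line.replace('yes', 'no')
        finalout.write(line)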

Demo: https://repl.it/@blhsing/StripedCleanCopyrightinfringement
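To try the same idea without touching files, the two passes can be wrapped in a function that works on a list of lines. This is only a sketch: the function name keep_dense_yes and the min_yes parameter are made up for illustration, and it assumes, like the code above, that the substring yes appears only in the flag column.

```python
from collections import deque

def keep_dense_yes(lines, window_size=5, min_yes=3):
    """Return a copy of lines where 'yes' becomes 'no' unless the line lies in
    at least one full window of window_size lines with min_yes or more yeses."""
    window = deque(maxlen=window_size)  # flags of the most recent lines
    yeses = 0                           # running count of True flags in window
    preserved = set()                   # indices whose 'yes' must be kept
    for index, line in enumerate(lines):
        if len(window) == window_size and window[0]:
            yeses -= 1                  # oldest flag is evicted by the append below
        window.append('yes' in line)
        yeses += window[-1]             # True counts as 1, False as 0
        if len(window) == window_size and yeses >= min_yes:
            start = index - window_size + 1
            preserved.update(i for i, flag in enumerate(window, start) if flag)
    return [line if i in preserved else line.replace('yes', 'no')
            for i, line in enumerate(lines)]

# The two samples from the question
noisy = ['1,a1,b1,no,0.75', '2,a2,b2,no,0.45', '3,a3,b3,yes,0.98',
         '4,a4,b4,yes,0.22', '5,a5,b5,no,0.46', '6,a6,b6,no,0.20']
dense = noisy[:5] + ['6,a6,b6,yes,0.20']

cleaned_noisy = keep_dense_yes(noisy)  # both yeses are rewritten to no
cleaned_dense = keep_dense_yes(dense)  # lines 2 to 6 hold 3 yeses, so nothing changes
```

Passing a different min_yes or window_size changes the threshold without editing the loop. Note that an input shorter than window_size never fills the window, so all of its yeses are rewritten.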

edited Dec 17, 2019 at 21:57

answered Dec 17, 2019 at 21:25

blhsing

8 Comments

angel_30 Over a year ago

Thanks. Any chance not to use csv? Let's just generalize to a text file where yes exists in a line. I didn't add csv in my title to make it general.

blhsing Over a year ago

Edited accordingly then.

ggorlen Over a year ago

@TinaJ Since it's actually a CSV, I don't think telling people to generalize is going to help give you the correct solution. It's pretty trivial to read the CSV correctly and doesn't add any complexity (if anything, it reduces complexity).

blhsing Over a year ago

Edited accordingly then.

angel_30 Over a year ago

Great, it works. And yeses > 2 was supposed to be the max count of the whole window, but already know how to edit that. So Thanks!

sciroccorics · Answer · 2019-12-17 21:00:48Z

0

Here is a 5-liner solution based on building successive list comprehensions:

lines = [
'1,a1,b1,no,0.75',
'2,a2,b2,yes,0.45',
'3,a3,b3,yes,0.98',
'4,a4,b4,yes,0.22',
'5,a5,b5,no,0.46',
'6,a6,b6,no,0.98',
'7,a7,b7,yes,0.22',
'8,a8,b8,no,0.46',
'9,a9,b9,no,0.20']

n = len(lines)

# flag all lines containing 'yes' (pad with 2 empty lines at each end to avoid boundary problems)
flags = [line.count('yes') for line in ['', '']+lines+['', '']]
# count number of flags in the sliding window [p-2,p+2] centered on each line
counts = [sum(flags[p-2:p+3]) for p in range(2,n+2)]
# tag lines that need to be changed: flagged, but fewer than 3 flags in their window
tags = [flag > 0 and count < 3 for (flag,count) in zip(flags[2:],counts)]
# change tagged lines (use a separate index so that n is not overwritten)
for i in range(n):
  if tags[i]: lines[i] = lines[i].replace('yes','no')

print(lines)

Result:

['1,a1,b1,no,0.75',
 '2,a2,b2,yes,0.45',
 '3,a3,b3,yes,0.98',
 '4,a4,b4,yes,0.22',
 '5,a5,b5,no,0.46',
 '6,a6,b6,no,0.98',
 '7,a7,b7,no,0.22',
 '8,a8,b8,no,0.46',
 '9,a9,b9,no,0.20']

EDIT: Since you read your data from a standard text file, all you have to do is:

with open(filename,'r') as f:
  lines = f.read().strip().split('\n')

(strip removes potential blank lines at the top or bottom of the file, split('\n') turns the file content into a list of lines), then use the code above...
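For a complete read-and-write round trip, the comprehensions above can be wrapped in two small functions. This is a sketch under assumptions: the names clean_centered, clean_file, half_width and min_yes are hypothetical, and the window is the centered one [p-2,p+2] used in this answer. The demo writes the sample data to a temporary directory so that it runs without any existing file.

```python
import os
import tempfile

def clean_centered(lines, half_width=2, min_yes=3):
    """Change 'yes' to 'no' on lines whose centered window holds fewer than min_yes yeses."""
    pad = [0] * half_width
    flags = pad + [line.count('yes') for line in lines] + pad
    # padded index p + half_width is line p, so flags[p:p + 2*half_width + 1] is its window
    counts = [sum(flags[p:p + 2 * half_width + 1]) for p in range(len(lines))]
    return [line.replace('yes', 'no') if 'yes' in line and count < min_yes else line
            for line, count in zip(lines, counts)]

def clean_file(src_path, dst_path):
    """Read src_path fully into memory, clean it, and write the result to dst_path."""
    with open(src_path) as src:
        lines = src.read().splitlines()
    with open(dst_path, 'w') as dst:
        dst.writelines(line + '\n' for line in clean_centered(lines))

# Demo on the sample data, using throwaway files
sample = ['1,a1,b1,no,0.75', '2,a2,b2,yes,0.45', '3,a3,b3,yes,0.98',
          '4,a4,b4,yes,0.22', '5,a5,b5,no,0.46', '6,a6,b6,no,0.98',
          '7,a7,b7,yes,0.22', '8,a8,b8,no,0.46', '9,a9,b9,no,0.20']
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'in.csv')
    dst_path = os.path.join(tmp, 'out.csv')
    with open(src_path, 'w') as f:
        f.writelines(line + '\n' for line in sample)
    clean_file(src_path, dst_path)
    with open(dst_path) as f:
        cleaned = f.read().splitlines()
```

Reading the whole file at once keeps the code short, which should be fine for files of around 10MB as mentioned in the comments on the question.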

edited Dec 17, 2019 at 21:00

answered Dec 17, 2019 at 20:13

sciroccorics

2 Comments

angel_30 Over a year ago

Thanks. Can you complete your code with reading from and writing to a file?

sciroccorics Over a year ago

Nice chat indeed. I remove my comments also. Regards
