
I'm trying to implement a sliding/moving window approach over the lines of a CSV file using Python. Each line has a column with a binary value, yes or no. Basically, I want to remove rare yes noise. That means that if we have, say, 3 yes lines in a window of 5 lines, keep them. But if there are only 1 or 2, change them to no. How can I do that?

For instance, both yes values in the following should become no.

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,no,0.20
...

But in the following, we keep them as is (there is a window of 5 where 3 of the lines are yes):

...
1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,yes,0.20
...

I attempted writing something, having a window of 5, but got stuck (it is not complete):

        window_size = 5 
        filename='C:\\Users\\username\\v3\\And-'+v3file.split("\\")[5]
        with open(filename) as fin:
            with open('C:\\Users\\username\\v4\\And2-'+v3file.split("\\")[5],'w') as finalout:
                line= fin.readline()
                index = 0
                sequence= []
                accs=[]
                while line:
                    print(line)
                    for i in range(window_size):
                        line = fin.readline()
                        sequence.append(line)
                    index = index + 1
                    fin.seek(index)
  • Are you trying to solve this by keeping the most recent three rows in a variable/window? Commented Dec 17, 2019 at 19:17
  • @wwii Actually let's say max out of a window of 5 (3 yes not necessarily need to be all in sequence). Updated the question a bit. Commented Dec 17, 2019 at 19:22
  • Is the file very large? Is it important to read one line of the file at a time? If you read the entire file into memory, your problem becomes easier and the code becomes cleaner, and you don't have to do things like fin.seek. Commented Dec 17, 2019 at 19:22
  • Can you provide a more complete sample, and what the subsequent output should look like? Commented Dec 17, 2019 at 19:22
  • @vasia file can be up to 10MB. But if you think it fits memory, then fine. Commented Dec 17, 2019 at 19:24

2 Answers


You can use collections.deque with the maxlen argument set to the desired window size to implement a sliding window that keeps track of the yes/no flags of the most recent 5 rows. To be more efficient, keep a running count of yeses instead of summing the yeses in the sliding window on every iteration. Whenever the sliding window is full and the count of yeses is greater than 2, add the line indices of these yeses to a set of lines whose yeses should be kept as-is. Then, in a second pass after resetting the file pointer of the input, change the yeses to noes on every line whose index is not in the set:

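from collections import deque

window_size = 5
# filename and output_filename are the input and output paths
with open(filename) as fin, open(output_filename, 'w') as finalout:
    yeses = 0                           # number of yes flags currently in the window
    window = deque(maxlen=window_size)  # yes/no flags of the most recent rows
    preserved = set()                   # indices of the lines whose yes is kept
    # first pass: collect the yeses that sit in a full window with more than 2 yeses
    for index, line in enumerate(fin):
        window.append('yes' in line)
        if window[-1]:
            yeses += 1
        if len(window) == window_size:
            if yeses > 2:
                preserved.update(i for i, f in enumerate(window, index - window_size + 1) if f)
            # the oldest flag is dropped by the next append, so uncount it now
            if window[0]:
                yeses -= 1
    # second pass: rewrite every yes that was not preserved
    fin.seek(0)
    for index, line in enumerate(fin):
        if index not in preserved:
            line = line.replace('yes', 'no')
        finalout.write(line)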

Demo: https://repl.it/@blhsing/StripedCleanCopyrightinfringement
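Since the data is actually a CSV, one possible variant is to parse each row with the csv module and test only the flag column, so that a yes appearing in some other column is never touched. This is only a sketch under assumptions, not the code from the demo: the function name suppress_rare_yes and the parameters flag_col, window_size and min_yes are made up for illustration, and flag_col=3 assumes the flag is the fourth column, as in the question's sample. As with the code above, an input shorter than window_size never fills a window, so all of its yeses are changed.

```python
import csv
import io
from collections import deque


def suppress_rare_yes(rows, flag_col=3, window_size=5, min_yes=3):
    """Return a copy of rows in which rare 'yes' flags are set to 'no'.

    A 'yes' is kept only if some full window of window_size consecutive
    rows that contains it holds at least min_yes 'yes' flags.
    (Hypothetical helper; names and defaults are assumptions.)
    """
    rows = [list(row) for row in rows]
    window = deque(maxlen=window_size)  # flags of the most recent rows
    yeses = 0                           # running count of yes flags in the window
    preserved = set()                   # indices of rows whose yes is kept
    for index, row in enumerate(rows):
        if len(window) == window_size and window[0]:
            yeses -= 1                  # the oldest flag is evicted by the append below
        window.append(row[flag_col] == 'yes')
        yeses += window[-1]             # True counts as 1, False as 0
        if len(window) == window_size and yeses >= min_yes:
            start = index - window_size + 1
            preserved.update(i for i, f in enumerate(window, start) if f)
    for index, row in enumerate(rows):
        if row[flag_col] == 'yes' and index not in preserved:
            row[flag_col] = 'no'
    return rows


# The question's second example, read from an in-memory "file".
sample = """1,a1,b1,no,0.75
2,a2,b2,no,0.45
3,a3,b3,yes,0.98
4,a4,b4,yes,0.22
5,a5,b5,no,0.46
6,a6,b6,yes,0.20
"""
cleaned = suppress_rare_yes(csv.reader(io.StringIO(sample)))
out = io.StringIO()
csv.writer(out, lineterminator='\n').writerows(cleaned)
print(out.getvalue())
```

For this sample the output is identical to the input, because rows 2 to 6 form a window with three yes flags. Passing a different min_yes changes the threshold without touching the loop.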


8 Comments

Thanks. Any chance not to use csv? Let's just generalize to a text file where yes exists in a line. I didn't add csv in my title to make it general.
Edited accordingly then.
@TinaJ Since it's actually a CSV, I don't think telling people to generalize is going to help give you the correct solution. It's pretty trivial to read the CSV correctly and doesn't add any complexity (if anything, it reduces complexity).
Edited accordingly then.
Great, it works. The yeses > 2 check was supposed to be the max count of the whole window, but I already know how to edit that. Thanks!

Here is a 5-liner solution based on building successive list comprehensions:

lines = [
'1,a1,b1,no,0.75',
'2,a2,b2,yes,0.45',
'3,a3,b3,yes,0.98',
'4,a4,b4,yes,0.22',
'5,a5,b5,no,0.46',
'6,a6,b6,no,0.98',
'7,a7,b7,yes,0.22',
'8,a8,b8,no,0.46',
'9,a9,b9,no,0.20']

n = len(lines)

# flag all lines containing 'yes' (pad with 2 empty lines at each end to avoid boundary problems)
flags = [line.count('yes') for line in ['', '']+lines+['', '']]
# count number of flags in sliding window [p-2,p+2]
counts = [sum(flags[p-2:p+3]) for p in range(2,n+2)]
# tag lines that need to be changed
tags = [flag > 0 and count < 3 for (flag,count) in zip(flags[2:],counts)]
# change tagged lines
for i in range(n):
  if tags[i]: lines[i] = lines[i].replace('yes','no')

print(lines)

Result:

['1,a1,b1,no,0.75',
 '2,a2,b2,yes,0.45',
 '3,a3,b3,yes,0.98',
 '4,a4,b4,yes,0.22',
 '5,a5,b5,no,0.46',
 '6,a6,b6,no,0.98',
 '7,a7,b7,no,0.22',
 '8,a8,b8,no,0.46',
 '9,a9,b9,no,0.20']

EDIT: As you read your data from a standard text file, all you have to do is:

with open(filename, 'r') as f:
  lines = f.read().strip().split('\n')

(strip removes potential blank lines at the top or bottom of the file, and split('\n') turns the file content into a list of lines.) Then use the code above...
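Putting the pieces together, reading from and writing to a file could look roughly like the sketch below. The names denoise_lines and denoise_file and the parameters half and min_yes are hypothetical, chosen only for this illustration; half=2 reproduces the centered window [p-2, p+2] used above, and the demo writes to a temporary directory rather than to any real path.

```python
import os
import tempfile


def denoise_lines(lines, half=2, min_yes=3):
    """Replace 'yes' with 'no' on lines whose centered window has too few yes lines."""
    n = len(lines)
    pad = [0] * half                      # zero flags so boundary windows stay in range
    flags = pad + [int('yes' in line) for line in lines] + pad
    # number of yes lines in the window [p-half, p+half] around each real line
    counts = [sum(flags[p - half:p + half + 1]) for p in range(half, n + half)]
    return [line.replace('yes', 'no') if 'yes' in line and count < min_yes else line
            for line, count in zip(lines, counts)]


def denoise_file(src_path, dst_path):
    """Read src_path, filter its lines, and write the result to dst_path."""
    with open(src_path) as src:
        lines = src.read().strip().split('\n')
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(denoise_lines(lines)) + '\n')


sample_lines = [
    '1,a1,b1,no,0.75',
    '2,a2,b2,yes,0.45',
    '3,a3,b3,yes,0.98',
    '4,a4,b4,yes,0.22',
    '5,a5,b5,no,0.46',
    '6,a6,b6,no,0.98',
    '7,a7,b7,yes,0.22',
    '8,a8,b8,no,0.46',
    '9,a9,b9,no,0.20',
]

# Demo on temporary files; the file names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'in.csv')
    dst_path = os.path.join(tmp, 'out.csv')
    with open(src_path, 'w') as f:
        f.write('\n'.join(sample_lines) + '\n')
    denoise_file(src_path, dst_path)
    with open(dst_path) as f:
        result = f.read().splitlines()

print(result)
```

Only line 7 changes here, as in the result shown above. Note that, like the list-comprehension version, this matches 'yes' anywhere in the line rather than in one specific column.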

2 Comments

Thanks. Can you complete your code with reading from and writing to a file?
Nice chat indeed. I remove my comments also. Regards
