
I have a list of words (tokens) that I iterate over. I want to apply a transformation to moving windows of that list. The window size can vary.

for i in range(0, len(tokens) - window_size + 1, step):
    doc2vec.model.infer_vector(tokens[i:i + window_size])

The for loop steps through the tokens at the stride defined in the step variable, taking as many tokens as window_size says each time. The problem I see is in the last iteration: the range ends at the length of the tokens minus the window size (+1 so that that start index is included). Say the window size is 10, the step is 5, and the length of tokens is 98. In that case my code does its last calculation on 85:95 and leaves out the last three elements. I want a solution that works for any window_size, step, and token length. To illustrate: as it stands, it works fine if the length of tokens is 95, but if it is 98, three elements are left over. I would want them processed together as 88:98.
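To make the gap concrete, here is a quick check (a sketch using the example numbers) of which start indices a range-based loop over full windows actually visits:

```python
tokens = list(range(98))
window_size, step = 10, 5

# start indices visited by a range-based loop over full windows
starts = list(range(0, len(tokens) - window_size + 1, step))
print(starts[-1], starts[-1] + window_size)       # last window is 85:95
print(len(tokens) - (starts[-1] + window_size))   # 3 elements are never covered
```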

  • But should the last window overlap differently from the step? In your example the last batch is 85:95; do you want an additional 88:98 batch, overriding the current step? Commented Oct 6, 2020 at 11:13
  • Yes, I want the window 85:95 processed and then the window 88:98. Commented Oct 6, 2020 at 18:58

1 Answer


I think the way to go is to create your own custom iterator:

class MovingWindow:
    def __init__(self, tokens, window_size, step):
        self.current = -step  # advanced by step before each window is returned
        self.last = len(tokens) - window_size + 1
        # leftover elements after the last regular window
        self.remaining = (len(tokens) - window_size) % step
        self.tokens = tokens
        self.window_size = window_size
        self.step = step

    def __iter__(self):
        return self

    def __next__(self):
        self.current += self.step
        if self.current < self.last:
            return self.tokens[self.current : self.current + self.window_size]
        elif self.remaining:
            # one extra window, aligned to the end of the list
            self.remaining = 0
            return self.tokens[-self.window_size:]
        else:
            raise StopIteration

which you can use like this:

for t in MovingWindow(tokens, 10, 5):
    doc2vec.model.infer_vector(t)

You could also modify the iterator so it returns the indices instead of the tokens. Another option is to create a simple generator; more information here
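For instance, a generator version could look like the following sketch (the name `moving_windows` is mine; it assumes `len(tokens) >= window_size`):

```python
def moving_windows(tokens, window_size, step):
    # Yield full windows at the given step; if elements remain at the end,
    # yield one extra window aligned to the end of the list.
    # Assumes len(tokens) >= window_size.
    last = len(tokens) - window_size
    for i in range(0, last + 1, step):
        yield tokens[i:i + window_size]
    if last % step:
        yield tokens[-window_size:]
```

With 98 tokens, window 10 and step 5, the final window is 88:98; with 95 tokens no extra window is emitted, so there is no duplicate at the end.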

To illustrate with the example case you provided:

indexes = list(range(98))
for i in MovingWindow(indexes, 10, 5):
    print(f'{i[0]}:{i[-1]}')

output:

0:9
5:14
10:19
15:24
20:29
25:34
30:39
35:44
40:49
45:54
50:59
55:64
60:69
65:74
70:79
75:84
80:89
85:94
88:97

6 Comments

Thank you. self.remaining = (len(tokens) + step) % window_size I think is a confusing way to calculate the leftover words; len(tokens) - window_size gives the actual number. However, your way does not lead to a faulty result.
Hi Borut. I guess you meant "len(tokens) % window_size", right? I tried this at first, but it leads to an error when len(tokens) = 95. As you see, it will get the remainder of /10, which is 5, but you will get a duplicate list at the end since the step matches perfectly. I've rerun my tests and actually I made a mistake; the correct way is "(len(tokens) + window_size) % step". Please let me know if you find a better way to simplify, and thanks for the response.
Yes, (len(tokens)+window_size) % step is the correct one or (len(tokens)+window_size) % step it is the same thing. What type of tests did you use?
I mean (len(tokens)-window_size) % step is the same.
Actually, (len(tokens)-window_size) % step appears correct and (len(tokens)+window_size) % step does not. I found a counterexample: len 82, window 7, step 5. If you use the first formula you get 0 remaining, and if you use the second you get 4.
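The counterexample can be checked quickly (a sketch reproducing the numbers from the comment):

```python
length, window_size, step = 82, 7, 5

# starts 0, 5, ..., 75; the last window 75:82 covers the end exactly,
# so no extra end-aligned window is needed
print((length - window_size) % step)  # 0 -> correctly signals no leftover
print((length + window_size) % step)  # 4 -> wrongly signals a leftover
```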
