python sliding window average for letters

Question

I have a file like this:

chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3

What I would like to do is: setting a window size (let's say 2), move with it along the file, and compute average of column 4 and % of G+C inside the window.

I have sth like this for now:

import numpy
def movingaverage(interval, window_size):
    window = numpy.ones(int(window_size))/float(window_size)
    return numpy.convolve(interval, window, 'same')

it does the work for 4th column, but I don't know how to apply this to calculate content of G+C inside the window frame

Cheers, Irek

Not sure I understand. Do you mean if the content is G+A, the % is 50%? And if it is G+C, it is 100%? — Dman2
– Dman2, Commented Sep 6, 2013 at 11:02

ovgolovin · Accepted Answer · 2013-09-06 11:36:49Z

1

from __future__ import division
from itertools import tee, izip
from collections import Counter

text = '''\
chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3'''


def window(iterable, size):
    iters = tee(iterable, size)
    for i in xrange(1, size):
        for each in iters[i:]:
            next(each, None)
    return izip(*iters)


def get_avg(lists, column):
    return sum(zip(*lists)[column]) / len(lists)


def get_GC_percentage(lists, column):
    counts = Counter(zip(*lists)[column])
    return (counts['C'] + counts['G']) / len(lists)


line_tuples = (line.split() for line in text.split('\n'))
line_tuples_casted = ((a,int(b),c,int(d)) for a,b,c,d in line_tuples)
line_tuples_chunks = window(line_tuples_casted, 2)

for (i,chunk) in enumerate(line_tuples_chunks):
    print 'i: {:2} | avg: {} | GC_content: {:5.0%}'.format(i, get_avg(chunk, 3), get_GC_percentage(chunk, 2))

Output:

i:  0 | avg: 3.0 | GC_content:   50%
i:  1 | avg: 3.0 | GC_content:   50%
i:  2 | avg: 2.5 | GC_content:   50%
i:  3 | avg: 1.5 | GC_content:  100%
i:  4 | avg: 1.5 | GC_content:   50%
i:  5 | avg: 2.5 | GC_content:   50%
i:  6 | avg: 3.0 | GC_content:  100%
i:  7 | avg: 4.0 | GC_content:   50%
i:  8 | avg: 6.5 | GC_content:    0%
i:  9 | avg: 4.5 | GC_content:    0%
i: 10 | avg: 0.5 | GC_content:   50%
i: 11 | avg: 0.0 | GC_content:  100%
i: 12 | avg: 0.0 | GC_content:  100%
i: 13 | avg: 1.0 | GC_content:  100%
i: 14 | avg: 2.5 | GC_content:   50%
i: 15 | avg: 3.0 | GC_content:    0%

But note, this is not quite optimal solution. We could do better by not calculating average on each iteration for the whole window, but update it using the values which leave the window and come to it.

edited Sep 6, 2013 at 11:36

answered Sep 6, 2013 at 11:26

ovgolovin

13.5k7 gold badges53 silver badges80 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

Irek Over a year ago

hmm where exactly can I scale the window size? I assume that line_tuples_chunks ? Or do i also have change the value somewhere else?

ovgolovin Over a year ago

@lerk Yeah, window size is set at line_tuples_chunks = window(line_tuples_casted, 2). It is the only place the window size is determined. It is a good coding style to define the same functionality only once.

jabaldonedo · Accepted Answer · 2013-09-06 11:30:17Z

1

If I have understood what you want to do, here is one way (not optimal but works):

First data file file.data

chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3

Now the script:

import numpy as np

d = {'A':0, 'G': 1, 'T':2, 'C':3, 'U':4}
data = np.loadtxt('file.data', delimiter=' ', converters = {0: lambda x: int(x[-1]), 2: lambda x: d[x]})
win_size = 2

for i in range(data.shape[0] / win_size):
    m = data[i:i+win_size,:]
    avg = np.mean(m[:,3])
    cg_per = float(np.where( ( m[:,2] == d['G'] )| ( m[:,2] == d['C']) )[0].shape[0]) * 100 / win_size

    print "Window {0} avg:{1} C+G={2}%".format(i, avg, cg_per)

It will generate:

Window 0 avg:3.0 C+G=50.0%
Window 1 avg:3.0 C+G=50.0%
Window 2 avg:2.5 C+G=50.0%
Window 3 avg:1.5 C+G=100.0%
Window 4 avg:1.5 C+G=50.0%
Window 5 avg:2.5 C+G=50.0%
Window 6 avg:3.0 C+G=100.0%
Window 7 avg:4.0 C+G=50.0%

answered Sep 6, 2013 at 11:30

jabaldonedo

26.7k8 gold badges80 silver badges77 bronze badges

1 Comment

Irek Over a year ago

ahh I forgot to mention that's a huge file. dictionary will kill my memory, not to mention index is easily out of range

Jaime · Accepted Answer · 2013-09-06 13:02:25Z

1

You seem to already know how to get your data into a numpy array, so lets say you have them in an array named bases. If you now do:

base_mask = (bases == 'G') | (basses == 'C')

You have a boolean mask with True wherever the array has a G or a C, and False elsewhere. Since booleans casted to ints have the Trues converted to 1s and the Falses to 0s, simply compute the average on the base_mask array the same way you are doing for interval.

answered Sep 6, 2013 at 13:02

Jaime

67.7k19 gold badges128 silver badges164 bronze badges

1 Comment

Irek Over a year ago

yep, I thought of it, even in simpler form, simply changing all C,G to 1, other to 0 without intermediate false/true (don't know if ommiting boolean would make it quicker)

simleo · Accepted Answer · 2013-09-09 12:53:44Z

1

Deques are great for sliding windows. Try the following:

from collections import deque
from itertools import islice

FN = "temp.txt"
WSIZE = 2

def gen_stream(f):
    for line in f:
        line = line.split()
        yield [1 if line[2] in 'GC' else 0, int(line[3])]

def overlapping():
    with open(FN) as f:
        stream = gen_stream(f)
        window = deque([stream.next() for _ in xrange(WSIZE-1)], WSIZE)
        for row in stream:
            window.append(row)
            print [sum(row[i] for row in window)/float(WSIZE) for i in xrange(2)]

def non_overlapping():
    with open(FN) as f:
        stream = gen_stream(f)
        while True:
            chunk = list(islice(stream, WSIZE))
            if not chunk:
                break
            print [sum(row[i] for row in chunk)/float(len(chunk)) for i in xrange(2)]

This is scalable, i.e., works for a huge file.

edited Sep 9, 2013 at 12:53

answered Sep 6, 2013 at 13:32

simleo

3,00525 silver badges25 bronze badges

4 Comments

Irek Over a year ago

cool, great solution. I will look if there aren't any unusal things. This is no overlapping window, right?

Irek Over a year ago

I tried to change the window size, and swittched WSIZE only, but I have a feeling, that it isn't enough?

simleo Over a year ago

My initial solution was meant for overlapping windows. I just edited the original answer to include a solution for non-overlapping ones. In the latter case, using a deque would bring little advantage

ovgolovin Over a year ago

overlapping seems to raise StopIteration for example when WSIZE=3 and stream has only one value in it.

Collectives™ on Stack Overflow

python sliding window average for letters

4 Answers 4

2 Comments

1 Comment

1 Comment

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

2 Comments

1 Comment

1 Comment

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related