1

I have a file like this:

chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3

What I would like to do is: setting a window size (let's say 2), move with it along the file, and compute average of column 4 and % of G+C inside the window.

I have sth like this for now:

import numpy
def movingaverage(interval, window_size):
    window = numpy.ones(int(window_size))/float(window_size)
    return numpy.convolve(interval, window, 'same')

it does the work for 4th column, but I don't know how to apply this to calculate content of G+C inside the window frame

Cheers, Irek

2
  • Not sure I understand. Do you mean if the content is G+A, the % is 50%? And if it is G+C, it is 100%? Commented Sep 6, 2013 at 11:02
  • yep. in simples form: content of the G+C inside the window Commented Sep 6, 2013 at 11:04

4 Answers 4

1
from __future__ import division
from itertools import tee, izip
from collections import Counter

text = '''\
chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3'''


def window(iterable, size):
    iters = tee(iterable, size)
    for i in xrange(1, size):
        for each in iters[i:]:
            next(each, None)
    return izip(*iters)


def get_avg(lists, column):
    return sum(zip(*lists)[column]) / len(lists)


def get_GC_percentage(lists, column):
    counts = Counter(zip(*lists)[column])
    return (counts['C'] + counts['G']) / len(lists)


line_tuples = (line.split() for line in text.split('\n'))
line_tuples_casted = ((a,int(b),c,int(d)) for a,b,c,d in line_tuples)
line_tuples_chunks = window(line_tuples_casted, 2)

for (i,chunk) in enumerate(line_tuples_chunks):
    print 'i: {:2} | avg: {} | GC_content: {:5.0%}'.format(i, get_avg(chunk, 3), get_GC_percentage(chunk, 2))

Output:

i:  0 | avg: 3.0 | GC_content:   50%
i:  1 | avg: 3.0 | GC_content:   50%
i:  2 | avg: 2.5 | GC_content:   50%
i:  3 | avg: 1.5 | GC_content:  100%
i:  4 | avg: 1.5 | GC_content:   50%
i:  5 | avg: 2.5 | GC_content:   50%
i:  6 | avg: 3.0 | GC_content:  100%
i:  7 | avg: 4.0 | GC_content:   50%
i:  8 | avg: 6.5 | GC_content:    0%
i:  9 | avg: 4.5 | GC_content:    0%
i: 10 | avg: 0.5 | GC_content:   50%
i: 11 | avg: 0.0 | GC_content:  100%
i: 12 | avg: 0.0 | GC_content:  100%
i: 13 | avg: 1.0 | GC_content:  100%
i: 14 | avg: 2.5 | GC_content:   50%
i: 15 | avg: 3.0 | GC_content:    0%

But note, this is not quite optimal solution. We could do better by not calculating average on each iteration for the whole window, but update it using the values which leave the window and come to it.

Sign up to request clarification or add additional context in comments.

2 Comments

hmm where exactly can I scale the window size? I assume that line_tuples_chunks ? Or do i also have change the value somewhere else?
@lerk Yeah, window size is set at line_tuples_chunks = window(line_tuples_casted, 2). It is the only place the window size is determined. It is a good coding style to define the same functionality only once.
1

If I have understood what you want to do, here is one way (not optimal but works):

First data file file.data

chr1 1 A 3
chr1 2 G 3
chr1 3 T 3
chr1 4 C 2
chr1 5 G 1
chr1 6 T 2
chr1 7 G 3
chr1 8 C 3
chr1 9 A 5
chr1 10 A 8
chr2 5 A 1
chr2 6 G 0
chr2 7 G 0
chr2 8 G 0
chr2 9 C 2
chr2 10 T 3
chr2 11 A 3

Now the script:

import numpy as np

d = {'A':0, 'G': 1, 'T':2, 'C':3, 'U':4}
data = np.loadtxt('file.data', delimiter=' ', converters = {0: lambda x: int(x[-1]), 2: lambda x: d[x]})
win_size = 2

for i in range(data.shape[0] / win_size):
    m = data[i:i+win_size,:]
    avg = np.mean(m[:,3])
    cg_per = float(np.where( ( m[:,2] == d['G'] )| ( m[:,2] == d['C']) )[0].shape[0]) * 100 / win_size

    print "Window {0} avg:{1} C+G={2}%".format(i, avg, cg_per)  

It will generate:

Window 0 avg:3.0 C+G=50.0%
Window 1 avg:3.0 C+G=50.0%
Window 2 avg:2.5 C+G=50.0%
Window 3 avg:1.5 C+G=100.0%
Window 4 avg:1.5 C+G=50.0%
Window 5 avg:2.5 C+G=50.0%
Window 6 avg:3.0 C+G=100.0%
Window 7 avg:4.0 C+G=50.0%

1 Comment

ahh I forgot to mention that's a huge file. dictionary will kill my memory, not to mention index is easily out of range
1

You seem to already know how to get your data into a numpy array, so lets say you have them in an array named bases. If you now do:

base_mask = (bases == 'G') | (basses == 'C')

You have a boolean mask with True wherever the array has a G or a C, and False elsewhere. Since booleans casted to ints have the Trues converted to 1s and the Falses to 0s, simply compute the average on the base_mask array the same way you are doing for interval.

1 Comment

yep, I thought of it, even in simpler form, simply changing all C,G to 1, other to 0 without intermediate false/true (don't know if ommiting boolean would make it quicker)
1

Deques are great for sliding windows. Try the following:

from collections import deque
from itertools import islice

FN = "temp.txt"
WSIZE = 2

def gen_stream(f):
    for line in f:
        line = line.split()
        yield [1 if line[2] in 'GC' else 0, int(line[3])]

def overlapping():
    with open(FN) as f:
        stream = gen_stream(f)
        window = deque([stream.next() for _ in xrange(WSIZE-1)], WSIZE)
        for row in stream:
            window.append(row)
            print [sum(row[i] for row in window)/float(WSIZE) for i in xrange(2)]

def non_overlapping():
    with open(FN) as f:
        stream = gen_stream(f)
        while True:
            chunk = list(islice(stream, WSIZE))
            if not chunk:
                break
            print [sum(row[i] for row in chunk)/float(len(chunk)) for i in xrange(2)]

This is scalable, i.e., works for a huge file.

4 Comments

cool, great solution. I will look if there aren't any unusal things. This is no overlapping window, right?
I tried to change the window size, and swittched WSIZE only, but I have a feeling, that it isn't enough?
My initial solution was meant for overlapping windows. I just edited the original answer to include a solution for non-overlapping ones. In the latter case, using a deque would bring little advantage
overlapping seems to raise StopIteration for example when WSIZE=3 and stream has only one value in it.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.