I am working with DNA sequence data, and I would like to count how often each letter (A, C, G, T) occurs in each sequence in my dataset.

To do so, I tried the following, using Counter from the collections module, with good results:

from collections import Counter

# Build one Counter of letter counts per sequence
df = []
for seq in pseudomona.sequence_DNA:
    df.append(Counter(seq))

[Counter({'C': 2156779, 'A': 1091782, 'G': 2143630, 'T': 1090617}),
 Counter({'T': 1050880, 'G': 2083283, 'C': 2101448, 'A': 1055877}),
 Counter({'C': 2180966, 'A': 1111267, 'G': 2176873, 'T': 1108010}),
 Counter({'C': 2196325, 'G': 2204478, 'A': 1128017, 'T': 1123038}),
 Counter({'T': 1117153, 'C': 2176409, 'A': 1115003, 'G': 2194606}),
 Counter({'G': 2054304, 'A': 1026830, 'T': 1044090, 'C': 2020029})]

However, this gives me a list of Counter instances (sorry if that's not the right terminology), and I would like a data frame of those counts with the columns in sorted order, for instance:

A C G T
2237 4415 124 324
4565 8567 3776 623

I have tried converting it into a list of lists, but then I cannot figure out how to turn that into a pandas DataFrame:

[list(items.items()) for items in df]

[[('C', 2156779), ('A', 1091782), ('G', 2143630), ('T', 1090617)],
 [('T', 1050880), ('G', 2083283), ('C', 2101448), ('A', 1055877)],
 [('C', 2180966), ('A', 1111267), ('G', 2176873), ('T', 1108010)],
 [('C', 2196325), ('G', 2204478), ('A', 1128017), ('T', 1123038)],
 [('T', 1117153), ('C', 2176409), ('A', 1115003), ('G', 2194606)],
 [('G', 2054304), ('A', 1026830), ('T', 1044090), ('C', 2020029)]]

It might be something foolish, but I can't figure out how to do it properly. Hope someone has the right clue! :)

2 Answers

Make a Series out of each Counter, combine them with pd.concat using axis=1, and transpose the result (here l is the list of Counter objects):

df = pd.concat([pd.Series(c) for c in l], axis=1).T

Output:

>>> df
         C        A        G        T
0  2156779  1091782  2143630  1090617
1  2101448  1055877  2083283  1050880
2  2180966  1111267  2176873  1108010
3  2196325  1128017  2204478  1123038
4  2176409  1115003  2194606  1117153
5  2020029  1026830  2054304  1044090
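
As a self-contained sketch of the approach above: the toy Counter values below are made up for illustration, and `counters` stands in for the answer's `l`. Sorting the column labels afterwards should give the A, C, G, T order asked for in the question.

```python
from collections import Counter

import pandas as pd

# Toy counts, invented for illustration; key order differs between rows on purpose.
counters = [
    Counter({"C": 3, "A": 1, "G": 2, "T": 4}),
    Counter({"T": 5, "G": 6, "C": 7, "A": 8}),
]

# One Series per Counter, placed side by side as columns, then transposed
# so that each Counter becomes a row.
df = pd.concat([pd.Series(c) for c in counters], axis=1).T

# Sort the column labels alphabetically: A, C, G, T.
df = df.sort_index(axis=1)
print(df)
```

The values are aligned by letter rather than by position, so the differing key order inside each Counter does not matter.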

A list of Counter objects can be passed to DataFrame.from_records the same way a list of dicts can:

df = pd.DataFrame.from_records(lst)

df:

         C        A        G        T
0  2156779  1091782  2143630  1090617
1  2101448  1055877  2083283  1050880
2  2180966  1111267  2176873  1108010
3  2196325  1128017  2204478  1123038
4  2176409  1115003  2194606  1117153
5  2020029  1026830  2054304  1044090

The columns argument can be passed to set the column order, and to cope with extra or missing keys:

df = pd.DataFrame.from_records(lst, columns=['A', 'C', 'G', 'T'])

df:

         A        C        G        T
0  1091782  2156779  2143630  1090617
1  1055877  2101448  2083283  1050880
2  1111267  2180966  2176873  1108010
3  1128017  2196325  2204478  1123038
4  1115003  2176409  2194606  1117153
5  1026830  2020029  2054304  1044090
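
To illustrate the extra/missing key case, here is a minimal sketch on made-up toy sequences (not the data from the question). It assumes that a missing letter should count as zero, which is why the NaN values are filled afterwards.

```python
from collections import Counter

import pandas as pd

# Toy counts, invented for illustration: the first has no 'T',
# the second has no 'C' or 'G' and carries an extra 'N' key.
counters = [Counter("AACG"), Counter("TTNA")]

# Only the listed columns are kept, in the listed order; keys absent
# from a Counter come out as NaN.
df = pd.DataFrame.from_records(counters, columns=["A", "C", "G", "T"])

# Assumption: a missing base means a count of zero. Fill and restore integer dtype.
df = df.fillna(0).astype(int)
print(df)
```

Without the fillna step the affected columns would hold floats, since NaN cannot be stored in an integer column.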

Setup:

from collections import Counter

import pandas as pd

lst = [Counter({'C': 2156779, 'A': 1091782, 'G': 2143630, 'T': 1090617}),
       Counter({'T': 1050880, 'G': 2083283, 'C': 2101448, 'A': 1055877}),
       Counter({'C': 2180966, 'A': 1111267, 'G': 2176873, 'T': 1108010}),
       Counter({'C': 2196325, 'G': 2204478, 'A': 1128017, 'T': 1123038}),
       Counter({'T': 1117153, 'C': 2176409, 'A': 1115003, 'G': 2194606}),
       Counter({'G': 2054304, 'A': 1026830, 'T': 1044090, 'C': 2020029})]
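
For completeness, a sketch that goes from raw sequences straight to the DataFrame. The `genomes` frame and its short sequences are hypothetical stand-ins for the question's `pseudomona` data; only the `sequence_DNA` column name is taken from the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the question's `pseudomona` DataFrame.
genomes = pd.DataFrame({"sequence_DNA": ["ACGTAC", "GGCCTA", "TTGCAA"]})

# Count the letters of each sequence and build the frame in one step,
# fixing the column order to A, C, G, T.
counts = pd.DataFrame.from_records(
    [Counter(seq) for seq in genomes["sequence_DNA"]],
    columns=["A", "C", "G", "T"],
)
print(counts)
```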

1 Comment

Good answer, better than mine.
