sorting and counting codons using bash and grep -c [closed]

Question

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Closed 6 years ago.


I have a text file with several lines of codons. Each line holds one three-nucleotide sequence: each nucleotide can be A, T, G, or C, but there are only three per line (e.g. ATC). I want to write a while loop that reads these lines, counts them, and outputs each codon with the number of times it occurs in the file, ordered from highest to lowest.

You can't use awk in this loop, only grep and uniq. Thanks.

Why no awk? Is this some kind of homework? Also, sort would be convenient.
– choroba, Commented Nov 3, 2019 at 21:08
"I want to write": then do it. You can find plenty of help online on how to read a file line by line or how to count unique lines. If you want others to do the job for you, try freelancing sites, where you pay others for their work.
– KamilCuk, Commented Nov 3, 2019 at 21:11
From your reply, plus the comments below the dash-o answer, your question now seems more complex. Could you please (a) show a simple example of the input (codons, other text) and the output you need, and (b) give some more details as to why exactly someone would use only grep and uniq when other simpler and equally common tools exist? Any solution with grep + uniq would probably be less efficient and harder for maintainers of your code than sort + uniq (which are very common). Or do you simply need to filter with grep -P '^[ACGT]{3}$' before sort | uniq -c?
– Timur Shtatland, Commented Nov 4, 2019 at 2:57

dash-o · Accepted Answer · 2019-11-04 04:50:42Z

2

You can combine grep (to filter lines that contain only ATGC sequences) with sort and uniq (to count the distinct lines), then an extra sort to order from highest to lowest:

grep '^[ATGC]\+$' | sort | uniq -c | sort -k1nr

This will work for a reasonably sized file (certainly for fewer than 1M lines). For larger files, consider an awk/Perl/Python solution to avoid the overhead of sorting the complete file.
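As an illustration only, here is a minimal sketch of the same pipeline run against made-up sample data. The temporary file and its contents are hypothetical, and the stricter `{3}` pattern (exactly three letters per line, as suggested in the comments on the question) is an assumption about what the asker needs rather than part of the answer's command.

```shell
# Hypothetical sample input: one entry per line, including lines that are not codons.
input=$(mktemp)
printf '%s\n' ATG GGC ATG TTA 'not a codon' ATGC GGC ATG > "$input"

# 1. grep -E keeps only lines made of exactly three of A, C, G, T.
# 2. sort groups identical codons so that uniq -c can count them.
# 3. The final sort ranks by count (numeric, descending), then by codon for ties.
counts=$(grep -E '^[ACGT]{3}$' "$input" | sort | uniq -c | sort -k1,1nr -k2)
echo "$counts"   # ATG is counted 3 times, GGC twice, TTA once

rm -f "$input"
```

The width of the count column printed by `uniq -c` differs between implementations, so any script that consumes this output should split on whitespace rather than rely on fixed columns.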

edited Nov 4, 2019 at 4:50

answered Nov 3, 2019 at 21:13

dash-o


3 Comments

Dharmanand Ravirajan Over a year ago

Thanks for the reply. I know I can sort and uniq. I don't know how to use grep to search. Usually, if it's a word or pattern, I can use grep -c 'xx'. In my case each character could be an A, T, G, or C, and there can be only three of them per line.

dash-o Over a year ago

Do you mean that there are other lines in the file that need to be filtered out before the sort?

Dharmanand Ravirajan Over a year ago

Yes. It's a text file with several lines. I need to parse these and rank the words by the number of times they are repeated.
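For the loop-based variant described in the question, one possible sketch is a while loop that calls grep -c once per distinct codon. This is not from the accepted answer: the sample file is hypothetical, and sort is still used (an assumption beyond the "grep and uniq only" constraint), because uniq only collapses adjacent duplicate lines.

```shell
# Hypothetical sample input mixing codons with other text.
input=$(mktemp)
printf '%s\n' ATG GGC ATG TTA 'other text' GGC ATG > "$input"

ranked=$(
  # List each distinct three-letter codon once.
  grep -E '^[ACGT]{3}$' "$input" | sort | uniq |
  while read -r codon; do
    # grep -c counts matching lines; -x requires a whole-line match,
    # -F treats the codon as a literal string rather than a regex.
    printf '%s %s\n' "$(grep -cxF "$codon" "$input")" "$codon"
  done | sort -k1,1nr -k2   # highest count first, codon as tie-breaker
)
echo "$ranked"

rm -f "$input"
```

This rereads the file once per distinct codon, so it is slower than a single sort | uniq -c pass; with at most 64 possible codons that cost stays bounded.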
