I have two files:

seqs.fa:

>seq000007;size=72768;
ACTGTGAG
>seq000010;size=53132;
GTAAGATC
GAATTCTT
>seq00045;size=40321;
ACCCATTT
...  

numbers.txt:

72768
53132

My desired output is the header lines from the first file that match a number from the second file:

>seq000007;size=72768;
>seq000010;size=53132;

I attempted to use awk, but it only returns lines matching the first number:

awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) A[$i]; next} END {for (header in A) {if ( match(header,$1) ) {print header}}}'  seqs.fa numbers.txt

seq000007;size=72768;
seq072768;size=1;

Why is awk only looping through the "header" array for the first line in numbers.txt? And, if this is an XY problem, is there a better way to accomplish this goal?

2 Answers

2

After fixing the typo in your numbers file:

$ awk -F'=|;' 'NR==FNR{a[$1]; next}; $3 in a' numbers.txt seqs.fa

>seq000007;size=72768;
>seq000010;size=53132;
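
For anyone wondering how the one-liner works, here is a small annotated sketch. It assumes the sample data from the question, recreated in a scratch directory; the variable names `tmp` and `result` are only for illustration.

```shell
# Recreate the sample inputs in a scratch directory (file names as in the question).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>seq000007;size=72768;' 'ACTGTGAG' \
              '>seq000010;size=53132;' 'GTAAGATC' 'GAATTCTT' \
              '>seq00045;size=40321;' 'ACCCATTT' > seqs.fa
printf '%s\n' 72768 53132 > numbers.txt

# -F'=|;' splits every line on "=" or ";", so the header
#   >seq000007;size=72768;
# becomes $1=">seq000007", $2="size", $3="72768".
# While numbers.txt is being read (NR==FNR), each number is stored as a key of a[].
# For seqs.fa, a line is printed when its third field is one of those keys.
result=$(awk -F'=|;' 'NR==FNR { a[$1]; next } $3 in a' numbers.txt seqs.fa)
printf '%s\n' "$result"
```

Because the lookup is an exact comparison on the size field, a header such as `>seq072768;size=1;` is not printed for the number 72768. One caveat: a blank line in numbers.txt would add the empty string as a key, and the sequence lines (which have no third field) would then be printed too.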
4 Comments

Thanks, I edited the question to remove the typo. This gives the desired output. Any ideas why my awk command above doesn't work?
You have to match $1 in header, not the other way around, but it's an inefficient approach.
I think that's what I'm doing: the call is match(string, regex), unlike the match functions I'm used to in Python.
Right, perhaps it's your record structure then. With RS=">", your second file contains no ">" and is read as a single record, so in the END block $1 is only its first line.
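
To make the last comment concrete, here is a rough sketch, assuming the sample data from the question. It first counts how many records awk sees in numbers.txt when RS=">", then shows one possible way to keep the original record-based approach while looping over every number. The array names `hdr` and `num` are made up for this example, and `index()` on "size=N;" is swapped in for the original `match()` call so that substrings of the sequence ID do not match.

```shell
# Recreate the sample inputs in a scratch directory (file names as in the question).
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>seq000007;size=72768;' 'ACTGTGAG' \
              '>seq000010;size=53132;' 'GTAAGATC' 'GAATTCTT' \
              '>seq00045;size=40321;' 'ACCCATTT' > seqs.fa
printf '%s\n' 72768 53132 > numbers.txt

# numbers.txt contains no ">", so with RS=">" the whole file is one record.
records=$(awk -v RS='>' 'END { print NR }' numbers.txt)
echo "records in numbers.txt: $records"

# Sketch of a fix: store only the header line of each FASTA record,
# store every non-empty field (line) of numbers.txt, then compare all pairs.
fixed=$(awk -F'\n' -v RS='>' '
  NR == FNR { if ($1 != "") hdr[$1]; next }            # seqs.fa: header is field 1
  { for (i = 1; i <= NF; i++) if ($i != "") num[$i] }  # numbers.txt: one record, many fields
  END {
    for (h in hdr)
      for (n in num)
        if (index(h, "size=" n ";")) print ">" h
  }' seqs.fa numbers.txt | sort)
printf '%s\n' "$fixed"
```

The `sort` is there only because awk does not guarantee any order when iterating with `for (h in hdr)`. This nested loop compares every header against every number, so the hash lookup in the answer above should scale better on large files.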
0

In this special case you can use GNU grep like this:

grep -F -f numbers.txt seqs.fa

The option -f filename reads the search patterns from filename, one per line. The option -F tells grep that the patterns are fixed strings rather than regular expressions.

1 Comment

Note that this will match the numbers as substrings anywhere in a line, so 72768 would also match a header like >seq072768;size=1;.
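
One possible way around the substring issue, sketched under the assumption that every header has the form `...;size=N;` as in the question: wrap each number in `size=` and `;` before handing the patterns to grep. Reading patterns from standard input with `-f -` works with GNU grep; other implementations may differ. The variable names `loose` and `strict` are only for illustration.

```shell
# Sample data from the question, plus the extra header from its output listing.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' '>seq000007;size=72768;' 'ACTGTGAG' \
              '>seq000010;size=53132;' 'GTAAGATC' 'GAATTCTT' \
              '>seq072768;size=1;' 'ACCCATTT' > seqs.fa
printf '%s\n' 72768 53132 > numbers.txt

# Plain fixed-string search: "72768" is also found inside "seq072768".
loose=$(grep -F -f numbers.txt seqs.fa)

# Turn each number N into the pattern "size=N;" so only the size field can match.
strict=$(sed 's/.*/size=&;/' numbers.txt | grep -F -f - seqs.fa)

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
```

The patterns stay fixed strings, so `-F` still applies and no regex escaping is needed.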