I have file 1:

sample_1    group_1
sample_2    group_1
sample_3    group_1
sample_4    group_2
sample_5    group_2
sample_6    group_2
sample_7    group_3
sample_8    group_3
sample_9    group_3

and file 2:

sample_8    group_3.1
sample_9    group_3.1

I want to replace column 2 of file 1 with column 2 of file 2 wherever column 1 matches, so the result is:

sample_1    group_1
sample_2    group_1
sample_3    group_1
sample_4    group_2
sample_5    group_2
sample_6    group_2
sample_7    group_3
sample_8    group_3.1
sample_9    group_3.1

The nearest I have got is a left join:

join -a1 -j 1 -o 1.1,1.2,2.2 <(sort -k1 file_1) <(sort -k1 file_2)

which gives me:

sample_1 group_1 
sample_2 group_1 
sample_3 group_1 
sample_4 group_2 
sample_5 group_2 
sample_6 group_2 
sample_7 group_3 
sample_8 group_3 group_3.1
sample_9 group_3 group_3.1

I then thought I could simply drop the second column if unmatched rows repeated the file 1 value in the third column, but of course they don't: the third column is empty for those rows.

  • Ok, thank you. So where can I appeal duplications? (I'm guessing in the comments here?) I'd be interested to know why you thought it was a duplication, then perhaps I can adapt the code in the other question. Commented Oct 13, 2020 at 20:33

2 Answers

Here is a way with awk

awk 'FNR==NR {a[$1] = $0; next} ($1 in a) {$0 = a[$1]} 1' file2 file1
sample_1    group_1
sample_2    group_1
sample_3    group_1
sample_4    group_2
sample_5    group_2
sample_6    group_2
sample_7    group_3
sample_8    group_3.1
sample_9    group_3.1

FNR==NR {...; next} is a standard idiom: the block runs only while awk reads the first input file (file2 here). In it we store the whole line in an array, keyed by the first field: a[$1]=$0

The remaining rules run for the second input file, file1. ($1 in a) is a condition that is true if the first field exists as a key in the array; {$0=a[$1]} then replaces the current line with the line saved from file2. The 1 at the end is an always-true pattern, so every line is printed.
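If file1 had more columns than the two shown in the question (an assumption; the extra_a/extra_b column below is made up for illustration), replacing the whole line would drop them. A minimal sketch of a variant that overwrites only the second field:

```shell
#!/bin/sh
# Hypothetical inputs with a third column that must survive the replacement.
tmp=$(mktemp -d)
printf '%s\n' 'sample_7 group_3 extra_a' 'sample_8 group_3 extra_b' > "$tmp/file1"
printf '%s\n' 'sample_8 group_3.1' > "$tmp/file2"

result=$(awk '
    FNR == NR { new[$1] = $2; next }   # first file (file2): remember the new group per sample
    ($1 in new) { $2 = new[$1] }       # second file (file1): overwrite only field 2
    1                                  # print every line of file1
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Note that assigning to $2 makes awk rebuild the line with the output field separator, so runs of spaces or tabs on the modified lines become single spaces; untouched lines keep their original spacing.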


With join.

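If you want to use join, first get the lines of file1 that have no match in file2 (-v1; your -a1 would also print the joined lines), then print the matching lines from the second file, joined on the first field. Finally, sort the combined output again. join expects both inputs to be sorted on the join field, which the sample files already are. Here it is with command grouping: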

(
    join -v1 -j1 file1 file2
    join -j1 -o 2.1,2.2 file1 file2
) | sort
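If the inputs are not already sorted, something like the following sketch may help. It assumes small made-up inputs in a scratch directory, and it pins LC_ALL=C so that sort and join agree on the ordering:

```shell
#!/bin/sh
# Hypothetical unsorted inputs.
tmp=$(mktemp -d)
printf '%s\n' 'sample_9 group_3' 'sample_7 group_3' 'sample_8 group_3' > "$tmp/file1"
printf '%s\n' 'sample_9 group_3.1' 'sample_8 group_3.1' > "$tmp/file2"

# join needs both inputs sorted on the join field.
LC_ALL=C sort -k1,1 "$tmp/file1" > "$tmp/f1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/f2.sorted"

result=$(
    {
        # Lines of file1 with no partner in file2, unchanged.
        LC_ALL=C join -v1 -j 1 "$tmp/f1.sorted" "$tmp/f2.sorted"
        # For matching keys, print the key and group taken from file2.
        LC_ALL=C join -j 1 -o 2.1,2.2 "$tmp/f1.sorted" "$tmp/f2.sorted"
    } | LC_ALL=C sort -k1,1
)

printf '%s\n' "$result"
rm -rf "$tmp"
```

join writes its output with single spaces between fields, so the original column spacing is not preserved.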

3 Comments

@Luther_Blissett I marked the previous one as a duplicate, because various small changes on the above command can do different things. You can use field 1,2,3 or the whole line, replace when field exists or line exists etc. I had no intention to make it difficult for you to find answers for your queries, but there are too many almost duplicates to this same syntax for joining, merging, excluding lines, that's why I also add this answer. You can see that the only thing modified here is the condition (and the numbers of fields) comparing to the linked post.
I also added a solution with join commands. I hope they are helpful and if the linked post was not a duplicate, then accept my apologies, if it was, then never mind also. Remember to use the same question, post a comment there or edit it, it will certainly get the attention of the people and it will get some help.
BTW I see in your question history a lot of white colour, where is the green? Maybe you have to upvote and/or accept any answers in your previous questions that were helpful for you. Cheers.

You can do this with an awk script as per the following transcript:

pax:~> cat file1
sample_1 group_1
sample_2 group_1
sample_3 group_1
sample_4 group_2
sample_5 group_2
sample_6 group_2
sample_7 group_3
sample_8 group_3
sample_9 group_3

pax:~> cat file2
sample_8 group_3.1
sample_9 group_3.1

pax:~> awk -f prog.awk file2 file1
sample_1 group_1
sample_2 group_1
sample_3 group_1
sample_4 group_2
sample_5 group_2
sample_6 group_2
sample_7 group_3
sample_8 group_3.1
sample_9 group_3.1

The actual awk script is shown below:

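# First file (file2): remember the replacement group for each sample.
NR == FNR { lookup[$1] = $2; next }
# Later files: if the sample has a replacement, print it with the new group.
NR != FNR && ($1 in lookup) { print $1" "lookup[$1]; next }
# Otherwise print the line unchanged.
{ print }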

The first line collects all the lookup values from the first file given (file2). The NR == FNR trick compares the line number within the current input file (FNR) with the line number across all input files (NR); these are equal only while reading the first file.

The second line is for subsequent files due to NR and FNR being different. It also checks that the lookup exists. If both those conditions are met, it will output the adjusted input line with the lookup value rather than the original.

The third line just echoes input lines where the lookup doesn't exist.
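To see the NR and FNR behaviour described above, here is a small sketch that prints both counters for two throwaway files (the names first and second are made up for the demonstration):

```shell
#!/bin/sh
# Two hypothetical input files with 2 and 3 lines.
tmp=$(mktemp -d)
printf '%s\n' a b > "$tmp/first"
printf '%s\n' c d e > "$tmp/second"

# FILENAME is the current file, NR counts lines overall, FNR restarts per file.
result=$(cd "$tmp" && awk '{ print FILENAME, NR, FNR }' first second)

printf '%s\n' "$result"
rm -rf "$tmp"
```

NR and FNR match on the lines of first and diverge as soon as second starts, which is what the NR == FNR test relies on. One caveat: if the first file is empty, the counters also match while awk reads the second file.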
