
I am trying to merge the contents of multiple files by matching on a key with awk. I have seen solutions for two input files, but not for more. The input files look like this:

file1

1#a1
2#b1
3#c1
4#d1
6#f1

file2

1#a2
2#b2
3#c2
5#e2
6#f2

file3

1#a3#extra_field_1
2#b3#extra_field_2
3#c3#extra_field_3
4#d3#extra_field_4
5#e3#extra_field_5

The desired output is the following:

output

a1;a2;a3;extra_field_1
b1;b2;b3;extra_field_2
c1;c2;c3;extra_field_3
d1;;d3;extra_field_4
;e2;e3;extra_field_5

For this, I am using a bash script built around an awk command like the following:

$ awk -v OFS=';' -F '#' 'FNR==NR{a[$1]=$2;next} FNR!=NR{b[$1]=$2;next} NF==3{print a[$1],b[$1],$2,$3}' file1 file2 file3 > output

However, it seems to ignore some of the inputs, because it doesn't produce any output. Any ideas?

Thanks.

4 Answers

2

You could do that using just the join command (the inputs must already be sorted on the join field, as they are here):

join -t\# file1 file2 -j 1 |\
    join -t\# - file3 -j 1 |\
    cut -d\# --output-delimiter=\; -f2-5

Outputs

a1;a2;a3;extra_field_1
b1;b2;b3;extra_field_2
c1;c2;c3;extra_field_3
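
The pipeline above keeps only the keys present in all three files, so the rows for keys 4 and 5 from the desired output are missing. As a minimal sketch, assuming the goal is to keep every line of file3 and leave empty fields where file1 or file2 has no match, join's `-a` and `-o` options can be combined as follows. The scratch directory and the name `merged.txt` are only illustrative, and `tr` is used here instead of the `--output-delimiter` option of cut.

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '1#a1' '2#b1' '3#c1' '4#d1' '6#f1' > file1
printf '%s\n' '1#a2' '2#b2' '3#c2' '5#e2' '6#f2' > file2
printf '%s\n' '1#a3#extra_field_1' '2#b3#extra_field_2' \
    '3#c3#extra_field_3' '4#d3#extra_field_4' '5#e3#extra_field_5' > file3

# First join: -a 1 -a 2 also prints unpaired lines from either file, and
# -o fixes the output columns (0 is the join key), so a missing value
# becomes an empty field instead of shifting the columns.
# Second join: -a 2 keeps every file3 line; keys that are absent from
# file3 (key 6 here) are dropped because -a 1 is not given.
# tr then swaps the separator; this assumes the data contains no other '#'.
join -t '#' -a 1 -a 2 -o 0,1.2,2.2 file1 file2 |
    join -t '#' -a 2 -o 1.2,1.3,2.2,2.3 - file3 |
    tr '#' ';' > merged.txt

cat merged.txt
```

With the sample files this should print the five lines of the desired output, including `d1;;d3;extra_field_4` and `;e2;e3;extra_field_5`.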

1 Comment

This could be a nice approach: simplify the inputs with join first. Thanks for the tip. That said, it's hard to express more complex logic with this command.
1

Here's one in awk. It doesn't take missing data into consideration, as you did not state in the question how it should be handled. It collects all the data in an array keyed on the first field and prints it in the END block:

$ awk '
BEGIN { FS="#"; OFS=";" }
{
    for(i=2;i<=NF;i++)
        a[$1]=a[$1] (a[$1]==""?"":OFS) $i
}
END {
    for(i in a)
        print a[i]
}' f1 f2 f3
a1;a2;a3;extra_field_1
b1;b2;b3;extra_field_2
c1;c2;c3;extra_field_3

2 Comments

I guess my example could have been more exhaustive. In fact, the desired output would gather each record of the third file, adding every 2nd field of the other two files with matching keys.
Sure. As there is an infinite number of possible questions, you tend to get better results if you provide us with the facts instead of leaving us to guess which one is yours.
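
The rule described in the first comment (print each record of the third file, prefixed with the second field of the matching records from the first two files) can be sketched in plain awk. In the command from the question, `FNR!=NR` is true for every line of both file2 and file3, so `next` always fires and the printing rule is never reached. The sketch below instead counts input files with a counter, here called `fileno`, which is a name chosen for illustration. It assumes none of the files is empty; the scratch directory and `output.txt` are likewise illustrative.

```shell
#!/bin/sh
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' '1#a1' '2#b1' '3#c1' '4#d1' '6#f1' > file1
printf '%s\n' '1#a2' '2#b2' '3#c2' '5#e2' '6#f2' > file2
printf '%s\n' '1#a3#extra_field_1' '2#b3#extra_field_2' \
    '3#c3#extra_field_3' '4#d3#extra_field_4' '5#e3#extra_field_5' > file3

awk -F '#' -v OFS=';' '
    FNR == 1    { fileno++ }               # first line of each input file
    fileno == 1 { a[$1] = $2; next }       # remember field 2 of file1 by key
    fileno == 2 { b[$1] = $2; next }       # remember field 2 of file2 by key
    { print a[$1], b[$1], $2, $3 }         # file3: unknown keys print as empty
' file1 file2 file3 > output.txt

cat output.txt
```

With the sample files this should reproduce the desired output from the question, with empty fields where file1 or file2 has no matching key.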
1

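One more way using paste and awk. Note that paste pairs lines by position, not by key, so this only gives correct results when every file has the same keys in the same order: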

paste -d"#" file1 file2 file3 | awk -F"#" '{print $2,$4,$6,$7}' OFS=";"


0

This solution merges two or more files and fills missing/blank fields with "NA" (requires GNU awk):

awk 'BEGIN {
        FS = OFS = "#"
        PROCINFO["sorted_in"] = "@val_str_asc"
}

FNR == 1 {
        filecount++
        numfields[filecount] = NF
        if (NR == 1) {
                a = split($0, header, FS)
        } else {
                for (i = 2; i <= NF; i++) {
                        header[++a] = $i
                }
        }
}

FNR > 1 {
        for (j = 2; j <= NF; j++) {
                b[$1][filecount, j] = $j
        }
}

END {
        for (k = 1; k <= length(header); k++) {
                printf "%s%s", header[k], ((k < length(header)) ? OFS : ORS)
        }
        for (l in b) {
                printf "%s", l OFS
                for (m = 1; m <= filecount; m++) {
                        for (n = 2; n <= numfields[m]; n++) {
                                printf "%s%s",
                                (b[l][m, n] == "" ? "NA" : b[l][m, n]),
                                ((m + n < filecount + numfields[m]) ? OFS : ORS)
                        }
                }
        }
}' file*
1#a1#a2#a3#extra_field_1
2#b1#b2#b3#extra_field_2
3#c1#c2#c3#extra_field_3
4#d1#NA#d3#extra_field_4
5#NA#e2#e3#extra_field_5
6#f1#f2#NA#NA

Different example data:

head file*
==> file1 <==
ID,Value
A1,10
A2,20
A3,30
A4,40

==> file2 <==
ID,Score,Extra
A2,200,True
A1,100,False

==> file3 <==
ID,Evaluation
A1,Correct
A3,Incorrect

==> file4 <==
ID,Value1,Value2,Value3,Value4
A1,,1,1
A2,3,3,3,3

awk 'BEGIN {
        FS = OFS = ","
        PROCINFO["sorted_in"] = "@val_str_asc"
}

FNR == 1 {
        filecount++
        numfields[filecount] = NF
        if (NR == 1) {
                a = split($0, header, FS)
        } else {
                for (i = 2; i <= NF; i++) {
                        header[++a] = $i
                }
        }
}

FNR > 1 {
        for (j = 2; j <= NF; j++) {
                b[$1][filecount, j] = $j
        }
}

END {
        for (k = 1; k <= length(header); k++) {
                printf "%s%s", header[k], ((k < length(header)) ? OFS : ORS)
        }
        for (l in b) {
                printf "%s", l OFS
                for (m = 1; m <= filecount; m++) {
                        for (n = 2; n <= numfields[m]; n++) {
                                printf "%s%s",
                                (b[l][m, n] == "" ? "NA" : b[l][m, n]),
                                ((m + n < filecount + numfields[m]) ? OFS : ORS)
                        }
                }
        }
}' file1 file2 file3 file4
ID,Value,Score,Extra,Evaluation,Value1,Value2,Value3,Value4
A1,10,100,False,Correct,NA,1,1,NA
A2,20,200,True,NA,3,3,3,3
A3,30,NA,NA,Incorrect,NA,NA,NA,NA
A4,40,NA,NA,NA,NA,NA,NA,NA
