One could also use awk for this; say you have:
$ cat a.csv
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
20,pname20,bname20,300
$ cat b.csv
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
With the "FNR==NR" approach (see e.g. > The Unix shell: comparing two files with awk):
$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0;next}}($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}' a.csv b.csv
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
With reading each file into an array (see e.g. Awking it – how to load a file into an array in awk | Tapping away):
$ awk -F, 'BEGIN{while(getline < "a.csv"){if(!/^#/){a[$1]=$0;}}close("a.csv");while(getline < "b.csv"){if($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}}close("b.csv");}'
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300
In essence, the two approaches do the same thing:
- read the first file (
a.csv), and store its lines in an associative array a, keyed/indexed by the first field $1 of that line (in this case, product_id);
- then read the second file (
b.csv); and if the first field of each of its lines is found in the array a; then output the first four fields of the current line of b.csv; and the fourth field (price) from the corresponding entry in array a
The difference is, that with the FNR==NR approach, specify the input files on the command line as arguments to awk, and basically you can only identify the first file as "special" so you can store it as an array; with the second approach, each input file could be parsed in a separate array - however, the input files are specified in the awk script itself, not in the arguments to awk - and since then you don't even need to use arguments to awk, the entirety of the awk script needs to happen within a BEGIN{...} block.
When lines are being read from the files, they are automatically split in fields according to -F, command line options, which sets comma as delimiter; however, when retrieving lines stored in the array, we have to split() them separately
Breakdown for the first:
FNR==NR # if FNR (input record number in the current input file) equals NR (total num records so far)
# only true when the first file is being read
{
if(!/^#/) # if the current line does not `!` match regex `/.../` of start `^` with `#`
{
a[$1]=$0; # assign current line `$0` to array `a`, with index/key being first field in current line `$1`
next # skip the rest, and start processing next line
}
}
# --this section below executes when FNR does not equal NR;--
($1 in a) # first, check if first field `$1` of current line is in array `a`
{
split(a[$1],tmp,","); # split entry `a[$1]` at commas into array `tmp`
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # print reconstructed current line,
# taking the fourth field from the `tmp` array
}
Breakdown for the second:
BEGIN{ # since no file arguments here, everything goes in BEGIN block
while(getline < "a.csv"){ # while reading lines from first file
if(!/^#/){ # if the current line does not `!` match regex `/.../` of start `^` with `#`
a[$1]=$0; # store current line `$0` to array `a`, with index/key being first field in current line `$1`
}
}
close("a.csv");
while(getline < "b.csv"){ # while reading lines from second file
if($1 in a){ # first, check if first field `$1` of current line is in array `a`
split(a[$1],tmp,","); # (same as above)
printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # (same as above)
}
}
close("b.csv");
} # end BEGIN
Note about the execution with FNR==NR:
$ awk -F, 'FNR==NR{print "-";} (1){print;}' a.csv b.csv # or:
$ awk -F, 'FNR==NR{print "-";} {print;}' a.csv b.csv
-
#product_id,product_name,brand_name,price
-
1,pname1,bname1,100
-
10,pname10,bname10,200
-
20,pname20,bname20,300
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
$ awk -F, 'FNR==NR{print "-";} FNR!=NR{print;}' a.csv b.csv
-
-
-
-
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420
That means that the "this section below executes when FNR does not equal NR;" comment above is in principle wrong - even if that is how that particular example ends up behaving.