
I have two CSV files and I'm looking for a way to compare them using a specific column; once a match is found, I need to take the value of another column from the matched row and put it into the corresponding column of the other file's record.

I'll try to explain a little bit more.

One CSV has product_id,product_name,brand_name,price; the other has product_id,product_category,product_name,brand_name,price.

I need to compare the two files by finding the rows that have a matching product_id value; once found, I need to take the price value from file 1 and put it into the matched record's price in file 2.

After extensive research I've come to the conclusion that this may be achievable with PowerShell.

Does anyone have any ideas about how I could do that? Thank you for your time.

2 Comments
  • Do you need to automate this, or is it just a one-time action? Commented Jul 8, 2013 at 12:37
  • Just once. I'm going to do it again in the future, but manually. Commented Jul 8, 2013 at 12:40

2 Answers


Since it's just a one-time action, you could open the CSV files in a spreadsheet (Google Docs, Excel, ...) and do a VLOOKUP. It's easy:

To demonstrate this, imagine the following spreadsheet where both CSV files are side by side: the first in columns A to B and the second in columns D to F.

  |    A       |   B   | C |      D     |         E        |   F  
--+------------+-------+---+------------+------------------+-------
1 | product_id | price |   | product_id | product_category | price
2 |          1 |  29.9 |   |          2 |       SOME CAT 1 | =IFERROR(VLOOKUP(D2;A:B;2;FALSE); "NULL")
3 |          2 |  35.5 |   |          3 |       SOME CAT 2 | =IFERROR(VLOOKUP(D3;A:B;2;FALSE); "NULL")

The VLOOKUP function will search for an exact match of the value of cell D2 in the first column of the region A:B and return the value from the second column of that region. The IFERROR will return "NULL" if the VLOOKUP fails (i.e. no match is found).

So in this case, the formula in cell F2 will look for the product_id "2" (cell D2) in column A. It finds product_id "2" in row 3 and returns the price "35.5" (the value from the second column of the range A:B). After all rows have been calculated, the result will be:

  |    A       |   B   | C |      D     |         E        |   F  
--+------------+-------+---+------------+------------------+-------
1 | product_id | price |   | product_id | product_category | price
2 |          1 |  29.9 |   |          2 |       SOME CAT 1 | 35.5
3 |          2 |  35.5 |   |          3 |       SOME CAT 2 | NULL

2 Comments

Very interesting, going to try it right now and I'll be back to report. Thank you for such a descriptive answer; it must have taken you some time to write it. I really appreciate it.
Hello again. That worked like a charm, I can't thank you enough.

One could also use awk for this; say you have:

$ cat a.csv 
#product_id,product_name,brand_name,price
1,pname1,bname1,100
10,pname10,bname10,200
20,pname20,bname20,300

$ cat b.csv 
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420

With the "FNR==NR" approach (see e.g. > The Unix shell: comparing two files with awk):

$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0;next}}($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}' a.csv b.csv 
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300

With reading the file(s) into an array via getline (see e.g. Awking it – how to load a file into an array in awk | Tapping away):

$ awk -F, 'BEGIN{while(getline < "a.csv"){if(!/^#/){a[$1]=$0;}}close("a.csv");while(getline < "b.csv"){if($1 in a){split(a[$1],tmp,",");printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];}}close("b.csv");}' 
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300

In essence, the two approaches do the same thing:

  • read the first file (a.csv), and store its lines in an associative array a, keyed/indexed by the first field $1 of that line (in this case, product_id);
  • then read the second file (b.csv); if the first field of one of its lines is found in the array a, output the first four fields of that line of b.csv, followed by the fourth field (price) from the corresponding entry in array a.

The difference is that with the FNR==NR approach, you specify the input files on the command line as arguments to awk, and basically you can only single out the first file as "special" so that you can store it in an array; with the second approach, each input file could be parsed into its own array (a sketch of that variant follows below). However, the input files are then named in the awk script itself, not passed as arguments to awk; and since there are no file arguments at all, the entire awk script has to run inside a BEGIN{...} block.
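
For illustration, here is a minimal sketch of that two-array variant (not part of the original answer; it assumes the same a.csv and b.csv as above). Both files are loaded into memory first, keyed by product_id, and only then are the matching rows printed in the order of b.csv:

$ awk -F, 'BEGIN{
    # read file 1 (a.csv) into array a, keyed by product_id
    while((getline < "a.csv") > 0) if(!/^#/) a[$1]=$0;
    close("a.csv");
    # read file 2 (b.csv) into array b, also keyed by product_id, remembering the key order
    while((getline < "b.csv") > 0) if(!/^#/) { b[$1]=$0; order[++n]=$1; }
    close("b.csv");
    # walk file 2 in its original order and replace its price with the price from file 1
    for(i=1;i<=n;i++){
      id=order[i];
      if(id in a){ split(a[id],ta,","); split(b[id],tb,",");
                   printf "%s,%s,%s,%s,%s\n",tb[1],tb[2],tb[3],tb[4],ta[4]; }
    }
  }'
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300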

When lines are being read from the files, they are automatically split into fields according to the -F, command line option, which sets comma as the delimiter; however, when retrieving lines stored in the array, we have to split() them ourselves, because the array holds each line as a single, unsplit string (a variation that avoids this is sketched below).
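
As an aside (this variation is not in the original answer), the extra split() can be avoided entirely by storing only the field we actually need (the price, $4 of a.csv) in the array instead of the whole line:

$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$4;next}}($1 in a){printf "%s,%s,%s,%s,%s\n",$1,$2,$3,$4,a[$1];}' a.csv b.csv 
10,pcat10,pname10,bname10,200
20,pcat20,pname20,bname20,300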

Breakdown for the first:

FNR==NR    # if FNR (input record number in the current input file) equals NR (total num records so far)
           # only true when the first file is being read 
{
  if(!/^#/)  # if the current line does not (`!`) match the regex `/^#/`, i.e. does not start (`^`) with `#`
  {
     a[$1]=$0; # assign current line `$0` to array `a`, with index/key being first field in current line `$1`
     next      # skip the rest, and start processing next line
  }
}
               # --this section below executes when FNR does not equal NR;--  
($1 in a)                                      # first, check if first field `$1` of current line is in array `a`
{
  split(a[$1],tmp,",");                          # split entry `a[$1]` at commas into array `tmp`
  printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4];  # print reconstructed current line, 
                                                 # taking the fourth field from the `tmp` array
}  

Breakdown for the second:

BEGIN{ # since no file arguments here, everything goes in BEGIN block
  while(getline < "a.csv"){  # while reading lines from first file 
    if(!/^#/){               # if the current line does not (`!`) match the regex `/^#/`, i.e. does not start (`^`) with `#`
      a[$1]=$0;              # store current line `$0` to array `a`, with index/key being first field in current line `$1`
    }
  }
  close("a.csv");
  while(getline < "b.csv"){  # while reading lines from second file 
    if($1 in a){                                 # first, check if first field `$1` of current line is in array `a`
      split(a[$1],tmp,",");                         # (same as above)
      printf "%d,%s,%s,%s,%d\n",$1,$2,$3,$4,tmp[4]; # (same as above)
    } 
  }
  close("b.csv");
} # end BEGIN

Note about the execution with FNR==NR:

$ awk -F, 'FNR==NR{print "-";} (1){print;}' a.csv b.csv # or:
$ awk -F, 'FNR==NR{print "-";} {print;}' a.csv b.csv 
-
#product_id,product_name,brand_name,price
-
1,pname1,bname1,100
-
10,pname10,bname10,200
-
20,pname20,bname20,300
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420

$ awk -F, 'FNR==NR{print "-";} FNR!=NR{print;}' a.csv b.csv 
-
-
-
-
#product_id,product_category,product_name,brand_name,price
3,pcat3,pname3,bname3,42
10,pcat10,pname10,bname10,199
20,pcat20,pname20,bname20,299
30,pcat10,pname30,bname30,420

That means that the "this section below executes when FNR does not equal NR" comment above is in principle wrong: the ($1 in a) pattern is evaluated for every input line, from both files, and it is only the next statement in the first block that keeps the lines of a.csv from reaching it; the particular example above just happens to behave as if the comment were true. The sketch below shows what happens once next is removed.
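
To make that concrete, here is a small sketch (not part of the original answer) with the next removed; lines from a.csv now also reach the second block, because by the time ($1 in a) is evaluated, their own key has just been stored in a:

$ awk -F, 'FNR==NR{if(!/^#/){a[$1]=$0}}($1 in a){print FILENAME": "$0}' a.csv b.csv 
a.csv: 1,pname1,bname1,100
a.csv: 10,pname10,bname10,200
a.csv: 20,pname20,bname20,300
b.csv: 10,pcat10,pname10,bname10,199
b.csv: 20,pcat20,pname20,bname20,299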
