
I have two CSV files that share similar headers. sample_scv_1.csv is:

Transaction_date,Name,Payment_Type,Product
1/2/09 6:17,NA,Mastercard,NA
1/2/09 4:53,NA,Visa,NA
1/2/09 13:08,Nick,Mastercard,NA
1/3/09 14:44,Larry,Visa,Goods
1/4/09 12:56,Tina,Visa,Services
1/4/09 13:19,Harry,Visa,Goods

Similarly, sample_scv_2.csv is:

Transaction_date,Product,Name
1/2/09 6:17,Goods,Janis
1/2/09 4:53,Services,Nicola
1/2/09 13:08,Materials,Asuman

In these two files the columns/fields Transaction_date, Product, and Name are common, and I want to replace the Product and Name fields in sample_scv_1.csv iff the transaction date matches in both files.

This is a toy example; my real file is big. For this example I can split off the rows whose dates match and replace the columns by index using csvtool:

head -4 sample_scv_1.csv > temp1.csv    # header + the 3 rows whose dates match sample_scv_2.csv
tail -3 sample_scv_1.csv > temp1_1.csv  # the remaining rows
#sudo apt-get install csvtool
csvtool pastecol 2,4 3,2 temp1.csv sample_scv_2.csv > temp1_2.txt  # paste cols 3,2 of file2 into cols 2,4 of file1
cat temp1_2.txt temp1_1.csv > sample_scv_1.csv

My required output is:

Transaction_date,Name,Payment_Type,Product
1/2/09 6:17,Janis,Mastercard,Goods
1/2/09 4:53,Nicola,Visa,Services
1/2/09 13:08,Asuman,Mastercard,Materials
1/3/09 14:44,Larry,Visa,Goods
1/4/09 12:56,Tina,Visa,Services
1/4/09 13:19,Harry,Visa,Goods

I can determine up to which line the transaction dates match, but I cannot know the indexes where the two columns overlap (like Name and Product in the first file). One issue is easy: all columns of sample_scv_2.csv will be present in sample_scv_1.csv. Is there any way to do this efficiently?

  • Please let us know what you have tried. Most of us here are happy to help you improve your craft, but are less happy acting as short order unpaid programming staff. Show us your work so far in an MCVE, the result you were expecting and the results you got from the attempt you made to solve this yourself, and we'll help you figure it out. Commented Oct 7, 2016 at 2:04
  • "your file is big": how big? Commented Oct 7, 2016 at 2:57
  • @ghoti : Thanks. However, I have shown an example of what I have tried with the csvtool above. I haven't mentioned others for brevity. Commented Oct 7, 2016 at 3:21
  • @JamesBrown : My data has around 350 columns and 500k rows. Commented Oct 7, 2016 at 3:22
  • Both files are the same size? Commented Oct 7, 2016 at 3:22

1 Answer


Since the file with fewer columns is small enough to fit in memory, here is a solution in awk:

$ cat program.awk
BEGIN {FS=OFS=","}         # set the file separators
NR==FNR {                  # for the first file
    p[$1]=$2               # store the product, use date as key
    n[$1]=$3               # name
    next                   # no more processing for the first file
} 
$1 in p {                  # if date found in first processed file
    if($2=="NA") $2=n[$1]  # replace NA with name
    if($4=="NA") $4=p[$1]  # replace NA with product
} 1                        # print the record

Run it:

awk -f program.awk file2 file1
Transaction_date,Name,Payment_Type,Product
1/2/09 6:17,Janis,Mastercard,Goods
1/2/09 4:53,Nicola,Visa,Services
1/2/09 13:08,Nick,Mastercard,Materials
1/3/09 14:44,Larry,Visa,Goods
1/4/09 12:56,Tina,Visa,Services
1/4/09 13:19,Harry,Visa,Goods

2 Comments

Thanks! I want to replace everything, not only the cases where there are NAs. Also, the solution assumes we know the indexes where replacements are needed, but that is a little difficult in a file with 350 columns. Can it be generalized?
You can store every value from file2 in memory and use that to replace fields in file1. You need to know the indexes to match the records; there has to be something to compare. I mostly use awk, magic not so often.
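The generalization suggested in the comment above can be sketched by matching columns by *header name* instead of by position: read file2's header once, key every value by date plus column name, then replace any field of file1 whose column name and date both appear in file2. This is a minimal sketch, assuming both files have a header row, Transaction_date is the first column of each, and no field contains an embedded comma (the here-docs just recreate the sample data):

```shell
# Recreate the sample inputs
cat > sample_scv_1.csv <<'EOF'
Transaction_date,Name,Payment_Type,Product
1/2/09 6:17,NA,Mastercard,NA
1/2/09 4:53,NA,Visa,NA
1/2/09 13:08,Nick,Mastercard,NA
1/3/09 14:44,Larry,Visa,Goods
1/4/09 12:56,Tina,Visa,Services
1/4/09 13:19,Harry,Visa,Goods
EOF
cat > sample_scv_2.csv <<'EOF'
Transaction_date,Product,Name
1/2/09 6:17,Goods,Janis
1/2/09 4:53,Services,Nicola
1/2/09 13:08,Materials,Asuman
EOF

awk -F, -v OFS=, '
NR==FNR {                                    # first file: sample_scv_2.csv
    if (FNR==1) { for (i=2; i<=NF; i++) name[i]=$i; next }  # remember its column names
    for (i=2; i<=NF; i++) val[$1,name[i]]=$i # value keyed by (date, column name)
    next
}
FNR==1 { for (i=2; i<=NF; i++) col[i]=$i }   # file1 header: remember column names
{
    for (i=2; i<=NF; i++)                    # replace every field file2 has a value for
        if (($1,col[i]) in val) $i=val[$1,col[i]]
    print
}' sample_scv_2.csv sample_scv_1.csv
```

For the sample data above this reproduces the required output, and because nothing is hard-coded except the key column, it scales to 350 columns unchanged; only sample_scv_2.csv has to fit in memory.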
