I need help updating a huge CSV file with 3.5 million records. I need to replace the 3rd column with its mapped value from another file.

I tried reading the file and updating the 3rd column by searching for the pattern in the mapping file, but since the actual file has 3.5 million records and the mapping file has ~1 million records, it seems to run forever.

E.g.

Actual file:

123,123abc,456_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,5435_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4

Mapping File:

456_def,24_def
5435_defg,48_defg

Output expected:

123,123abc,24_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,48_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4

Comment (Mar 2, 2017 at 12:30): Updated the question.

2 Answers

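This is pretty straightforward in Awk. Load the mapping file into an associative array first, then replace the 3rd column of each row with a single array lookup instead of searching the mapping file for every record: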

awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' mapFile actualFile

This produces the output you need:

123,123abc,24_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,48_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4
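
To try the command on the sample data from the question, a minimal sketch could look like the following. The names mapFile and actualFile match the command above; the scratch directory and the output name updated.csv are assumptions made for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so no existing file is touched (assumption).
workdir=$(mktemp -d)

# Mapping file from the question: old value,new value
cat > "$workdir/mapFile" <<'EOF'
456_def,24_def
5435_defg,48_defg
EOF

# Data file from the question: the 3rd column holds the value to replace
cat > "$workdir/actualFile" <<'EOF'
123,123abc,456_def,456_def_ble,adsf,adsafdsa,123234,45645,435,12,42,afda,3435,wfg,34,345,sergf,5t4
234,234abc,5435_defg,345_def_ble,3adsaff,asdfgdsa,165434,456,435,12,42,afda,3435,wfg,34,345,sergf,5t4
EOF

# First file fills the lookup table, second file is rewritten row by row.
# awk prints to standard output, so redirect it to a new file.
awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' \
    "$workdir/mapFile" "$workdir/actualFile" > "$workdir/updated.csv"

cat "$workdir/updated.csv"
```

The order of the two file arguments matters: the mapping file has to come first so the array is complete before the data rows are read.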

To speed things up, you can set the locale to C for this command.

Simply put, the C locale falls back to the system's base Unix/Linux behavior, in which every character is a single byte (in practice ASCII, whose character set has only 128 characters). By default your locale is probably internationalized and set to UTF-8, a variable-length encoding that can represent every character in Unicode (more than 110,000 characters), so splitting and comparing strings takes more work. If your data is plain ASCII, the result is the same either way, and the C locale can be noticeably faster. So just run:

LC_ALL=C awk 'BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1' mapFile actualFile
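
How much the C locale helps depends on the awk implementation and on the data, so it is worth measuring on your own machine. The following is a rough sketch only: the generated rows, the row counts, and the file names are assumptions, and `date +%s` gives just one-second resolution (your shell's `time` is a finer alternative if it is available).

```shell
#!/bin/sh
# Generate synthetic pure-ASCII test data (sizes and key format are assumptions).
dir=$(mktemp -d)
awk 'BEGIN{for(i=1;i<=100000;i++) print "k" i "_def,v" i "_def"}' > "$dir/mapFile"
awk 'BEGIN{for(i=1;i<=300000;i++) print i "," i "abc,k" (i%150000+1) "_def,rest"}' > "$dir/actualFile"

# Same program as in the answer, kept in a variable so both runs are identical.
prog='BEGIN{FS=OFS=","}FNR==NR{hash[$1]=$2; next}$3 in hash{$3=hash[$3]}1'

start=$(date +%s)
awk "$prog" "$dir/mapFile" "$dir/actualFile" > "$dir/out_default.csv"
mid=$(date +%s)
LC_ALL=C awk "$prog" "$dir/mapFile" "$dir/actualFile" > "$dir/out_c.csv"
end=$(date +%s)

echo "default locale: $((mid - start))s, LC_ALL=C: $((end - mid))s"

# For pure-ASCII input both runs should produce byte-identical output.
cmp "$dir/out_default.csv" "$dir/out_c.csv" && echo "outputs identical"
```

If the files contain non-ASCII text, compare the two outputs like this before relying on the C locale.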

You can use awk for this:

awk 'BEGIN{FS=OFS=","}       # Set input and output field separators to a comma
     NR==FNR{a[$1]=$2;next}  # First file only: store the mapping in the array a
     {if($3 in a) $3=a[$3]}  # If column 3 has a mapping, replace it with the mapped value
     1                       # Print the (possibly modified) line
    ' mapping actualfile
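
Note that awk writes to standard output and leaves actualfile untouched. If the file itself should be updated, one common pattern is to write to a temporary file and rename it only when awk succeeds. A minimal sketch, where the scratch directory, the shortened sample rows, and the .tmp suffix are assumptions for illustration:

```shell
#!/bin/sh
# Scratch directory with shortened, hypothetical sample rows.
dir=$(mktemp -d)
printf '456_def,24_def\n5435_defg,48_defg\n' > "$dir/mapping"
printf '123,123abc,456_def,x\n234,234abc,5435_defg,y\n345,345abc,no_match,z\n' > "$dir/actualfile"

# Write the result to a temporary file, then replace the original
# only if awk exited successfully.
awk 'BEGIN{FS=OFS=","}
     NR==FNR{a[$1]=$2;next}
     {if($3 in a) $3=a[$3]}
     1
    ' "$dir/mapping" "$dir/actualfile" > "$dir/actualfile.tmp" &&
  mv "$dir/actualfile.tmp" "$dir/actualfile"

cat "$dir/actualfile"
```

Rows whose 3rd column has no entry in the mapping file, like the third sample row here, pass through unchanged.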
