In file B find patterns from file A and replace with patterns from file C, line by line

Question

I have a file of patterns (fileA.txt) that need to be searched for in a large file (fileB.txt) and replaced with the corresponding patterns from another file (fileC.txt). Example:

fileB.txt
4472534
8BC4232
3533221
333553D
8645141
2412AAA

I want to search for these patterns in fileB:

fileA.txt
BC423
33221
12AAA

Then I want to replace them with patterns in fileC, line by line:

fileC.txt
66FF7
11GYT
2HHJK

Expected output:

4472534
866FF72
3511GYT
333553D
8645141
242HHJK

I wrote something like this:

grep -f  fileA.txt fileB.txt | xargs sed -i fileC.txt

However, although it finds the patterns correctly, the substitution is probably not correct. Any advice?

fileA (pattern to search)
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
fileB
>AMP_4 RS0255 CENPF__ENST00000366955.7__6322__30__0.43333__69.25__1 RS0247
CAGTTGTGCAATTTGGTTTTCCAGCTCACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__10108__30__0.5__71.1396__1 RS0247
GAAGCCTGCAGCCCTCACTGGAAATAAACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__9236__30__0.5__69.816__1 RS0332
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
>AMP_4 RS0451 CENPF__ENST00000366955.7__8140__30__0.43333__68.033__1RS0255
GAGCTCCTTCAATTGATCTTTGCTGCTCTT
fileC (pattern to replace)
GGAGGATGGTGCCTGAATCTACTGGGCTCC

The patterns I am actually searching for are 30 characters long (letters and digits) and are unique in fileB (already checked).
– Paolo Lorenzini, Commented Feb 12, 2021 at 8:49

RavinderSingh13 · Accepted Answer · 2021-02-12 10:25:44Z

This is a task for awk. Please try the following, written and tested with the shown samples in GNU awk.

awk '
FNR==NR{
  arr[$0]=FNR
  next
}
FILENAME=="fileC.txt"{
  arrVal[++count]=$0
  next
}
FILENAME=="fileB.txt"{
  for(key in arr){
    if(sub(key,arrVal[arr[key]])){
      break
    }
  }
  print
}
' fileA.txt fileC.txt fileB.txt

Output will be as follows.

4472534
866FF72
3511GYT
333553D
8645141
242HHJK

Explanation: a detailed, line-by-line explanation of the above.

awk '                                 ##Start of the awk program.
FNR==NR{                              ##TRUE only while the first file (fileA.txt) is being read.
  arr[$0]=FNR                         ##Store the pattern as the index of arr, with its line number as the value.
  next                                ##Skip the remaining rules for this line.
}
FILENAME=="fileC.txt"{                ##Run this block while fileC.txt is being read.
  arrVal[++count]=$0                  ##Store each replacement in arrVal, indexed by a counter that increases by 1 per line.
  next                                ##Skip the remaining rules for this line.
}
FILENAME=="fileB.txt"{                ##Run this block while fileB.txt is being read.
  for(key in arr){                    ##Loop over every pattern stored in arr.
    if(sub(key,arrVal[arr[key]])){    ##Replace the first match of key in the current line with the replacement from the same line number; sub() returns 1 on success.
      break                           ##Leave the loop after the first successful substitution to save some cycles.
    }
  }
  print                               ##Print the current line, changed or not.
}
' fileA.txt fileC.txt fileB.txt       ##Input file names.

NOTE: Instead of comparing file names, we could also check which command-line argument is currently being read, for example with ARGIND in GNU awk.
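The file name checks in the program above tie it to specific names. As a minimal sketch of one alternative, the variant below identifies the inputs by position, counting files each time `FNR` restarts at 1. The `nfile` counter is a name introduced here, not part of the answer, and the approach assumes that none of the input files is empty (an empty file never triggers `FNR == 1`). The sample files are created in a temporary directory only to keep the sketch self-contained.

```shell
# Sketch: same replacement logic, but inputs are told apart by their position.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' BC423 33221 12AAA > fileA.txt
printf '%s\n' 66FF7 11GYT 2HHJK > fileC.txt
printf '%s\n' 4472534 8BC4232 3533221 333553D 8645141 2412AAA > fileB.txt

result=$(awk '
FNR == 1   { nfile++ }                   # hypothetical counter: 1, 2, 3 for each input file
nfile == 1 { arr[$0] = FNR; next }       # patterns: index is the text, value is the line number
nfile == 2 { arrVal[FNR] = $0; next }    # replacements: index is the line number
{
  for (key in arr)
    if (sub(key, arrVal[arr[key]]))      # replace the first match, then stop looking
      break
  print
}
' fileA.txt fileC.txt fileB.txt)
printf '%s\n' "$result"
```

With this form the files can be renamed freely, as long as they are passed in the order patterns, replacements, data.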

αғsнιη · Accepted Answer · 2021-02-12 10:00:37Z

paste fileA fileC |
awk 'NR==FNR { mapping[$1] = $2; next }
     {
         for (pat in mapping)
             gsub(pat, mapping[pat])
         print
     }' - fileB

1 Comment

Paolo Lorenzini Over a year ago

This works; the replacement has been done correctly.

KamilCuk · Accepted Answer · 2021-02-12 10:04:36Z

You could use sed to generate a sed script that would replace them:

sed "$(paste fileA.txt fileC.txt | sed 's/\(.*\)\t\(.*\)/s@\1@\2@g/')" fileB.txt
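To make the mechanism visible, the sketch below prints the intermediate script before applying it. It uses the sample data from the question; the `tab`, `script` and `result` variables are introduced here for illustration, and a literal tab replaces `\t` so that the inner sed does not depend on GNU escape handling. Note that the generated commands use `@` as the delimiter, so this assumes the patterns contain no `@`, and the patterns are interpreted as regular expressions.

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' BC423 33221 12AAA > fileA.txt
printf '%s\n' 66FF7 11GYT 2HHJK > fileC.txt
printf '%s\n' 4472534 8BC4232 3533221 333553D 8645141 2412AAA > fileB.txt

tab=$(printf '\t')    # literal tab, the default delimiter used by paste
# Turn each "pattern<TAB>replacement" pair into one s@pattern@replacement@g command.
script=$(paste fileA.txt fileC.txt | sed "s/\(.*\)${tab}\(.*\)/s@\1@\2@g/")
printf '%s\n' "$script"
# s@BC423@66FF7@g
# s@33221@11GYT@g
# s@12AAA@2HHJK@g

# Apply the generated commands to fileB.txt.
result=$(sed "$script" fileB.txt)
printf '%s\n' "$result"
```

Each line of fileB.txt is run through every generated command in order, so a replacement made by an earlier command could in principle be matched again by a later one; with the unique patterns described in the question this should not occur.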

1 Comment

potong Over a year ago

Or perhaps paste -d/ fileA fileC|sed 's#.*#s/&/#' |sed -f - fileB?

anubhava · Accepted Answer · 2021-02-12 11:33:15Z

Here is a one-liner with paste + awk + sed (it relies on process substitution, so it needs a shell such as bash):

sed -f <(awk '{printf "s/%s/%s/g\n",$1,$2}' <(paste file{A,C}.txt)) fileB.txt

4472534
866FF72
3511GYT
333553D
8645141
242HHJK

2 Comments

Paolo Lorenzini Over a year ago

This works well, as does the other code. Thank you.

anubhava Over a year ago

May I suggest that you test on a huge fileB to see which solution is faster?
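On a large fileB, the loop-based answers test every pattern against every line. In the real data shown in the question, each pattern appears to occupy an entire line of fileB. Under that assumption (it does not hold for the short numeric example, where the patterns are substrings), one array lookup per line may be enough. A minimal sketch with made-up sequence data, where `map` and `result` are names introduced here:

```shell
dir=$(mktemp -d) && cd "$dir" || exit 1
printf '%s\n' CAAGAT GGTTCC > fileA.txt    # hypothetical patterns
printf '%s\n' GGAGGA TTAACC > fileC.txt    # hypothetical replacements
printf '%s\n' '>seq1' CAGTTG '>seq2' CAAGAT '>seq3' GGTTCC > fileB.txt

result=$(paste fileA.txt fileC.txt | awk '
NR == FNR { map[$1] = $2; next }   # first input (stdin): pattern -> replacement
$0 in map { $0 = map[$0] }         # whole-line match: one array lookup, no regex
{ print }
' - fileB.txt)
printf '%s\n' "$result"
```

Because the lookup is an exact string comparison, regex metacharacters in the patterns are not interpreted. Running each candidate under `time` on the real file would show whether the difference matters in practice.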

potong · Accepted Answer · 2021-02-12 15:41:15Z

This might work for you (GNU sed & parallel):

parallel echo 's/{1}/{2}/' ::::+ file[AC] | sed -f - fileB

Build a sed script and then run the script with fileB as input.

N.B. ::::+ emulates the paste command, and {1} and {2} stand for the values of each line from fileA and fileC.
