I have a file of patterns (fileA.txt) that need to be searched for in a large file (fileB.txt) and replaced with the corresponding patterns from another file (fileC.txt). Example:

fileB.txt
4472534
8BC4232
3533221
333553D
8645141
2412AAA

I want to search for these patterns in fileB:

fileA.txt
BC423
33221
12AAA

Then I want to replace them with patterns in fileC, line by line:

fileC.txt
66FF7
11GYT
2HHJK

Expected output:

4472534
866FF72
3511GYT
333553D
8645141
242HHJK

I wrote something like this:

grep -f  fileA.txt fileB.txt | xargs sed -i fileC.txt

However, although it finds the patterns correctly, the substitution part is probably not correct. Any advice?

fileA (pattern to search):
CAAGATTTTCTTTGCCGAGACTCAGTGGGG

fileB:
>AMP_4 RS0255 CENPF__ENST00000366955.7__6322__30__0.43333__69.25__1 RS0247
CAGTTGTGCAATTTGGTTTTCCAGCTCACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__10108__30__0.5__71.1396__1 RS0247
GAAGCCTGCAGCCCTCACTGGAAATAAACA
>AMP_4 RS0451 CENPF__ENST00000366955.7__9236__30__0.5__69.816__1 RS0332
CAAGATTTTCTTTGCCGAGACTCAGTGGGG
>AMP_4 RS0451 CENPF__ENST00000366955.7__8140__30__0.43333__68.033__1RS0255
GAGCTCCTTCAATTGATCTTTGCTGCTCTT

fileC (replacement pattern):
GGAGGATGGTGCCTGAATCTACTGGGCTCC
  • In reality, the patterns I am searching for are 30 characters long (digits and letters), and each is unique in fileB (already checked). Commented Feb 12, 2021 at 8:49
  • The patterns in fileA and fileC are also unique. Commented Feb 12, 2021 at 8:50

5 Answers

This looks like a task for awk. Could you please try the following, written and tested with the shown samples in GNU awk.

awk '
FNR==NR{
  arr[$0]=FNR
  next
}
FILENAME=="fileC.txt"{
  arrVal[++count]=$0
  next
}
FILENAME=="fileB.txt"{
  for(key in arr){
    if(sub(key,arrVal[arr[key]])){
      break
    }
  }
  print
}
' fileA.txt fileC.txt fileB.txt

Output will be as follows.

4472534
866FF72
3511GYT
333553D
8645141
242HHJK

Explanation: a detailed, line-by-line explanation of the above.

awk '                                 ## Start of the awk program.
FNR==NR{                              ## True only while the first file (fileA.txt) is being read.
  arr[$0]=FNR                         ## Store the pattern as the key and its line number as the value.
  next                                ## Skip the remaining rules for this line.
}
FILENAME=="fileC.txt"{                ## While reading fileC.txt:
  arrVal[++count]=$0                  ## Store each replacement under its line number (count).
  next                                ## Skip the remaining rules for this line.
}
FILENAME=="fileB.txt"{                ## While reading fileB.txt:
  for(key in arr){                    ## Loop over all patterns read from fileA.txt.
    if(sub(key,arrVal[arr[key]])){    ## Replace the first match of key (treated as a regex) with the replacement from the same line number of fileC.txt.
      break                           ## A substitution was made, so leave the loop to save some cycles.
    }
  }
  print                               ## Print the (possibly modified) line.
}
' fileA.txt fileC.txt fileB.txt       ## Input files, in the order they must be read.

NOTE: In GNU awk, the file name checks above could also be replaced with checks on ARGIND (the index in ARGV of the file currently being read).
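One caveat with the program above is that sub() treats each key as a regular expression. The sample patterns contain only letters and digits, so that is harmless here, but for patterns that might contain regex metacharacters a literal-matching variant may be safer. The following is only a minimal sketch under stated assumptions: the temporary directory and the inlined sample data are for illustration, the `map` array name is hypothetical, and the `break` relies on the asker's statement that the patterns are unique in fileB.

```shell
#!/bin/sh
# Sketch: literal (non-regex) replacement with index() and substr().
# Assumption: the temp directory and sample data below are for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 4472534 8BC4232 3533221 333553D 8645141 2412AAA > "$tmp/fileB.txt"
printf '%s\n' BC423 33221 12AAA > "$tmp/fileA.txt"
printf '%s\n' 66FF7 11GYT 2HHJK > "$tmp/fileC.txt"

result=$(
  paste "$tmp/fileA.txt" "$tmp/fileC.txt" |
  awk -F'\t' '
    NR == FNR { map[$1] = $2; next }   # first input (stdin): search string -> replacement
    {
      for (pat in map) {
        pos = index($0, pat)           # literal search, no regex interpretation
        if (pos) {
          # Rebuild the line: text before the match, replacement, text after the match.
          $0 = substr($0, 1, pos - 1) map[pat] substr($0, pos + length(pat))
          break                        # assumes at most one pattern per line
        }
      }
      print
    }
  ' - "$tmp/fileB.txt"
)
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because index() compares plain strings, characters such as `.` or `*` in fileA.txt would be matched as themselves rather than as regex operators.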

paste fileA fileC \
|awk 'NR==FNR{ mapping[$1] =$2; next }
             { for(pat in mapping){ 
                   gsub(pat, mapping[pat])
             };
             print
}' - fileB

1 Comment

This is working; the replacement has been done correctly.

You could use sed to generate a sed script that would replace them:

sed "$(paste fileA.txt fileC.txt | sed 's/\(.*\)\t\(.*\)/s@\1@\2@g/')" fileB.txt
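To make the two stages easier to follow, here is a sketch that builds the generated sed script first, prints it, and then applies it. It is an illustration rather than a replacement for the one-liner: the temporary directory, the inlined sample data, and the `tab`, `script`, and `result` variable names are assumptions, and a literal tab is used instead of `\t` because `\t` in a sed regex is a GNU extension.

```shell
#!/bin/sh
# Sketch: show the intermediate sed script, then apply it.
# Assumption: the temp directory and sample data below are for illustration only.
tmp=$(mktemp -d)
printf '%s\n' BC423 33221 12AAA > "$tmp/fileA.txt"
printf '%s\n' 66FF7 11GYT 2HHJK > "$tmp/fileC.txt"
printf '%s\n' 4472534 8BC4232 3533221 333553D 8645141 2412AAA > "$tmp/fileB.txt"

tab=$(printf '\t')   # literal tab, the default delimiter that paste emits

# Stage 1: turn each "search<TAB>replacement" line into an s@search@replacement@g command.
script=$(paste "$tmp/fileA.txt" "$tmp/fileC.txt" |
         sed "s/\(.*\)${tab}\(.*\)/s@\1@\2@g/")
printf '%s\n' "$script"

# Stage 2: run the generated commands against fileB.txt.
result=$(sed "$script" "$tmp/fileB.txt")
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the sample data, the generated script consists of the three lines `s@BC423@66FF7@g`, `s@33221@11GYT@g`, and `s@12AAA@2HHJK@g`. The `@` delimiter only works as long as neither file contains an `@` character.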

1 Comment

Or perhaps paste -d/ fileA fileC|sed 's#.*#s/&/#' |sed -f - fileB?

Here is a one-liner with paste + awk + sed:

sed -f <(awk '{printf "s/%s/%s/g\n",$1,$2}' <(paste file{A,C}.txt)) fileB.txt

4472534
866FF72
3511GYT
333553D
8645141
242HHJK

2 Comments

This is working well, as well as the other code, thank you
May I suggest that you test on a huge fileB to see which solution is faster?

This might work for you (GNU sed & parallel):

parallel echo 's/{1}/{2}/' ::::+ file[AC] | sed -f - fileB

Build a sed script and then run the script with fileB as input.

N.B. ::::+ emulates the paste command, and {1} and {2} stand for the values on each line of fileA and fileC respectively.
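Where GNU parallel is not available, the same line-by-line pairing can be approximated in plain sh by reading the two files on separate file descriptors. This is a stand-in sketch, not the parallel-based method itself: the temporary directory, the inlined sample data, the `script.sed` file name, and the variable names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: pair line N of fileA with line N of fileC without GNU parallel,
# write one s/// command per pair, then run the script on fileB.
# Assumption: the temp directory and sample data below are for illustration only.
tmp=$(mktemp -d)
printf '%s\n' BC423 33221 12AAA > "$tmp/fileA"
printf '%s\n' 66FF7 11GYT 2HHJK > "$tmp/fileC"
printf '%s\n' 4472534 8BC4232 3533221 333553D 8645141 2412AAA > "$tmp/fileB"

# Read fileA on descriptor 3 and fileC on descriptor 4, one line from each per iteration.
while IFS= read -r search <&3 && IFS= read -r replace <&4; do
  printf 's/%s/%s/\n' "$search" "$replace"
done 3< "$tmp/fileA" 4< "$tmp/fileC" > "$tmp/script.sed"

result=$(sed -f "$tmp/script.sed" "$tmp/fileB")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The loop stops at the end of the shorter file, so both files are expected to have the same number of lines. As with the other sed-based answers, the patterns are interpreted as regular expressions and must not contain the `/` delimiter.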
