Skip to main content
added 1 character in body
Source Link
terdon
  • 252.7k
  • 69
  • 481
  • 719

Here's one way:

paste file1.fa file2.fa | 
    sed -E 's/\s+>/-/; s/\s+//'g' | 
        awk -v c=0 '{ if(/^>/){c++} print > "file"c".pasted.fa"; }'

To explain this, let's have a look at what each command outputs:

$ paste file1.fa file2.fa 
>ID_000_FLNNKGHD_01376  >ID_000_KGHDAAD_06245 
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAAT   AAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388  >ID_000_KOAAFG_40481 
ATGAAGGTGGAAAAAACACCGCTTGCATTT  CCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746  >ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA    --AAATTGGTGG---------ACACCGCTTTT--

So this will print each line from each file next to each other. Line 1 from file1 with line 1 from file 2, line 2 from file1 with line 2 from file2 etc. However, it has some extra spaces and an extra > which we need to get rid of. That's what the sed is doing:

$ paste file1.fa file2.fa | sed -E 's/\s+>/-/; s/\s+//' 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--

The last step, the awk script will:

  • awk -v c=0 : start awk and set the variable c to 0.
  • if(/^>/){c++} : add 1 to the vallue of c every time we find a line that starts with >.
  • print > "file"c".pasted.fa" : print the current line into a file called file, then the current value of c and the .pasted.fa.

The final result when run on your example is:

$ ls *pasted*
file1.pasted.fa  file2.pasted.fa  file3.pasted.fa

$ cat file1.pasted.fa 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
$ cat file2.pasted.fa 
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
$ cat file3.pasted.fa 
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--

Here's one way:

paste file1.fa file2.fa | 
    sed -E 's/\s+>/-/; s/\s+//' | 
        awk -v c=0 '{ if(/^>/){c++} print > "file"c".pasted.fa"; }'

To explain this, let's have a look at what each command outputs:

$ paste file1.fa file2.fa 
>ID_000_FLNNKGHD_01376  >ID_000_KGHDAAD_06245 
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAAT   AAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388  >ID_000_KOAAFG_40481 
ATGAAGGTGGAAAAAACACCGCTTGCATTT  CCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746  >ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA    --AAATTGGTGG---------ACACCGCTTTT--

So this will print each line from each file next to each other. Line 1 from file1 with line 1 from file 2, line 2 from file1 with line 2 from file2 etc. However, it has some extra spaces and an extra > which we need to get rid of. That's what the sed is doing:

$ paste file1.fa file2.fa | sed -E 's/\s+>/-/; s/\s+//' 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--

The last step, the awk script will:

  • awk -v c=0 : start awk and set the variable c to 0.
  • if(/^>/){c++} : add 1 to the vallue of c every time we find a line that starts with >.
  • print > "file"c".pasted.fa" : print the current line into a file called file, then the current value of c and the .pasted.fa.

The final result when run on your example is:

$ ls *pasted*
file1.pasted.fa  file2.pasted.fa  file3.pasted.fa

$ cat file1.pasted.fa 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
$ cat file2.pasted.fa 
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
$ cat file3.pasted.fa 
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--

Here's one way:

paste file1.fa file2.fa | 
    sed -E 's/\s+>/-/; s/\s+//g' | 
        awk -v c=0 '{ if(/^>/){c++} print > "file"c".pasted.fa"; }'

To explain this, let's have a look at what each command outputs:

$ paste file1.fa file2.fa 
>ID_000_FLNNKGHD_01376  >ID_000_KGHDAAD_06245 
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAAT   AAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388  >ID_000_KOAAFG_40481 
ATGAAGGTGGAAAAAACACCGCTTGCATTT  CCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746  >ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA    --AAATTGGTGG---------ACACCGCTTTT--

So this will print each line from each file next to each other. Line 1 from file1 with line 1 from file 2, line 2 from file1 with line 2 from file2 etc. However, it has some extra spaces and an extra > which we need to get rid of. That's what the sed is doing:

$ paste file1.fa file2.fa | sed -E 's/\s+>/-/; s/\s+//' 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--

The last step, the awk script will:

  • awk -v c=0 : start awk and set the variable c to 0.
  • if(/^>/){c++} : add 1 to the vallue of c every time we find a line that starts with >.
  • print > "file"c".pasted.fa" : print the current line into a file called file, then the current value of c and the .pasted.fa.

The final result when run on your example is:

$ ls *pasted*
file1.pasted.fa  file2.pasted.fa  file3.pasted.fa

$ cat file1.pasted.fa 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
$ cat file2.pasted.fa 
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
$ cat file3.pasted.fa 
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--
Source Link
terdon
  • 252.7k
  • 69
  • 481
  • 719

Here's one way:

paste file1.fa file2.fa | 
    sed -E 's/\s+>/-/; s/\s+//' | 
        awk -v c=0 '{ if(/^>/){c++} print > "file"c".pasted.fa"; }'

To explain this, let's have a look at what each command outputs:

$ paste file1.fa file2.fa 
>ID_000_FLNNKGHD_01376  >ID_000_KGHDAAD_06245 
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAAT   AAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388  >ID_000_KOAAFG_40481 
ATGAAGGTGGAAAAAACACCGCTTGCATTT  CCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746  >ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA    --AAATTGGTGG---------ACACCGCTTTT--

So this will print each line from each file next to each other. Line 1 from file1 with line 1 from file 2, line 2 from file1 with line 2 from file2 etc. However, it has some extra spaces and an extra > which we need to get rid of. That's what the sed is doing:

$ paste file1.fa file2.fa | sed -E 's/\s+>/-/; s/\s+//' 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--

The last step, the awk script will:

  • awk -v c=0 : start awk and set the variable c to 0.
  • if(/^>/){c++} : add 1 to the vallue of c every time we find a line that starts with >.
  • print > "file"c".pasted.fa" : print the current line into a file called file, then the current value of c and the .pasted.fa.

The final result when run on your example is:

$ ls *pasted*
file1.pasted.fa  file2.pasted.fa  file3.pasted.fa

$ cat file1.pasted.fa 
>ID_000_FLNNKGHD_01376-ID_000_KGHDAAD_06245
-ATGAATACAGAGGAAAAAACACCGCTTGCATACAATAAATACAGAGGAAAAAACACCGCTTGCATACAAT
$ cat file2.pasted.fa 
>ID_000_MGCDKLCO_02388-ID_000_KOAAFG_40481
ATGAAGGTGGAAAAAACACCGCTTGCATTTCCCCAGGAAGGTGGAAAAAACACCGCTTGCAAA
$ cat file3.pasted.fa 
>ID_000_OMAMOGKP_02746-ID_000_GPAAAGVV_07764
--ATGTTGGTGGAAAAAACACCGCTTGCGGTA--AAATTGGTGG---------ACACCGCTTTT--