I have some large text files as input, which look like this:

# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      2 CAB         1.3730    1.7370   10.6500 C.3       1 P0G    0.0000
      3 CAC        -0.5820    0.2000   10.5350 C.3       1 P0G    0.0000
      4 OAD        -5.1220    5.7850    8.9220 O.2       1 P0G    0.0000
      5 OAE        -2.7610    6.1960    4.9010 O.3       1 P0G    0.0000
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000

...

Each file contains many lines like these:
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000

My task is to use a Linux shell script with some combination of AWK and sed to remove all columns from those lines except the first five, which are the ones relevant to me. After processing, the example file should look like this:

# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 
      2 CAB         1.3730    1.7370   10.6500 
      3 CAC        -0.5820    0.2000   10.5350 
      4 OAD        -5.1220    5.7850    8.9220 
      5 OAE        -2.7610    6.1960    4.9010 
      6 OAF        -0.8620    0.4430    6.3540 
      7 CAG         0.7160   -2.5530   14.2490 
      8 CAH         0.1300   -3.0010   13.0720 

The problem is that files of this type always have several lines (the number may differ) before the segment that should be processed. So my only idea is to use the line

@<TRIPOS>ATOM

as a reference and to start trimming columns only in the lines that come after it.

I'd be thankful for a few examples with short explanations.

Gleb

2 Answers

3

With GNU awk 4.0 or later:

gawk 'flag { split($0, f, " ", d); for(i = 1; i <= 5; ++i) printf("%s%s", d[i - 1], f[i]); print ""; next } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

Most of this is to keep the formatting intact; if the formatting does not matter, then

awk 'flag { NF = 5 } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

is a simpler way that works with older gawk and mawk as well. To make this work with BSD awk,

awk 'flag { NF = 5; $1 = $1 } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

is necessary ($1 = $1 merely forces the line to be rebuilt). Thanks to @tripleee for commenting on this.
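
For awks where assigning to NF has no effect at all (the comments below report this for Solaris /usr/xpg4/bin/awk), one possible workaround is to print the first five fields explicitly instead of shrinking NF. The following is only a sketch under that assumption: the temporary file and the variable names sample and result are made up for illustration, and the NF > 5 guard is my own addition so that blank or short lines pass through unchanged. As with the NF = 5 variants, the original spacing is collapsed to single spaces.

```sh
# Hypothetical sample input, built from lines shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      2 CAB         1.3730    1.7370   10.6500 C.3       1 P0G    0.0000
EOF

# After the marker line, print only the first five fields of any line
# that has more than five; every other line is printed unchanged.
result=$(awk 'flag && NF > 5 { print $1, $2, $3, $4, $5; next }
              /@<TRIPOS>ATOM/ { flag = 1 }
              1' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Because this relies only on field references and print, it should not depend on how a particular awk handles assignments to NF.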

The NF = 5 variants simply shrink the number of fields, causing the line to be rebuilt with fewer of them (and with single spaces between them). The first command does a bit more:

flag {                              # if we're already processing lines
  split($0, f, " ", d)              # split line into array f, save delimiters
                                    # into array d

  for(i = 1; i <= 5; ++i) {         # print the first five fields separated
    printf("%s%s", d[i - 1], f[i])  # by the saved delimiters
  }
  print ""                          # add newline
  next                              # that is all.
}
                                    # if we're not processing lines yet
/@<TRIPOS>ATOM/ { flag = 1 }        # check if we should, and if so set flag
1                                   # then print line unchanged.
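
If the four-argument split() of gawk is not available, the same format-preserving idea can probably be approximated with match() and substr(), which are part of standard awk. This is a sketch rather than a drop-in replacement: the temporary file and the names sample, result, rest and out are assumptions made for the example, and only spaces and tabs are treated as field separators.

```sh
# Hypothetical sample input, built from lines shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
NO_CHARGES

@<TRIPOS>ATOM
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000
EOF

result=$(awk '
flag {
    rest = $0; out = ""
    # Peel off up to five "leading whitespace + field" chunks, spacing intact.
    for (i = 1; i <= 5 && match(rest, /^[ \t]*[^ \t]+/); ++i) {
        out = out substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
    }
    print out
    next
}
/@<TRIPOS>ATOM/ { flag = 1 }
1' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

Each loop iteration copies one field together with the whitespace in front of it, so the column alignment of the kept fields stays as it was in the input.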

Addendum: Another way that also preserves the formatting is to use sed:

sed '1,/@<TRIPOS>ATOM/ ! { s/\b[[:space:]]/\n/5; s/\n.*//; }' filename

That is:

1,/@<TRIPOS>ATOM/ ! {     # For those lines that are not in the range from
                          # the beginning to the first line containing
                          # @<TRIPOS>ATOM

  s/\b[[:space:]]/\n/5    # place a newline after the fifth column
  s/\n.*//                # then remove the newline and everything after it
}

This should work with both GNU sed and BSD sed. Since \b is not part of POSIX basic regular expressions, though, less common sed implementations may require a slight change:

sed '1,/@<TRIPOS>ATOM/ ! { s/\([^[:space:]]\)[[:space:]]/\1\n/5; s/\n.*//; }' filename

This works essentially the same way but uses a different regex to recognize the end of columns.
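
Both sed commands above also rely on \n in the replacement producing a newline, which POSIX leaves unspecified. A sketch that avoids this by capturing the first five columns in a group and discarding the rest of the line is shown below. The temporary file and the names sample and result are assumptions for the example, and the command only rewrites lines that have at least five columns.

```sh
# Hypothetical sample input, built from lines shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      4 OAD        -5.1220    5.7850    8.9220 O.2       1 P0G    0.0000
EOF

# Outside the range "line 1 .. marker line": keep group 1, which is the
# leading whitespace, one field, and four more "whitespace + field" pairs.
result=$(sed '1,/@<TRIPOS>ATOM/!s/^\([[:space:]]*[^[:space:]]\{1,\}\([[:space:]]\{1,\}[^[:space:]]\{1,\}\)\{4\}\).*/\1/' "$sample")

printf '%s\n' "$result"
rm -f "$sample"
```

The pattern uses only basic regular expression syntax (bracket expressions, \{m,n\} intervals and back-references), so it should be less dependent on sed extensions than the newline trick.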


4 Comments

I was not successful with setting NF = 5 on OS X awk. It would be nice if it worked, but it doesn't seem to work portably.
Does it work if you set NF = 5 and $1 = $1? I suspect the line rebuild isn't triggered in BSD awk just by setting NF, but we really want to set NF to avoid superfluous field separators at the end of the output.
Setting NF will have no effect at all in some awks, regardless of whether or not you reassign $1 to itself, e.g. on Solaris: echo "1 2 3" | /usr/xpg4/bin/awk '{NF=2;$1=$1}1' outputs 1 2 3.
The gawk split() feature is neat. Absent that, I think the formatting could be preserved with printf "%7s%4s%15s%10s%10s\n", $1, $2, $3, $4, $5; next instead of NF = 5.
-1

The following should work:

sed -n '/@<TRIPOS>ATOM/,$p' filename | tail -n +2 | tr -s " " | cut -d" " -f2-6

It works as follows:

  1. Print only the lines from @<TRIPOS>ATOM onward:

    sed -n '/@<TRIPOS>ATOM/,$p' filename
    
  2. Omit the first line (which contains @<TRIPOS>ATOM and you don't want that):

    tail -n +2
    
  3. Squeeze the runs of spaces between columns into single spaces:

    tr -s " "
    
  4. Cut the columns using a space as the delimiter and grab the fields you need. Each line still starts with one space, so field 1 is empty and the five columns are fields 2-6:

    cut -d" " -f2-6
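
The pipeline above drops everything up to and including the @<TRIPOS>ATOM line. If the header has to be kept, one possible variation is to print the header with a separate sed call and run the tr/cut part only on the remainder. This is a sketch under assumptions: the temporary file and the names sample and result are made up for the example. Note that after tr -s each data line still begins with a single space, so cut sees an empty first field and the five columns of interest are fields 2 to 6.

```sh
# Hypothetical sample input, built from lines shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
SMALL
NO_CHARGES
@<TRIPOS>ATOM
      5 OAE        -2.7610    6.1960    4.9010 O.3       1 P0G    0.0000
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
EOF

result=$(
    {
        # Header: print up to and including the marker line, then quit.
        sed '/@<TRIPOS>ATOM/q' "$sample"
        # Body: drop the header, squeeze spaces, keep fields 2-6.
        sed '1,/@<TRIPOS>ATOM/d' "$sample" | tr -s ' ' | cut -d' ' -f2-6
    }
)

printf '%s\n' "$result"
rm -f "$sample"
```

This reads the file twice and does not preserve the column alignment, so for large files the awk or sed one-liners in the other answer are likely the better fit.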
    

1 Comment

-1: This removes the header completely. The header should be retained, and the columns adjusted only after the "tripos" line.
