Remove columns using shell commands

Question

I have some large text files as input, which look like this:

# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      2 CAB         1.3730    1.7370   10.6500 C.3       1 P0G    0.0000
      3 CAC        -0.5820    0.2000   10.5350 C.3       1 P0G    0.0000
      4 OAD        -5.1220    5.7850    8.9220 O.2       1 P0G    0.0000
      5 OAE        -2.7610    6.1960    4.9010 O.3       1 P0G    0.0000
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000

...

Each of these files contains many lines like these:
      6 OAF        -0.8620    0.4430    6.3540 O.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
      8 CAH         0.1300   -3.0010   13.0720 C.ar      1 P0G    0.0000

My task is to use a Linux shell script with some combination of AWK and sed to remove all columns from those fragments except the first five, which are the ones relevant to me. After processing, the example file should look like this:

# USER_IP: 37.1.62.12 INTERFACE CHARMM-GUI
@<TRIPOS>MOLECULE
lig.pdb
54 56 1 0 0
SMALL
NO_CHARGES


@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 
      2 CAB         1.3730    1.7370   10.6500 
      3 CAC        -0.5820    0.2000   10.5350 
      4 OAD        -5.1220    5.7850    8.9220 
      5 OAE        -2.7610    6.1960    4.9010 
      6 OAF        -0.8620    0.4430    6.3540 
      7 CAG         0.7160   -2.5530   14.2490 
      8 CAH         0.1300   -3.0010   13.0720

The problem is that files of this type always have several lines (their number may differ) before the segments that should be processed. So my only idea is to use the line below

@<TRIPOS>ATOM

as a reference, and to process columns only in the lines that come after it.

I'd be thankful for several examples with short explanations.

Gleb

Wintermute · Accepted Answer · 2015-03-25 09:34:15Z

3

With GNU awk 4.0 or later:

gawk 'flag { split($0, f, " ", d); for(i = 1; i <= 5; ++i) printf("%s%s", d[i - 1], f[i]); print ""; next } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

Most of this is to keep the formatting intact; if the formatting does not matter, then

awk 'flag { NF = 5 } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

is a simpler way that works with older gawk and mawk as well. To make this work with BSD awk,

awk 'flag { NF = 5; $1 = $1 } /@<TRIPOS>ATOM/ { flag = 1 } 1' filename

is necessary ($1 = $1 just to force the rebuilding of the line). Thanks to @tripleee for commenting on this.
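To make the effect of the NF-based variant on the layout concrete, here is a small sketch that runs it on a shortened sample. The sample file and the function name `trim_atoms` are made up for illustration; the awk program is the BSD-safe one from above. With gawk or mawk the header lines should pass through untouched, while the trimmed lines are rebuilt with single spaces between fields. As noted in the comments under this answer, some awks ignore assignments to NF, so treat this as a sketch rather than a portable guarantee.

```shell
# Hypothetical helper: keep the first five columns of every line
# that follows the @<TRIPOS>ATOM marker; earlier lines are printed as-is.
trim_atoms() {
  awk 'flag { NF = 5; $1 = $1 } /@<TRIPOS>ATOM/ { flag = 1 } 1' "$1"
}

# Shortened, made-up sample in the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
lig.pdb
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
      7 CAG         0.7160   -2.5530   14.2490 C.ar      1 P0G    0.0000
EOF

trim_atoms "$sample"
rm -f "$sample"
```

The original column alignment is lost here because awk rejoins the remaining fields with its output field separator, a single space by default.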

The second piece of code simply adjusts the number of fields, causing the line to be rebuilt with fewer of them. The first does a bit more:

flag {                              # if we're already processing lines
  split($0, f, " ", d)              # split line into array f, save delimiters
                                    # into array d

  for(i = 1; i <= 5; ++i) {         # print the first five fields separated
    printf("%s%s", d[i - 1], f[i])  # by the saved delimiters
  }
  print ""                          # add newline
  next                              # that is all.
}
                                    # if we're not processing lines yet
/@<TRIPOS>ATOM/ { flag = 1 }        # check if we should, and if so set flag
1                                   # then print line unchanged.
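If gawk's four-argument split() is not available, a similar effect can probably be had with match() and substr(), which are part of POSIX awk. The following is only a sketch of that idea, not part of the original answer; the function name `trim_keep_layout` is made up, and fields are assumed to be separated by spaces or tabs.

```shell
# Hypothetical helper: trim lines after the marker to five columns,
# keeping the original spacing, using only POSIX awk features.
trim_keep_layout() {
  awk '
    flag {
      rest = $0
      out = ""
      for (i = 1; i <= 5; ++i) {
        # Match optional leading blanks followed by one field.
        if (!match(rest, /^[ \t]*[^ \t]+/))
          break                       # fewer than five fields on this line
        out = out substr(rest, 1, RLENGTH)
        rest = substr(rest, RLENGTH + 1)
      }
      print out
      next
    }
    /@<TRIPOS>ATOM/ { flag = 1 }      # start trimming on the next line
    1                                 # print header lines unchanged
  ' "$1"
}

# Shortened, made-up sample in the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
lig.pdb
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
EOF

trim_keep_layout "$sample"
rm -f "$sample"
```

Each loop iteration copies one field together with the whitespace in front of it, so the first five columns keep their original alignment and everything after them, including trailing whitespace, is dropped.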

Addendum: Another way that also preserves the formatting is to use sed:

sed '1,/@<TRIPOS>ATOM/ ! { s/\b[[:space:]]/\n/5; s/\n.*//; }' filename

That is:

1,/@<TRIPOS>ATOM/ ! {     # For those lines that are not in the range from
                          # the beginning to the first line containing
                          # @<TRIPOS>ATOM

  s/\b[[:space:]]/\n/5    # place a newline after the fifth column
  s/\n.*//                # then remove the newline and everything after it
}

This should work with both GNU sed and BSD sed. Since \b is not part of POSIX basic regexes, though, more esoteric seds may require a slight change:

sed '1,/@<TRIPOS>ATOM/ ! { s/\([^[:space:]]\)[[:space:]]/\1\n/5; s/\n.*//; }' filename

This works essentially the same way but uses a different regex to recognize the end of columns.
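Where the sed at hand supports extended regexes via -E (GNU sed and recent BSD seds do), the same trimming can be sketched as a single substitution that needs neither \b nor a temporary newline. This is an assumption-laden variant rather than part of the original answer: the function name `trim_with_sed` is made up, and the marker line is assumed not to be the first line of the file.

```shell
# Hypothetical helper: on every line after the first @<TRIPOS>ATOM line,
# keep the leading whitespace plus the first five columns and drop the rest.
trim_with_sed() {
  sed -E '1,/@<TRIPOS>ATOM/!s/^([[:space:]]*([^[:space:]]+[[:space:]]+){4}[^[:space:]]+).*/\1/' "$1"
}

# Shortened, made-up sample in the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
lig.pdb
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
EOF

trim_with_sed "$sample"
rm -f "$sample"
```

Lines with fewer than five columns do not match the pattern and are left unchanged.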

edited Mar 25, 2015 at 9:34

answered Mar 24, 2015 at 19:29

Wintermute


4 Comments

tripleee Over a year ago

I was not successful with setting NF = 5 on OS X awk. It would be nice if it worked, but it doesn't seem to work portably.

Wintermute Over a year ago

Does it work if you set NF = 5 and $1 = $1? I suspect the line rebuild isn't triggered in BSD awk just by setting NF, but we really want to set NF to avoid superfluous field separators at the end of the output.

Ed Morton Over a year ago

Setting NF will have no effect at all in some awks, regardless of whether or not you reassign $1 to itself, e.g. on Solaris: echo "1 2 3" | /usr/xpg4/bin/awk '{NF=2;$1=$1}1' outputs 1 2 3.

n0741337 Over a year ago

The gawk split() feature is neat. Absent that, I think the formatting could be preserved with printf "%7s%4s%15s%10s%10s\n", $1, $2, $3, $4, $5; next instead of NF = 5.
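The printf idea from the last comment could be wired into the flag-based script roughly as follows. This is a sketch: the function name `trim_fixed_width` is made up, the `NF >= 5` guard is an addition so that shorter lines pass through unchanged, and the column widths are the ones suggested in the comment, which only fit files laid out exactly like the sample.

```shell
# Hypothetical helper: reprint the first five columns with fixed widths
# taken from the comment above; lines before the marker are printed as-is.
trim_fixed_width() {
  awk 'flag && NF >= 5 { printf "%7s%4s%15s%10s%10s\n", $1, $2, $3, $4, $5; next }
       /@<TRIPOS>ATOM/ { flag = 1 }
       1' "$1"
}

# Shortened, made-up sample in the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
lig.pdb
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
EOF

trim_fixed_width "$sample"
rm -f "$sample"
```

Unlike the split() version, this imposes a layout instead of preserving one, so a field wider than its slot would shift the columns that follow it.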

dinox0r · Answer · 2015-03-24 19:32:58Z

-1

The following should work:

sed -n '/@<TRIPOS>ATOM/,$p' filename | tail -n +2 | tr -s " " | cut -d" " -f2-6

It works as follows:

Print only the lines from @<TRIPOS>ATOM to the end of the file:
```
sed -n '/@<TRIPOS>ATOM/,$p' filename
```
Omit the first of those lines (it contains @<TRIPOS>ATOM, which you don't want):
```
tail -n +2
```
Squeeze the runs of spaces between columns into single spaces:
```
tr -s " "
```
Cut the columns using a space as the delimiter and grab the fields you need. Each line in the sample starts with spaces, so after squeezing, field 1 is empty and the five columns are fields 2 to 6:
```
cut -d" " -f2-6
```

answered Mar 24, 2015 at 19:32

dinox0r


1 Comment

tripleee Over a year ago

-1: This removes the header completely. The header should be retained, and the columns adjusted only after the "tripos" line.
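One way the same sed, tr and cut tools might be arranged so that the header survives is sketched below. It is an illustration rather than the answerer's method: the function name `keep_header_trim` is made up, the file is read twice, and the marker line is assumed not to be the first line of the file.

```shell
# Hypothetical helper: print the header unchanged, then the remaining
# lines reduced to their first five space-separated columns.
keep_header_trim() {
  # Header: everything up to and including the marker line.
  sed '/@<TRIPOS>ATOM/q' "$1"
  # Body: drop the header, strip leading spaces, squeeze runs of spaces,
  # then keep fields 1 to 5.
  sed -e '1,/@<TRIPOS>ATOM/d' -e 's/^ *//' "$1" | tr -s ' ' | cut -d' ' -f1-5
}

# Shortened, made-up sample in the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
lig.pdb
@<TRIPOS>ATOM
      1 CAA         2.9880    0.1910   12.9830 C.3       1 P0G    0.0000
EOF

keep_header_trim "$sample"
rm -f "$sample"
```

Stripping the leading spaces before cut matters because cut would otherwise count the empty string in front of the first space as field 1.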
