There are already a lot of questions like this, but none of them helped me. I want to keep this simple:

I have a file (more than 90 columns) like:

Class,Gene,col3,Class,Gene,col6,Class
A,FF,23,A,FF,16,A
B,GG,45,B,GG,808,B
C,BB,43,C,BB,76,C

I want to keep unique columns so the desired output should be:

Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

I used awk '!a[$0]++' but it did not remove the repeated columns of the file.

As a side note: I have repeated columns because I used the paste command to join different files column-wise.

  • Do you know which fields are repeated, or do you have to determine it dynamically? Commented Jun 29, 2020 at 20:22
  • The command you tried is for removing repeated rows, not columns. Commented Jun 29, 2020 at 20:22
  • Yes, I do. Their headers are identical, and their rows are identical too. Commented Jun 29, 2020 at 20:23
  • @Barmar Any help? Commented Jun 29, 2020 at 20:23
  • Do you really want to remove all repeated columns, or just the Class and Gene columns? Should a line like A,FF,1,A,FF,1 be turned into A,FF,1 or A,FF,1,1? Commented Jun 29, 2020 at 20:24
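To illustrate the comment above about rows versus columns, here is a minimal sketch. The file names rows.csv and rows_dedup.csv and the sample data are made up for this example; the point is only that '!a[$0]++' keys on the whole line, so it can drop a repeated row but never a repeated column.

```shell
# Hypothetical sample: line 3 repeats line 2 exactly, and column 3 repeats column 1.
printf '%s\n' 'Class,Gene,Class' 'A,FF,A' 'A,FF,A' 'B,GG,B' > rows.csv

# a[$0]++ is 0 (false) the first time a whole line is seen, so !a[$0]++ is true
# only for first occurrences: duplicate ROWS are dropped, columns are untouched.
awk '!a[$0]++' rows.csv > rows_dedup.csv
cat rows_dedup.csv
```

The repeated row disappears, but the repeated Class column is still there in every remaining line.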

3 Answers

You may use this awk to print unique columns based on their names in first header row:

awk 'BEGIN {
   FS=OFS=","                        # set input/output field separators as comma
}
NR == 1 {                            # for first header row
   for (i=1; i<=NF; i++)             # loop through all columns
      if (!ucol[$i]++)               # if col name has not been seen before
         hdr[i]                      # then store column no. as a key in array hdr
}
{
   for (i=1; i<=NF; i++)             # loop through all columns
      if (i in hdr)                  # if col no. is a key in array hdr
         printf "%s",(i==1?"":OFS) $i # print col, preceded by OFS unless it is col 1
   print ""                          # after the loop, print line break
}' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

4 Comments

Great this worked, many thanks. However, it would be great if you could describe your code a bit then I can get a sense of it :)
May I ask, is hdr[i] short for hdr[i]=i?
@anubhava Thank you so much 🙏
@MichalH: No, hdr[i] just creates key i in the array without assigning any value.
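A small sketch of what that last comment describes, in case it helps. The output file name keys.txt is made up for the example; the awk part relies only on the fact that referencing an array element creates its key.

```shell
awk 'BEGIN {
    hdr[3]                                   # merely referencing the element creates key 3
    print ( (3 in hdr) ? "3 is a key" : "3 is not a key" )
    print ( (4 in hdr) ? "4 is a key" : "4 is not a key" )
    print "value of hdr[3]: [" hdr[3] "]"    # empty string: no value was assigned
}' > keys.txt
cat keys.txt
```

This is why the answer can test membership with (i in hdr) later on: only the keys matter, not the values.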

For your specific case, where you're just trying to remove the 2 columns that paste added per original file, all you need is:

$ awk '
    BEGIN { FS=OFS="," }
    { r=$1 OFS $2; for (i=3; i<=NF; i+=3) r=r OFS $i; print r }
' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

In other situations where it's not as simple, create an array (f[] below) that maps output field numbers (determined by the uniqueness of the field/column names on the first line) to input field numbers. Then loop through just the output field numbers, printing the value of the corresponding input field. Note that you don't have to loop through all of the input fields, only the ones you're going to output:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==1 {
    for (i=1; i<=NF; i++) {
        if ( !seen[$i]++ ) {
            f[++nf] = i
        }
    }
}
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}

$ awk -f tst.awk file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

Here's a version with more meaningful variable names and a couple of intermediate variables to clarify what's going on:

BEGIN { FS=OFS="," }
NR==1 {
    numInFlds = NF
    for (inFldNr=1; inFldNr<=numInFlds; inFldNr++) {
        fldName = $inFldNr
        if ( !seen[fldName]++ ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        fldValue = $inFldNr
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
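The scripts above decide uniqueness from the header names alone. If you would rather drop a column only when its header and every one of its values repeat an earlier column, one possible variation is a two-pass sketch like the one below. This is an assumption about what you might want, not part of the original answer; the file names sample.csv and dedup.csv and the array name content[] are made up, and the whole file's content is held in memory.

```shell
# Hypothetical sample: the last column reuses the header "col3" but holds different values.
printf '%s\n' 'Class,col3,Class,col3' 'A,23,A,16' 'B,45,B,808' > sample.csv

# The file is named twice so awk reads it twice.
# Pass 1 (NR==FNR) collects the full content of each column.
# Pass 2 prints only columns whose content has not been seen before.
awk '
BEGIN { FS=OFS="," }
NR==FNR {
    for (i=1; i<=NF; i++) content[i] = content[i] FS $i
    if (NF > numInFlds) numInFlds = NF
    next
}
FNR==1 {
    for (i=1; i<=numInFlds; i++)
        if ( !seen[content[i]]++ ) out2in[++numOutFlds] = i
}
{
    for (o=1; o<=numOutFlds; o++)
        printf "%s%s", $(out2in[o]), (o<numOutFlds ? OFS : ORS)
}
' sample.csv sample.csv > dedup.csv
cat dedup.csv
```

Here the second Class column is dropped because it is identical to the first, while the second col3 column is kept because its values differ.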

Print the first two columns and then iterate in strides of 3 to skip the Class and Gene columns in the rest of the row.

awk -F, '{printf("%s,%s", $1, $2); for (i=3; i<=NF; i+=3) printf(",%s", $i); printf("\n")}' file

2 Comments

Unfortunately, this does not give col3, col6, ... headers in the output file.
Needed to use %s instead of %d so it won't convert strings to 0.
