How to remove identical columns in a csv file using Bash

Question

There are already many questions like this, but none of them helped me. I want to keep this simple:

I have a file (more than 90 columns) like:

Class,Gene,col3,Class,Gene,col6,Class
A,FF,23,A,FF,16,A
B,GG,45,B,GG,808,B
C,BB,43,C,BB,76,C

I want to keep unique columns so the desired output should be:

Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

I used awk '!a[$0]++' but it did not remove the repeated columns of the file.

As a side note: I have repeated columns because I used the paste command to join different files column-wise.

Do you know which fields are repeated, or do you have to determine it dynamically?
– Barmar, Commented Jun 29, 2020 at 20:22
The command you tried is for removing repeated rows, not columns.
– Barmar, Commented Jun 29, 2020 at 20:22
Yes, I do. Their headers are identical, and their rows are identical too.
– Apex, Commented Jun 29, 2020 at 20:23
Do you really want to remove all repeated columns, or just the Class and Gene columns? Should a line like A,FF,1,A,FF,1 be turned into A,FF,1 or A,FF,1,1?
– Barmar, Commented Jun 29, 2020 at 20:24
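
To see why the command from the question left the columns alone, here is a minimal sketch of what awk '!a[$0]++' does. $0 is the whole line, so the array a counts complete rows, and only the first copy of each row is printed. The function name dedupe_rows and the sample data are made up for illustration.

```shell
# Keep the first occurrence of every distinct line (row); reads CSV on stdin.
dedupe_rows() {
    awk '!a[$0]++'
}

# Hypothetical sample: the row "A,FF,23,A" appears twice, and so does the Class column.
printf '%s\n' 'Class,Gene,col3,Class' 'A,FF,23,A' 'A,FF,23,A' 'B,GG,45,B' | dedupe_rows
# Class,Gene,col3,Class
# A,FF,23,A
# B,GG,45,B
```

The repeated row is dropped, while every column, including the repeated Class column, passes through unchanged.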

anubhava · Accepted Answer · 2020-06-29 20:45:09Z

4

You may use this awk to print unique columns based on their names in first header row:

awk 'BEGIN {
   FS=OFS=","                        # set input/output field separators as comma
}
NR == 1 {                            # for first header row
   for (i=1; i<=NF; i++)             # loop through all columns
      if (!ucol[$i]++)               # if col name is not in a unique array
         hdr[i]                      # then store column no. in an array hdr
}
{
   for (i=1; i<=NF; i++)             # loop through all columns
      if (i in hdr)                  # if col no. is stored in array hdr
         printf "%s",(i==1?"":OFS) $i # print col, preceded by OFS except for col 1
   print ""                          # after the loop, print line break
}' file

Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

edited Jun 29, 2020 at 20:45

answered Jun 29, 2020 at 20:37

anubhava


4 Comments

Apex Over a year ago

Great, this worked, many thanks. However, it would be great if you could describe your code a bit so I can get a sense of it :)

MichalH Over a year ago

May I ask: is hdr[i] short for hdr[i]=i?

Apex Over a year ago

@anubhava Thank you so much 🙏

anubhava Over a year ago

@MichalH: No, hdr[i] just stores the key i without any value.
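
The point in the last comment can be checked directly. A minimal sketch (the function name show_key_creation is made up here): a bare reference to hdr[3] creates the key with an empty value, and the in operator then reports it as present, while a test with in does not create a key by itself.

```shell
show_key_creation() {
    awk 'BEGIN {
        hdr[3]                                                  # bare reference: creates key 3, value stays empty
        print ((3 in hdr) ? "3 is a key" : "3 is not a key")
        print ((4 in hdr) ? "4 is a key" : "4 is not a key")    # "in" only tests membership
        print "length of hdr[3]: " length(hdr[3])               # the stored value is the empty string
    }'
}

show_key_creation
# 3 is a key
# 4 is not a key
# length of hdr[3]: 0
```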

Ed Morton · Accepted Answer · 2020-06-29 21:42:24Z

For your specific case, where you're just trying to remove the 2 cols added by paste per original file, all you need is:

$ awk '
    BEGIN { FS=OFS="," }
    { r=$1 OFS $2; for (i=3; i<=NF; i+=3) r=r OFS $i; print r }
' file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

In other situations where it's not as simple, create an array (f[] below) that maps output field numbers to input field numbers, with the output fields determined by the uniqueness of the field/column names in the first line. Then loop through just the output field numbers, printing the value of the corresponding input field. Note that you don't have to loop through all of the input fields, just the ones that you're going to output:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==1 {
    for (i=1; i<=NF; i++) {
        if ( !seen[$i]++ ) {
            f[++nf] = i
        }
    }
}
{
    for (i=1; i<=nf; i++) {
        printf "%s%s", $(f[i]), (i<nf ? OFS : ORS)
    }
}

$ awk -f tst.awk file
Class,Gene,col3,col6
A,FF,23,16
B,GG,45,808
C,BB,43,76

Here's a version with more meaningful variable names and a couple of intermediate variables to clarify what's going on:

BEGIN { FS=OFS="," }
NR==1 {
    numInFlds = NF
    for (inFldNr=1; inFldNr<=numInFlds; inFldNr++) {
        fldName = $inFldNr
        if ( !seen[fldName]++ ) {
            out2in[++numOutFlds] = inFldNr
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        fldValue = $inFldNr
        printf "%s%s", fldValue, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
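
Both scripts above decide which columns to keep from the header names alone. If two columns could share a name but hold different data, one possible variant (not part of the original answer) compares whole columns instead. This is a rough sketch, assuming the file is small enough for one signature string per column to fit in memory and that fields contain no embedded commas or quotes. The function name dedupe_columns_by_content is made up for illustration.

```shell
# Print only the first of any columns whose header AND all values are identical.
# $1 is the CSV file; it is read twice, so it must be a regular file, not a pipe.
dedupe_columns_by_content() {
    awk '
    BEGIN { FS = OFS = "," }
    NR == FNR {                            # first pass: build one signature per column
        for (i = 1; i <= NF; i++)
            sig[i] = sig[i] SUBSEP $i      # SUBSEP keeps adjacent values from running together
        if (NF > maxnf) maxnf = NF
        next
    }
    FNR == 1 {                             # start of second pass: pick the columns to keep
        for (i = 1; i <= maxnf; i++)
            if (!seen[sig[i]]++)
                keep[++nkeep] = i
    }
    {
        for (j = 1; j <= nkeep; j++)
            printf "%s%s", $(keep[j]), (j < nkeep ? OFS : ORS)
    }' "$1" "$1"
}

# Usage, with "file" being the CSV from the question:
#   dedupe_columns_by_content file
```

The first pass over the file builds the signatures and the second pass prints the kept columns, which is why the file name is handed to awk twice.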

Barmar · Accepted Answer · 2020-06-29 20:49:06Z

1

Print the first two columns and then iterate in strides of 3 to skip the Class and Gene columns in the rest of the row.

awk -F, '{printf("%s,%s", $1, $2); for (i=3; i<=NF; i+=3) printf(",%s", $i); printf("\n")}'

edited Jun 29, 2020 at 20:49

answered Jun 29, 2020 at 20:28

Barmar

2 Comments

Apex Over a year ago

Unfortunately, this does not give col3, col6, ... headers in the output file.

Barmar Over a year ago

Needed to use %s instead of %d so it won't convert strings to 0.
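
The effect described in that last comment is easy to reproduce. A small sketch (the function name format_both_ways is made up for illustration): with %d, awk converts the value to a number first, so a header such as col3 becomes 0, while %s prints the text as-is.

```shell
# Format the same value once with %d and once with %s.
# $1 is the field value to format.
format_both_ways() {
    awk -v val="$1" 'BEGIN { printf "%%d gives %d, %%s gives %s\n", val, val }'
}

format_both_ways col3   # %d gives 0, %s gives col3
format_both_ways 23     # %d gives 23, %s gives 23
```

This is why the headers went missing from the output before the answer was edited, while the numeric data rows looked fine.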
