count duplicate entries within each column

Question

My file contains a hundred columns.

I need to:

first, count the duplicate entries within each column;
then display the output columns ordered by their number of rows (max to min).

Input file:

a     b     c     d     e
11    11    22    56    11
11    44    56    89    11
12    56    78    91    11 
22    60    78          11
22    60    91
      60    98
      91
      91
      95

Output file:

b       c       a       d       e
11      22      11(2)   56      11(4)
44      56      12      89
56      78(2)   22(2)   91
60(3)   91
91(2)   98
95

Ed Morton · Accepted Answer · 2022-02-12 19:00:35Z

With GNU awk for arrays of arrays and sorted_in:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR == 1 {
    numCols = split($0,tags)
    next
}
{
    for ( colNr=1; colNr<=NF; colNr++ ) {
        val = $colNr
        if ( val != "" ) {
            if ( !seen[colNr][val]++ ) {
                ++colRowNrs[colNr]
            }
            rowNr = colRowNrs[colNr]
            numRows = ( rowNr > numRows ? rowNr : numRows )
            rowColVals[rowNr][colNr] = val
            rowColCnts[rowNr][colNr]++
        }
    }
}
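END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    # Choose the separator by output position, not by the original column
    # number, so the newline always follows the last column printed.
    numOut = length(colRowNrs)
    outNr = 0
    for ( colNr in colRowNrs ) {
        tag = tags[colNr]
        printf "%s%s", tag, (++outNr<numOut ? OFS : ORS)
    }
    for ( rowNr=1; rowNr<=numRows; rowNr++ ) {
        outNr = 0
        for ( colNr in colRowNrs ) {
            val = rowColVals[rowNr][colNr]
            cnt = rowColCnts[rowNr][colNr]
            printf "%s%s%s", val, (cnt > 1 ? "("cnt")" : ""), (++outNr<numOut ? OFS : ORS)
        }
    }
}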

$ awk -f tst.awk file
b       c       a       d       e
11      22      11(2)   56      11(4)
44      56      12      89
56      78(2)   22(2)   91
60(3)   91
91(2)   98
95

The above assumes your input is tab-separated - if that's wrong then edit your question to clarify.
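If the file is actually space-padded into fixed-width columns (the sample in the question looks like 6-character columns, but that is only a guess), one way to get tab-separated input for the script above is to slice each line by position first. A minimal Python sketch; the function name fixed_width_to_tsv and the default width of 6 are assumptions made for illustration, not something the question confirms:

```python
def fixed_width_to_tsv(line, width=6):
    """Slice a space-padded line into width-character cells and join them with tabs."""
    line = line.rstrip("\n")
    # Each cell occupies `width` characters; strip the padding around the value.
    cells = [line[i:i + width].strip() for i in range(0, len(line), width)]
    return "\t".join(cells)


if __name__ == "__main__":
    sample = [
        "a     b     c     d     e",
        "22    60    78          11",
        "      91",
    ]
    for row in sample:
        # repr() makes the tabs and the empty cells visible.
        print(repr(fixed_width_to_tsv(row)))
```

Short lines simply yield fewer fields than the header, which the awk script tolerates because it loops up to NF on each line.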

guest_7 · Accepted Answer · 2022-02-11 08:19:05Z

One way, using standard Linux utilities:

if="$PWD/file"
tmp=$(mktemp -d) || exit
cd -- "$tmp"

nf=$(awk -F '\t' '{print NF;exit}' "$if")

printf -v fmt '%%0%dd\n' "$(expr "$nf" : '.*')"

for i in $(seq "$nf"); do
  cut -f"$i" "$if" | 
  grep -E '[^[:space:]]' |
  uniq -c | awk '{$0 = $2 \
   ($1>1 ? "("$1")" : "")}1' \
  > "$(printf "$fmt" "$i")"
done

wc -l * | sed 's/^\s*//;$d' |
sort -k1,1nr -k2 |
 cut -d" " -f2 |
 sed -e '${y/\n/\t/;q;}' -e 'N;H;z;x;D' |
xargs -l  paste | column -t

Output:

b      c      a      d   e
11     22     11(2)  56  11(4)
44     56     12     89  
56     78(2)  22(2)  91  
60(3)  91                
91(2)  98                
95

guest_7 · Accepted Answer · 2022-02-11 08:27:09Z

You can use Perl data structures to solve this problem.

perl -F'/\t/,$_,-1'  -lane '
  $.==1 && do{
    @hdr = @F; next;
  };
  for (0..$#F) {
    my $e = $F[$_];
    next if $e eq "";
    my $main_key = $hdr[$_];
    $h{$main_key}{$e}++;
  }
  }{
  my @AoA =
  map {
    my $key = $hdr[$_];
    my $href = $h{$key};
    local($a,$b);
    [
      $key,
      map {
        my $k = $href->{$_};
        $k > 1 ? qq[$_($k)] : $_;
      }
      sort {
        $a <=> $b
      }
      keys %$href
    ]
  }
  sort {
    keys %{$h{$hdr[$b]}} <=> 
    keys %{$h{$hdr[$a]}} ||
    $a <=> $b
  }
  0..$#hdr;

  # output
  local $, = "\t";
  for (my $i=0; $AoA[0][$i] ne ""; $i++) {
    print map($AoA[$_][$i],0..$#AoA)
  }
' file

Output:

b      c      a      d   e
11     22     11(2)  56  11(4)
44     56     12     89  
56     78(2)  22(2)  91  
60(3)  91                
91(2)  98                
95

αғsнιη · Accepted Answer · 2022-02-12 14:28:54Z

Using GNU awk for PROCINFO["sorted_in"], and assuming the columns are delimited by a single space or tab character (otherwise the position of each field would vary from line to line). In other words, the input is expected to look like the following, where fields are enclosed in brackets for clarity; every line has the same number of fields, but some fields are empty:

[a]   [b]   [c]   [d]   [e]
[11]  [11]  [22]  [56]  [11]
[11]  [44]  [56]  [89]  [11]
[12]  [56]  [78]  [91]  [11]
[22]  [60]  [78]  []    [11]
[22]  [60]  [91]  []    []
[]    [60]  [98]  []    []
[]    [91]  []    []    []
[]    [91]  []    []    []
[]    [95]  []    []    []

Here I assume the delimiter is a single tab character; if it is a space, replace \t with a space, i.e. change -F'[\t]' -v OFS='\t' to -F'[ ]' -v OFS=' '.

gawk -F'[\t]' -v OFS='\t' '
function prnt(){
    for(e=1;e<=NF;e++) printf "%s", buf[e] (e==NF?ORS:OFS); split("", buf)
};
NR==1{ split($0, hdr) }
NR >1{ for(i=1; i<=NF; i++) $i!=""&& count[$i"," i]++ }

END  { for(k in count) { split(k, tmp, ","); uniq[tmp[2]]++ }
       PROCINFO["sorted_in"]="@val_num_desc"
       for(u in uniq) {
           if(!ent)ent=uniq[u]*(NF+1);
           printf "%s" ,(s?OFS:"") hdr[u]; ordr[++s]=u
       }; print ""

       PROCINFO["sorted_in"]="@ind_num_asc";
       while(v++<=ent){
           if(j++==NF){ j=0; prnt() }
           for(k in count){
               split(k, tmp, ",")
               if(tmp[2]==ordr[j]){
                   buf[j]=tmp[1] (count[k]>1?"("count[k]")":"")
                   delete count[k]; break
               };
           };
       };
}' infile

Output:

b       c       a       d       e
11      22      11(2)   56      11(4)
44      56      12      89
56      78(2)   22(2)   91
60(3)   91
91(2)   98
95

Enclosing within brackets:

b       c       a       d       e
[11]    [22]    [11(2)] [56]    [11(4)]
[44]    [56]    [12]    [89]    []
[56]    [78(2)] [22(2)] [91]    []
[60(3)] [91]    []      []      []
[91(2)] [98]    []      []      []
[95]    []      []      []      []

NR==1{ split($0, hdr) }: on the first line (NR==1; NR is the Number of Records read so far), split() it into pieces and store them in the hdr array.
NR >1{ for(i=1; i<=NF; i++) $i!=""&& count[$i"," i]++ }: for every line other than the first (NR>1), loop over the fields (columns; NF is the Number of Fields in the line) and, if the current field is not the empty string ($i!=""), increment that value's counter for that column#. In the count[$i"," i]++ array the key is the combination columnValue,column# (joined with a comma) and the value is the number of times that entry has been seen in that column#.
In the END{ ... } block we do several things:
- for(k in count) { split(k, tmp, ","); uniq[tmp[2]]++ }: loop over the keys of the count array and split() each key (k) on the comma into a temporary array tmp (as said above, the key is columnValue,column#), take the second part (the column#) and count how often each column# occurs, which gives the number of unique entries in each column# (uniq[tmp[2]]++). We need this to order the output from the column with the most entries to the column with the fewest; the keys of uniq are the column#s and the values are the numbers of unique entries in each column.
- PROCINFO["sorted_in"]="@val_num_desc": this controls the order in which GNU awk traverses arrays in for-in loops; here it is by value, numerically, descending.
- for(u in uniq) { ... }: loop over the uniq array created earlier.
  - if(!ent)ent=uniq[u]*(NF+1);: uniq is traversed in descending order of value, so the first element is the column# with the most unique entries. ent is set to that maximum multiplied by NF+1, which bounds the number of cells in the 2D output we are going to rebuild.
  - printf "%s" ,(s?OFS:"") hdr[u];: print the header elements from the column# with the most entries to the column# with the fewest (because of the traversal order, hdr[u] is printed in max->min order).
  - ordr[++s]=u: remember which column# goes in which output position; ordr has the keys 1, 2, 3, 4, 5, ... and the column#s as values, in output order.
- PROCINFO["sorted_in"]="@ind_num_asc";: switch the traversal order to index, numeric, ascending.
- while(v++<=ent){ ... }: loop while v<=ent, i.e. until all cells of the 2D matrix have been visited (in this example ent=6*6=36):
  - if(j++==NF){ j=0; prnt() }: when the control variable j equals NF we have filled a full line of NF fields, so reset j=0 and call prnt() to output the line, then carry on with the code below for the next line;
  - for(k in count){ ... }: loop over the count array, which contains the entries and their column positions.
    - split(k, tmp, ","): split() the key (k) on the comma into the temporary array tmp.
    - if(tmp[2]==ordr[j]): is the second part of the key (the column#) the column that belongs in output position j according to ordr?
      - buf[j]=tmp[1] (count[k]>1?"("count[k]")":""): build the field from the first part of the key, tmp[1] (the columnValue), followed by the number of occurrences from count[k] in parentheses; your expected output omits the count for entries that occur only once, so it is appended only if count[k]>1.
      - delete count[k]; break: delete the key that has just been processed, break out of the inner for loop and continue with the outer while loop;
    - otherwise, test the next entry.
- go back to the top of the while loop and process the remaining entries.
End of the END{ ... } block.
function prnt(){ ... }: output the rebuilt line and empty the buf array for the next one.
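The bookkeeping described above can be sketched in Python for readers who find the awk hard to follow. This is a loose illustration rather than a translation of the gawk script: it keeps the same two tables (a count per value/column pair, and a per-column tally of distinct values), but uses a tuple key instead of a comma-joined string and skips the ent/buf machinery. The names tabulate, count, uniq and order are chosen here for illustration, and it assumes no data line has more fields than the header.

```python
from itertools import zip_longest


def tabulate(lines, sep="\t"):
    """Return output rows with columns ordered by number of distinct values, most first."""
    header = lines[0].split(sep)
    count = {}  # (value, column index) -> occurrences; dicts keep insertion order
    for line in lines[1:]:
        for col, value in enumerate(line.split(sep)):
            if value != "":
                count[(value, col)] = count.get((value, col), 0) + 1

    # uniq[col] = number of distinct values seen in that column.
    # Columns that never hold a value do not appear here and are dropped.
    uniq = {}
    for _value, col in count:
        uniq[col] = uniq.get(col, 0) + 1

    # Columns with the most distinct values come first; ties keep input order.
    order = sorted(uniq, key=lambda col: uniq[col], reverse=True)

    columns = []
    for col in order:
        cells = [header[col]]
        for (value, c), n in count.items():
            if c == col:
                cells.append(f"{value}({n})" if n > 1 else value)
        columns.append(cells)

    # Pad shorter columns with empty cells and read the table row by row.
    return [sep.join(row) for row in zip_longest(*columns, fillvalue="")]
```

Like the gawk script, this counts every occurrence of a value in a column, whether or not the repeats are adjacent.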

Stéphane Chazelas · Accepted Answer · 2022-02-13 07:40:56Z

It becomes easier if you transpose your table so columns become lines. You could do that with BSD rs for instance:

< yourfile unexpand -t6 | # convert to TSV, assuming columns are 6 characters wide
  rs -nTc | # transpose
  gawk -v OFS='\t' '{
    c=0
    for (i=j=2;++j<=NF+1;)
      if ($i == $j"")
        c++
      else{
        if (c++) $i=$i"("c")"
        $++i=$j
        c=0
      }
    print NF=--i,$0
  }' |
  sort -rn |
  cut -f2- |
  rs -nTc
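The same transpose-then-process idea can be sketched in Python for systems without rs. This is an illustration under assumptions, not a port of the pipeline above: it expects the table as a list of rows with the header first, counts only adjacent repeats, and keeps the input order for columns of equal length. The function name count_runs_by_column is made up here.

```python
from itertools import groupby, zip_longest


def count_runs_by_column(rows):
    """rows: list of lists of strings, header row first. Returns the reordered rows."""
    # Transpose so that each column becomes one list: [header, v1, v2, ...].
    columns = [list(col) for col in zip_longest(*rows, fillvalue="")]

    encoded = []
    for header, *values in columns:
        cells = [header]
        # groupby() yields one group per run of equal adjacent values;
        # empty cells are dropped before grouping.
        for value, run in groupby(v for v in values if v != ""):
            n = sum(1 for _ in run)
            cells.append(f"{value}({n})" if n > 1 else value)
        encoded.append(cells)

    # Longest column first (stable sort), then transpose back into rows.
    encoded.sort(key=len, reverse=True)
    return [list(row) for row in zip_longest(*encoded, fillvalue="")]
```

Working on one list per column is what makes the counting step a simple left-to-right scan, which is the point of transposing in the first place.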

Answered Feb 12, 2022 at 13:18; edited Feb 13, 2022 at 7:40.

Ed Morton · Comment · 2022-02-12 18:45:03Z

If you change -vOFS to -v OFS, i.e. with a space in between, then that part will work in any awk. But if print NF=--i is trying to reduce the value of NF to strip trailing fields (sorry, I can't tell what the code is doing, and I don't have unexpand or rs on my system to test with, so I'm just guessing based on the --i), then that's undefined behavior: some awks will do that while others will ignore it.

Stéphane Chazelas · Comment · 2022-02-13 07:42:19Z

@Ed I've changed to gawk for now. I'll make it more portable when I have a moment.

guest_7 · Accepted Answer · 2022-02-14 06:09:59Z

We can use Python, with the zip_longest function from the itertools module, to interleave the per-column lists.

python3 -c 'import sys, itertools as it
fs,rs = "\t","\n"
ofs,ors = fs,rs
t = ()
with open(sys.argv[1]) as f:
  for nr,rec in enumerate(f):
    F = rec.rstrip(rs).split(fs)
    if not nr:
      LoL = [[e] for e in F]
    else:
      for idx,el in enumerate(F):
        if el == "": continue
        if not len(t):
          LoL[idx].append(el)
          t = el,
        else:
          p = LoL[idx]
          t = p[-1].replace("(",")").split(")")
          if t[0] == el:
            p[-1] = "%s(%d)" % (el,(int(t[1])+1 if len(t) > 1 else 2))
          else: p.append(el)

  # output
  for tup in it.zip_longest(*sorted(LoL,reverse=True,key=len),fillvalue=""):
    print(*tup,sep=ofs)
' file

Output:

b      c      a      d   e
11     22     11(2)  56  11(4)
44     56     12     89  
56     78(2)  22(2)  91  
60(3)  91                
91(2)  98                
95

The above assumes that, within a column, equal values are always adjacent, i.e. a value never reappears further down after a different value.

If that assumption does not hold, we can use the following approach instead. It combines the ordering of a list, the uniquifying nature of a set, and the list method count, which returns how many times a given element occurs in the list.

python3 -c 'import sys, itertools as it

fs,rs = "\t","\n"
ofs = fs

with open(sys.argv[1]) as f:
  for nr,_ in enumerate(f):
    F = _.rstrip(rs).split(fs)
    if not nr:
      LoL = [[e] for e in F]
    else:
      for i,e in enumerate(F):
        if len(e): LoL[i] += [e]

  for idx in range(len(LoL)):
    s = set(LoL[idx])
    l = []
    for el in LoL[idx]:
      if el in s:
        s -= {el}
        k = LoL[idx].count(el)
        if k > 1: el += f"({k})"
        l += [el]
    LoL[idx] = l

  for t in it.zip_longest(*sorted(LoL,key=len,reverse=True),fillvalue=""):
    print(*t,sep=ofs)
' file
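To make the difference between the two variants concrete, here is a small comparison on a single made-up column in which 11 reappears after 12. The helper names adjacent_counts and total_counts are illustrative only: the first counts runs of equal neighbours (as uniq -c does), the second counts every occurrence regardless of position.

```python
from collections import Counter
from itertools import groupby


def fmt(value, n):
    """Append the count in parentheses only when the value occurs more than once."""
    return f"{value}({n})" if n > 1 else value


def adjacent_counts(values):
    """Count runs of equal neighbours, like `uniq -c`."""
    return [fmt(v, sum(1 for _ in run)) for v, run in groupby(values)]


def total_counts(values):
    """Count every occurrence, listing each value once in first-seen order."""
    counts = Counter(values)  # a dict subclass, so first-seen order is kept (Python 3.7+)
    return [fmt(v, n) for v, n in counts.items()]


column = ["11", "11", "12", "11"]
print(adjacent_counts(column))  # ['11(2)', '12', '11']
print(total_counts(column))     # ['11(3)', '12']
```

On the sample file from the question both functions give the same result, because equal values are always adjacent there.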
