
I need help removing duplicates in the 4th column (num3) of this CSV file, then adding a new column named (count) at the end containing the number of duplicates for every value in that column, then sorting all the rows by the number of duplicates in the new (count) column, using a shell script or Python.

INPUT:

id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122

I need the OUTPUT to be like this:

id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
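For context, the transformation being asked for (count each num3 value, keep one row per value, sort by count descending) can be sketched with just the Python standard library. This is only a sketch against the sample data above; the assumption that the *first* row per num3 is the one to keep is inferred from the expected output, and on a real file you would read from input.csv instead of the embedded string:

```python
import csv
from collections import Counter
from io import StringIO

# sample data from the question; swap StringIO for open("input.csv") on real files
csv_data = """id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122"""

rows = list(csv.reader(StringIO(csv_data)))
header, body = rows[0], rows[1:]

counts = Counter(r[3] for r in body)   # occurrences of each num3 value
first = {}                             # first row seen per num3 value
for r in body:
    first.setdefault(r[3], r)

# sorted() is stable, so rows with equal counts keep their input order
out = sorted(first.values(), key=lambda r: -counts[r[3]])

print(",".join(header + ["count"]))
for r in out:
    print(",".join(r + [str(counts[r[3]])]))
```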
4
  • Please explain the condition to be applied after it is counted by num3. What to keep? Commented May 7 at 8:03
  • To clarify, you have 7 rows with num3=1122. Which of those 7 rows do you keep for the output? They all have different num2. The highest num2? Or the first row with num3=1122? Or what? Edit the question and add the clarification. Commented May 7 at 22:54
  • have you tried this? not good? stackoverflow.com/a/79610054/757714 Commented May 8 at 15:41
  • it works when I added a pipe to the mlr command: "| mlr --csv sort -r count" Commented May 9 at 20:21

4 Answers


For a shell-script approach, can you install a CSV-aware tool that can do this? I work on GoCSV, and it has three subcommands, uniq, rename, and sort, that do this easily:

gocsv uniq -c=num3 -count input.csv \
| gocsv rename -c=Count -names=count \
| gocsv sort -c=count -reverse \
> output.csv
id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
...

The -count flag adds a column named Count, so you'll need rename to change Count to count.

If you cannot install another tool, the following Python script can also accomplish this.

  1. It creates a reader around the input CSV, reads the header, then loops through the rest of the rows, counting occurrences of each num3 value and adding the row to uniqs only if that value hasn't been seen before.
  2. It sorts the uniqs, in reverse, by comparing the count for each row.
  3. It writes the header+count, and loops through the rows writing each plus its count.
import csv
from collections import Counter

i = 3

counts = Counter()
uniqs: list[list[str]] = []

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        num = row[i]

        counts[num] += 1
        if counts[num] > 1:
            continue

        uniqs.append(row)

uniqs.sort(key=lambda row: counts[row[i]], reverse=True)

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header + ["count"])
    for row in uniqs:
        writer.writerow(row + [counts[row[i]]])

1 Comment

I tried gocsv and it works perfectly; I also tried it on Windows. Thanks.

You ask for "sort all the rows due to the number of duplicates in the new column". Note that for the sample output provided, any permutation of the rows where count==1 would meet your specification, so further details are needed if a particular order of such rows is required.

Also, the definition of "duplicates" is underspecified. There are 7 rows where num3 is 1122, but no reason is given why any one of those lines should be considered the "original" (i.e. not a duplicate, and so shown in the output).


Here's an approach using SQL:

$ sqlite3 -csv -header <<'EOD'
.import "file.csv" t
select *, count(num3) as count
    from t
    group by num3
    order by count desc, id asc;
EOD
id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
$
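A note on determinism: with a plain GROUP BY, SQLite is free to take the bare num1/num2 values from any row of a group, but when the query contains a single min() or max() aggregate, the bare columns are taken from the row where that extreme occurs. The sketch below (Python's sqlite3 module, with a hypothetical in-memory table t loaded with the sample data) uses min(id) to pin each group to its lowest-id row:

```python
import sqlite3

rows = [
    (1, 300, 200, 1121), (2, 300, 190, 1122), (3, 300, 180, 1123),
    (4, 300, 170, 1124), (5, 300, 160, 1125), (6, 300, 150, 1126),
    (7, 300, 140, 1127), (8, 300, 130, 1128), (9, 300, 120, 1129),
    (10, 300, 195, 1122), (11, 300, 185, 1122), (12, 300, 175, 1126),
    (13, 300, 165, 1122), (14, 300, 155, 1122), (15, 300, 145, 1122),
    (16, 300, 135, 1122),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INT, num1 INT, num2 INT, num3 INT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# min(id) makes SQLite pick the bare num1/num2 from the lowest-id row
result = con.execute("""
    SELECT min(id) AS id, num1, num2, num3, count(*) AS count
    FROM t
    GROUP BY num3
    ORDER BY count DESC, id ASC
""").fetchall()

for r in result:
    print(",".join(map(str, r)))
```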

With awk and sort:

$ awk -F, '
    { row[NR]=$0; num3[NR]=$NF; ++count[$NF] }
    END {
      print row[1] ",count"
      for (i=2; i<=NR; ++i)
        if (!seen[num3[i]]++)
          print row[i] "," count[num3[i]] | "sort -t, -n -k5r -k1,1"
    }
  ' file.csv
id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
$
  • store all rows and num3 and count of times each num3 was seen
  • after all rows have been read:
    • print header
    • print first row where each particular num3 is found
      • pipe through sort, ordering by count descending and id ascending

If your sort has a -s option, you could replace sort -t, -n -k5r -k1,1 with sort -s -t, -n -k5r to retain the relative input order of rows having the same count.

4 Comments

You are right .. I edited the output now as I described .. thank you for your note & answer .. I've tried your awk answer but it didn't give me the same output as yours .. anyway thanks for your answer.
Check for typos. I get the same output with four different awk implementations, so I'm fairly certain the code is okay. If your real data contains embedded commas inside fields, or a different number of fields, or num3 is not the final field on the line, then you'll need to adjust the program.
sorry, I accidentally broke the sqlite version by moving some dot commands to arguments but mistyping them. should work now
Nvm, thanks for your effort. it works great now.

I would use pandas for data aggregation, as pandas is great at data manipulation. I didn't try for long, but I struggled to stop pandas from treating the first column as an index after the aggregation, and num3 got stuck in the first column. Other than that, the result is what you are looking for, albeit in an order other than the one you specified.

Make the dataframe, aggregate the data, rename and re-order the columns, print it csv style:

Edit: I managed to realize where I was going wrong, and recreated a dataframe with the new csv style and got the columns ordered how you were looking for them. As for the row level sorting, I left that alone and assume it is good enough.

import pandas as pd
from io import StringIO

csv_data = """id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122"""

df = pd.read_csv(StringIO(csv_data)) # make the data frame
# print(df)

# aggregate results 'column : aggregate type'; num3 itself becomes the
# groupby index, so its column slot is reused for the occurrence count
df = df.groupby('num3').agg({ 
    'id' : 'min',
    'num1': 'max',
    'num2': 'max',
    'num3': 'count',
})
df.columns = ['id', 'num1', 'num2', 'num3_count'] # rename the columns
# print(df)

# reorder the columns as they are stored to a csv style
df = df.to_csv(header = True, columns = ['id', 'num1', 'num2', 'num3_count'])

# recreate the dataframe from the new csv
df = pd.read_csv(StringIO(df)) 

# reorder the final output as a csv
df = df.to_csv(index = False, header = True, columns = ['id', 'num1', 'num2', 'num3', 'num3_count'])

print(df) # print the dataframe to console for viewing

Returns:

id,num1,num2,num3,num3_count
1,300,200,1121,1
2,300,195,1122,7
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
6,300,175,1126,2
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
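For what it's worth, the index problem can also be sidestepped entirely with groupby(...).transform, which computes the per-group size without collapsing the frame; drop_duplicates then keeps the first row per num3, and sort_values produces the ordering from the question. A sketch reusing the sample data (variable names here are my own):

```python
import pandas as pd
from io import StringIO

csv_data = """id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122"""

df = pd.read_csv(StringIO(csv_data))
# per-row count of its num3 value, without collapsing the frame
df["count"] = df.groupby("num3")["num3"].transform("size")
# keep the first row seen for each num3, then sort by count (desc), id (asc)
out = (df.drop_duplicates("num3")
         .sort_values(["count", "id"], ascending=[False, True]))
print(out.to_csv(index=False))
```

transform returns a Series aligned with the original index, which is why no renaming or re-reading of the CSV is needed.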



Your goal is not clear: after counting by num3 values, by what criteria should the remaining rows be grouped? In your example it looks like you only keep rows where num2 is a multiple of ten.

If so, using Miller, you can run

mlr --csv count-similar -g num3 then filter '$num2=~"0$"' then sort -r count input.csv 

to get

id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1

1 Comment

it works well after I added "| mlr --csv sort -r count" to this command to sort in descending order, thanks a lot for your effort.
