
I need help removing duplicates in the 4th column (num3) of this CSV file, then adding a new column named (count) at the end containing the number of duplicates for every value in that column, then sorting all the rows by the number of duplicates in the new (count) column, using a shell script or Python.

INPUT:

id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122

I need the OUTPUT to be like this:

id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
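For context, the transformation being asked for (count each num3 value, keep one row per value, sort by count descending) can be sketched with just the Python standard library. This is only a sketch against the sample data above; the assumption that the *first* row per num3 is the one to keep is inferred from the expected output, and on a real file you would read from input.csv instead of the embedded string:

```python
import csv
from collections import Counter
from io import StringIO

# sample data from the question; swap StringIO for open("input.csv") on real files
csv_data = """id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122"""

rows = list(csv.reader(StringIO(csv_data)))
header, body = rows[0], rows[1:]

counts = Counter(r[3] for r in body)   # occurrences of each num3 value
first = {}                             # first row seen per num3 value
for r in body:
    first.setdefault(r[3], r)

# sorted() is stable, so rows with equal counts keep their input order
out = sorted(first.values(), key=lambda r: -counts[r[3]])

print(",".join(header + ["count"]))
for r in out:
    print(",".join(r + [str(counts[r[3]])]))
```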
4
  • Please explain the condition to be applied after it is counted by num3. What to keep? Commented May 7 at 8:03
  • To clarify, you have 7 rows with num3=1122. Which of those 7 rows do you keep for the output? They all have different num2. The highest num2? Or the first row with num3=1122? Or what? Edit the question and add the clarification. Commented May 7 at 22:54
  • have you tried this? not good? stackoverflow.com/a/79610054/757714 Commented May 8 at 15:41
  • it works when I added a pipe to the mlr command: "| mlr --csv sort -r count" Commented May 9 at 20:21

4 Answers


For a shell-script approach, can you install a CSV-aware tool that can do this? I work on GoCSV, and it has three subcommands, uniq, rename, and sort, that do this easily:

gocsv uniq -c=num3 -count input.csv \
| gocsv rename -c=Count -names=count \
| gocsv sort -c=count -reverse \
> output.csv
id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
...

The -count flag adds a column named Count, so you'll need rename to change Count to count.

If you cannot install another tool, the following Python script can also accomplish this.

  1. It creates a reader around the input CSV, reads the header, then loops through the rest of the rows, counting occurrences of each num3 value and adding the row to uniqs only if that value hasn't been seen before.
  2. It sorts the uniqs, in reverse, by comparing the count for each row.
  3. It writes the header+count, and loops through the rows writing each plus its count.
import csv
from collections import Counter

i = 3

counts = Counter()
uniqs: list[list[str]] = []

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)

    for row in reader:
        num = row[i]

        counts[num] += 1
        if counts[num] > 1:
            continue

        uniqs.append(row)

uniqs.sort(key=lambda row: counts[row[i]], reverse=True)

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header + ["count"])
    for row in uniqs:
        writer.writerow(row + [counts[row[i]]])

1 Comment

I tried gocsv and it works perfectly; I also tried it on Windows. Thanks.

You ask for "sort all the rows due to the number of duplicates in the new column". Note that for the sample output provided, any permutation of the rows where count==1 would meet your specification, so further details are needed if a particular order of such rows is required.

Also, the definition of "duplicates" is underspecified. There are 7 rows where num3 is 1122, but no reason is given why any one of those lines should be considered the "original" (i.e. not a duplicate, and so shown in the output).


Here's an approach using SQL:

$ sqlite3 -csv -header <<'EOD'
.import "file.csv" t
select *, count(num3) as count
    from t
    group by num3
    order by count desc, id asc;
EOD
id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
$
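A note on determinism: with a plain GROUP BY, SQLite is free to take the bare num1/num2 values from any row of a group, but when the query contains a single min() or max() aggregate, the bare columns are taken from the row where that extreme occurs. The sketch below (Python's sqlite3 module, with a hypothetical in-memory table t loaded with the sample data) uses min(id) to pin each group to its lowest-id row:

```python
import sqlite3

rows = [
    (1, 300, 200, 1121), (2, 300, 190, 1122), (3, 300, 180, 1123),
    (4, 300, 170, 1124), (5, 300, 160, 1125), (6, 300, 150, 1126),
    (7, 300, 140, 1127), (8, 300, 130, 1128), (9, 300, 120, 1129),
    (10, 300, 195, 1122), (11, 300, 185, 1122), (12, 300, 175, 1126),
    (13, 300, 165, 1122), (14, 300, 155, 1122), (15, 300, 145, 1122),
    (16, 300, 135, 1122),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INT, num1 INT, num2 INT, num3 INT)")
con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

# min(id) makes SQLite pick the bare num1/num2 from the lowest-id row
result = con.execute("""
    SELECT min(id) AS id, num1, num2, num3, count(*) AS count
    FROM t
    GROUP BY num3
    ORDER BY count DESC, id ASC
""").fetchall()

for r in result:
    print(",".join(map(str, r)))
```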

With awk and sort:

$ awk -F, '
    { row[NR]=$0; num3[NR]=$NF; ++count[$NF] }
    END {
      print row[1] ",count"
      for (i=2; i<=NR; ++i)
        if (!seen[num3[i]]++)
          print row[i] "," count[num3[i]] | "sort -t, -n -k5r -k1,1"
    }
  ' file.csv
id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
$
  • store all rows and num3 and count of times each num3 was seen
  • after all rows have been read:
    • print header
    • print first row where each particular num3 is found
      • pipe through sort, ordering by count descending and id ascending

If your sort has a -s option, you could replace sort -t, -n -k5r -k1,1 with sort -s -t, -n -k5r to retain the relative input order of rows having the same count.

4 Comments

You are right .. I edited the output now as I described .. thank you for your note & answer .. I've tried your awk answer but it didn't give me the same output as yours .. anyway thanks for your answer.
Check for typos. I get the same output with four different awk implementations, so I'm fairly certain the code is okay. If your real data contains embedded commas inside fields, or a different number of fields, or num3 is not the final field on the line, then you'll need to adjust the program.
sorry, I accidentally broke the sqlite version by moving some dot commands to arguments but mistyping them. should work now
Nvm, thanks for your effort. it works great now.

I would use pandas for data aggregation, as pandas is great at data manipulation. I didn't try for long, but I struggled to stop pandas from treating the first column as an index after the aggregation, and num3 got stuck in the first column. Other than that, the result is what you are looking for, albeit in an order other than the one you specified.

Make the dataframe, aggregate the data, rename and re-order the columns, print it csv style:

Edit: I managed to realize where I was going wrong, and recreated a dataframe with the new csv style and got the columns ordered how you were looking for them. As for the row level sorting, I left that alone and assume it is good enough.

import pandas as pd
from io import StringIO

csv_data = """id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122"""

df = pd.read_csv(StringIO(csv_data)) # make the data frame
# print(df)

# aggregate results 'column : aggregate type'; num3 itself becomes the
# groupby index, so its column slot is reused for the occurrence count
df = df.groupby('num3').agg({ 
    'id' : 'min',
    'num1': 'max',
    'num2': 'max',
    'num3': 'count',
})
df.columns = ['id', 'num1', 'num2', 'num3_count'] # rename the columns
# print(df)

# reorder the columns as they are stored to a csv style
df = df.to_csv(header = True, columns = ['id', 'num1', 'num2', 'num3_count'])

# recreate the dataframe from the new csv
df = pd.read_csv(StringIO(df)) 

# reorder the final output as a csv
df = df.to_csv(index = False, header = True, columns = ['id', 'num1', 'num2', 'num3', 'num3_count'])

print(df) # print the dataframe to console for viewing

Returns:

id,num1,num2,num3,num3_count
1,300,200,1121,1
2,300,195,1122,7
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
6,300,175,1126,2
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1
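For what it's worth, the index problem can also be sidestepped entirely with groupby(...).transform, which computes the per-group size without collapsing the frame; drop_duplicates then keeps the first row per num3, and sort_values produces the ordering from the question. A sketch reusing the sample data (variable names here are my own):

```python
import pandas as pd
from io import StringIO

csv_data = """id,num1,num2,num3
1,300,200,1121
2,300,190,1122
3,300,180,1123
4,300,170,1124
5,300,160,1125
6,300,150,1126
7,300,140,1127
8,300,130,1128
9,300,120,1129
10,300,195,1122
11,300,185,1122
12,300,175,1126
13,300,165,1122
14,300,155,1122
15,300,145,1122
16,300,135,1122"""

df = pd.read_csv(StringIO(csv_data))
# per-row count of its num3 value, without collapsing the frame
df["count"] = df.groupby("num3")["num3"].transform("size")
# keep the first row seen for each num3, then sort by count (desc), id (asc)
out = (df.drop_duplicates("num3")
         .sort_values(["count", "id"], ascending=[False, True]))
print(out.to_csv(index=False))
```

transform returns a Series aligned with the original index, which is why no renaming or re-reading of the CSV is needed.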



Your goal is not clear: after counting by num3 values, by what criteria should the remaining rows be grouped? In your example it looks like you only keep rows where num2 is a multiple of ten.

If so, using Miller, you can run

mlr --csv count-similar -g num3 then filter '$num2=~"0$"' then sort -r count input.csv 

to get

id,num1,num2,num3,count
2,300,190,1122,7
6,300,150,1126,2
1,300,200,1121,1
3,300,180,1123,1
4,300,170,1124,1
5,300,160,1125,1
7,300,140,1127,1
8,300,130,1128,1
9,300,120,1129,1

1 Comment

it works well after I added "| mlr --csv sort -r count" to this command to sort in descending order, thanks a lot for your effort.
