
I'm trying to remove duplicates in a list of Jira tickets that follow this syntax:

XXXX-12345: a description

where 12345 is a pattern like [0-9]+ and the XXXX is constant. For example, the following list:

XXXX-1111: a description
XXXX-2222: another description
XXXX-1111: yet another description

should get cleaned up like this:

XXXX-1111: a description
XXXX-2222: another description

I've been trying with sed, but while what I had worked on Mac, it didn't on Linux. I think it'd be easier with awk, but I'm not an expert in either of them.

I tried:

sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D' file
  • Replacing $0 with $1 in the accepted answer to this related question should do the trick. – Thor
  • Can you show your attempted code? – anubhava
  • @Thor Thanks! It worked. Could you explain the command to me, please? I understand the idea behind using awk '!seen', but I don't understand why $1, or how it identifies the pattern in my use case. – JuanVega
  • @JuanVega: awk splits each line into fields according to what FS is set to; it defaults to sequences of spaces and tabs. This splitting sets the positional variables $1, $2, ... accordingly, so $1 is the first field, up to the first space/tab (see the short illustration after these comments). – Thor
  • @anubhava I was trying to use sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D', which I adapted from another answer that deletes duplicated lines; in the original answer there was (.*) instead of XXXX-[0-9]+. But clearly I don't understand how it works, because it doesn't work here. – JuanVega
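
To make that concrete with one of the sample lines from the question:

echo 'XXXX-1111: a description' | awk '{print $1}'

prints XXXX-1111: (note the trailing colon), which is the key that the awk '!seen[$1]++' approach below ends up deduplicating on.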

3 Answers


This simple awk should produce the output you want:

awk '!seen[$1]++' file

XXXX-1111: a description
XXXX-2222: another description
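
Spelled out with comments, an equivalent but more verbose form of the same one-liner:

awk '
  # seen[$1]++ yields the current count for the key $1 (e.g. "XXXX-1111:")
  # and then increments it: 0 (false) the first time a key appears, >0 afterwards.
  # Negating it makes the condition true only on the first occurrence, and a
  # true condition with no action block prints the whole line.
  !seen[$1]++
' file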

1 Comment

Yes, I ended up using that one as also suggested by @Thor. Thanks!

If the digits are the only thing defining a dup, you could do:

awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1' file
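
For example, piping in two hypothetical lines where the number repeats but the prefix differs:

printf 'XXXX-1111: a description\nYYYY-1111: another description\n' |
  awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1'

prints only the first line, because the key used for deduplication is arr[2] (the 1111 part), not the prefix.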

If the XXXX is always the same, you can simplify to:

awk -F: '!seen[$1]++' file

Either prints:

XXXX-1111: a description
XXXX-2222: another description

1 Comment

Thanks! I'll keep that first one in mind in case the characters end up changing at some point.

This might work for you (GNU sed):

sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file
  • -nE turns on explicit printing (no automatic output) and extended regexps.
  • G appends the hold space (the unique lines seen so far) to the current line.
  • /^([^:]*:).*\n\1/d if the current line's key already occurs in what was appended, delete it (so nothing is printed for it).
  • P otherwise, print the current line and
  • h copy the pattern space (the current line plus the previously seen unique lines) back to the hold space.
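
As a quick check, running it against the sample input from the question (assuming it is saved as file):

sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file

prints:

XXXX-1111: a description
XXXX-2222: another description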

N.B. Your sed solution would work (not as is, but with some tweaking), though only if the file(s) were sorted by the key:

sed -E 'N;/^([^:]*:).*\n\1/!P;D' file
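
For instance, you could sort first and pipe the result through it; note that this variant keeps the last line of each duplicate group rather than the first:

sort file | sed -E 'N;/^([^:]*:).*\n\1/!P;D'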

3 Comments

I didn't add the code, but yes, I sorted the lines first before running my (non-working) solution. I'm curious: is the solution you propose the tweaking I would need? I'm not an expert on regular expressions, so what exactly does that regex do to use only the XXXX-1234 part in the comparison?
Thanks for the explanation!
@JuanVega in regexps you can group matching parts by enclosing them in parens. You can then refer to these groups with back references, which are numbered starting from the left-most paren. E.g. /(aaa)(bbb)\1\2/ would match the string aaabbbaaabbb, and /((aaa)bbb)\1\2/ would match aaabbbaaabbbaaa. Thus the regexp /^([^:]*:).*\n\1/ matches when the same key occurs twice, and in the solution above that line is deleted. HTH. BTW the first solution works whether the input is sorted or not; the second only when it is sorted.
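
A small illustration of the back references described above (GNU sed, printing only lines the pattern matches):

echo 'aaabbbaaabbb' | sed -nE '/(aaa)(bbb)\1\2/p'

prints the line, because \1 and \2 match a second copy of what the two groups captured.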
