
I'm trying to remove duplicates in a list of Jira tickets that follow this syntax:

XXXX-12345: a description

where 12345 is a pattern like [0-9]+ and the XXXX is constant. For example, the following list:

XXXX-1111: a description
XXXX-2222: another description
XXXX-1111: yet another description

should get cleaned up like this:

XXXX-1111: a description
XXXX-2222: another description

I've been trying with sed, but while what I had worked on Mac, it didn't on Linux. I think it'd be easier with awk, but I'm not an expert in either of them.

I tried:

sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D' file
  • Replacing $0 with $1 in the accepted answer to this related question should do the trick. – Thor
  • Can you show your attempted code? – anubhava
  • @Thor Thanks! It worked. Could you explain the command to me, please? I understand the idea behind using awk '!seen', but I don't understand why $1, or how it identifies the pattern in my use case. – JuanVega
  • @JuanVega: awk splits each line into fields according to what FS is set to; it defaults to sequences of spaces and tabs. This splitting sets the positional variables $1, $2, ... accordingly, so $1 is the first field, up to the first space/tab (see the short illustration after these comments). – Thor
  • @anubhava I was trying to use sed -r '$!N; /^XXXX-[0-9]+\n\1/!P; D', which I adapted from another answer that deletes duplicated lines; in the original answer there was (.*) instead of XXXX-[0-9]+. But clearly I don't understand how it works, because it doesn't work here. – JuanVega
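
To make that concrete with one of the sample lines from the question:

echo 'XXXX-1111: a description' | awk '{print $1}'

prints XXXX-1111: (note the trailing colon), which is the key that the awk '!seen[$1]++' approach below ends up deduplicating on.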

3 Answers


This simple awk should produce the output you want:

awk '!seen[$1]++' file

XXXX-1111: a description
XXXX-2222: another description
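
Spelled out with comments, an equivalent but more verbose form of the same one-liner:

awk '
  # seen[$1]++ yields the current count for the key $1 (e.g. "XXXX-1111:")
  # and then increments it: 0 (false) the first time a key appears, >0 afterwards.
  # Negating it makes the condition true only on the first occurrence, and a
  # true condition with no action block prints the whole line.
  !seen[$1]++
' file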

1 Comment

Yes, I ended up using that one as also suggested by @Thor. Thanks!

If the digits are the only thing defining a dup, you could do:

awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1' file
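
For example, piping in two hypothetical lines where the number repeats but the prefix differs:

printf 'XXXX-1111: a description\nYYYY-1111: another description\n' |
  awk -F: '{split($1,arr,/-/); if (seen[arr[2]]++) next} 1'

prints only the first line, because the key used for deduplication is arr[2] (the 1111 part), not the prefix.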

If the XXXX is always the same, you can simplify to:

awk -F: '!seen[$1]++' file

Either prints:

XXXX-1111: a description
XXXX-2222: another description

1 Comment

Thanks! I'll keep that first one in mind in case the characters end up changing at some point.

This might work for you (GNU sed):

sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file
  • -nE turns on explicit printing (no automatic output) and extended regexps.
  • G appends the hold space (the unique lines seen so far) to the current line.
  • /^([^:]*:).*\n\1/d if the current line's key already occurs in what was appended, delete it (so nothing is printed for it).
  • P otherwise, print the current line and
  • h copy the pattern space (the current line plus the previously seen unique lines) back to the hold space.
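
As a quick check, running it against the sample input from the question (assuming it is saved as file):

sed -nE 'G;/^([^:]*:).*\n\1/d;P;h' file

prints:

XXXX-1111: a description
XXXX-2222: another description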

N.B. Your sed solution would work (not as is, but with some tweaking), though only if the file(s) were sorted by the key:

sed -E 'N;/^([^:]*:).*\n\1/!P;D' file
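
For instance, you could sort first and pipe the result through it; note that this variant keeps the last line of each duplicate group rather than the first:

sort file | sed -E 'N;/^([^:]*:).*\n\1/!P;D'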

3 Comments

I didn't add the code, but yes, I sorted the lines first before running my (non-working) solution. I'm curious: is the solution you propose the tweaking I would need? I'm not an expert on regular expressions, so what exactly does that regex do to use only the XXXX-1234 part in the comparison?
Thanks for the explanation!
@JuanVega in regexps you can group matching parts by enclosing them in parens. You can then refer to these groups with back references, which are numbered starting from the left-most paren. E.g. /(aaa)(bbb)\1\2/ would match the string aaabbbaaabbb, and /((aaa)bbb)\1\2/ would match aaabbbaaabbbaaa. Thus the regexp /^([^:]*:).*\n\1/ matches when the same key occurs twice, and in the solution above that line is deleted. HTH. BTW the first solution works whether the input is sorted or not; the second only when it is sorted.
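
A small illustration of the back references described above (GNU sed, printing only lines the pattern matches):

echo 'aaabbbaaabbb' | sed -nE '/(aaa)(bbb)\1\2/p'

prints the line, because \1 and \2 match a second copy of what the two groups captured.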
