
I was trying to remove lines containing a duplicate substring from a line-by-line text file, e.g.:

A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}
A {id: "x" p {id: "da" v: "i4"} on:faer"}
A {id: "y" p {id: "werw" v: "i4"} on:asee"}
A {id: "y" p {id: "werw" v: "i4"} on:asee"}

The output should contain only one line per A id, so it should be:

A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}

The problem I ran into is that I don't know how to sort and deduplicate on a substring only. I tried:

cat input.txt | grep 'A\s\{id:\s\"[^;]*\sp\s\{id:' | sort -u > output.txt

But it doesn't remove duplicates based on the substring; it only removes lines that are exactly identical to other lines. So it only removed:

A {id: "y" p {id: "werw" v: "i4"} on:asee"}

which appeared twice as the last two lines, but it didn't remove:

A {id: "y" p {id: "wse" v: "i4"} on:ue"}

which has a duplicate id but different content.

3 Answers


An awk solution:

$ awk '!a[$3]++' file
A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}

Combining it with the matching from your grep command:

$ awk '$1=="A" && $2=="{id:" && $4=="p" && $5=="{id:" && !a[$3]++' file
A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}
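Why this works (a sketch, assuming POSIX awk): a[$3] starts at 0 (false) for an id that has not been seen, so !a[$3]++ is true on the first occurrence and false on every repeat, and a pattern with no action prints the line. A minimal two-line demo:

```shell
# Two lines sharing id "x" (field 3): the first prints, the repeat is dropped.
printf '%s\n' \
  'A {id: "x" p {id: "vcv" v: "i4"} on:taf"}' \
  'A {id: "x" p {id: "da" v: "i4"} on:faer"}' \
  | awk '!a[$3]++'
```

Unlike sort -u, this keeps the lines in their original input order.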

The problem is that sort uses the entire line as the key by default, so it only eliminates identical lines.

Try changing

sort -u

to

sort -uk3,3

to eliminate duplicates where the key is the 3rd field. Fields are separated by whitespace.

-k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)

POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.

Reference.
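A sketch of this on the question's sample (assuming GNU sort), with two caveats: the output comes back sorted by the key rather than in input order, and when several lines share a key, which one survives is not guaranteed, so it may keep a different "y" line than the awk approach does.

```shell
# Recreate the sample input from the question.
printf '%s\n' \
  'A {id: "x" p {id: "vcv" v: "i4"} on:taf"}' \
  'A {id: "y" p {id: "wse" v: "i4"} on:ue"}' \
  'A {id: "z" p {id: "das" v: "i4"} on:tade"}' \
  'A {id: "x" p {id: "da" v: "i4"} on:faer"}' \
  'A {id: "y" p {id: "werw" v: "i4"} on:asee"}' \
  'A {id: "y" p {id: "werw" v: "i4"} on:asee"}' > input.txt
# Keep one line per 3rd whitespace-separated field ("x", "y", "z").
# The result is sorted by that field; which duplicate survives is unspecified.
sort -u -k3,3 input.txt
```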

8 Comments

I used a command like this: cat ~/lexicon.clex | sort -u -k 1 20 > ~/output.clex, but it said "sort: open failed: 20: No such file or directory"
@QingshanZhang I think it was a syntax error on my side; try adding a comma between 1 and 10, see the edit.
@Dukeling the syntax error was caused by the missing comma, but this doesn't do what the OP needs.
Yes, that runs now, but it didn't remove the duplicate substrings; the awk one works! Thank you so much for the help anyway. I don't know who down-voted this, but I up-voted to balance it. :)
@sudo_O I don't see the problem (not that I have a Linux machine to test this on), except that maybe grep's output includes lines in a format other than those listed (which I suppose is quite possible given that regex).

A Perl solution:

perl -ne 'if (/\{id: "([^"]+)"/ and not exists $h{$1}) { $h{$1}++; print }' file

It stores the matched ids in a hash and prints a line only if its id was not already in the hash.

1 Comment

Sorry, I'm not familiar with Perl either, but thank you for the help. :)
