
I was trying to remove lines containing a duplicate substring from a line-by-line text file, e.g.:

A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}
A {id: "x" p {id: "da" v: "i4"} on:faer"}
A {id: "y" p {id: "werw" v: "i4"} on:asee"}
A {id: "y" p {id: "werw" v: "i4"} on:asee"}

The output should contain only one line per A id, so it should be:

A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}

The problem I ran into is that I don't know how to sort and deduplicate on a substring only. I tried:

cat input.txt | grep 'A\s\{id:\s\"[^;]*\sp\s\{id:' | sort -u > output.txt

But it doesn't remove duplicates based on the substring; it only removes lines that are exactly identical to other lines. So it only removed:

A {id: "y" p {id: "werw" v: "i4"} on:asee"}

which appeared twice as the last two lines, but it didn't remove:

A {id: "y" p {id: "wse" v: "i4"} on:ue"}

which has a duplicate id but different content.

3 Answers


An awk solution:

$ awk '!a[$3]++' file
A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}

Combining it with the matching from your grep command:

$ awk '$1=="A" && $2=="{id:" && $4=="p" && $5=="{id:" && !a[$3]++' file
A {id: "x" p {id: "vcv" v: "i4"} on:taf"}
A {id: "y" p {id: "wse" v: "i4"} on:ue"}
A {id: "z" p {id: "das" v: "i4"} on:tade"}
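Why this works (a sketch, assuming POSIX awk): a[$3] starts at 0 (false) for an id that has not been seen, so !a[$3]++ is true on the first occurrence and false on every repeat, and a pattern with no action prints the line. A minimal two-line demo:

```shell
# Two lines sharing id "x" (field 3): the first prints, the repeat is dropped.
printf '%s\n' \
  'A {id: "x" p {id: "vcv" v: "i4"} on:taf"}' \
  'A {id: "x" p {id: "da" v: "i4"} on:faer"}' \
  | awk '!a[$3]++'
```

Unlike sort -u, this keeps the lines in their original input order.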

The problem is that sort uses the entire line as the key by default, so it only eliminates identical lines.

Try changing

sort -u

to

sort -uk3,3

to eliminate duplicates where the key is the 3rd field. Fields are separated by whitespace.

-k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)

POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.

Reference.
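A sketch of this on the question's sample (assuming GNU sort), with two caveats: the output comes back sorted by the key rather than in input order, and when several lines share a key, which one survives is not guaranteed, so it may keep a different "y" line than the awk approach does.

```shell
# Recreate the sample input from the question.
printf '%s\n' \
  'A {id: "x" p {id: "vcv" v: "i4"} on:taf"}' \
  'A {id: "y" p {id: "wse" v: "i4"} on:ue"}' \
  'A {id: "z" p {id: "das" v: "i4"} on:tade"}' \
  'A {id: "x" p {id: "da" v: "i4"} on:faer"}' \
  'A {id: "y" p {id: "werw" v: "i4"} on:asee"}' \
  'A {id: "y" p {id: "werw" v: "i4"} on:asee"}' > input.txt
# Keep one line per 3rd whitespace-separated field ("x", "y", "z").
# The result is sorted by that field; which duplicate survives is unspecified.
sort -u -k3,3 input.txt
```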

8 Comments

I used a command like this: cat ~/lexicon.clex | sort -u -k 1 20 > ~/output.clex, but it said "sort: open failed: 20: No such file or directory"
@QingshanZhang I think it was a syntax error on my side; try adding a comma between 1 and 10, see the edit.
@Dukeling the syntax error was caused by the missing comma, but this doesn't do what the OP needs.
Yes, that runs now, but it didn't remove the duplicate substrings; the awk one works! Thank you so much for the help anyway. I don't know who down-voted this, but I up-voted to balance it. :)
@sudo_O I don't see the problem (not that I have a Linux machine to test this on), except that maybe grep's output includes lines in a format other than those listed (which I suppose is quite possible given that regex).

A Perl solution:

perl -ne 'if (/\{id: "([^"]+)"/ and not exists $h{$1}) { $h{$1}++; print }' file

It stores the matched ids in a hash and prints a line only if its id was not already in the hash.

1 Comment

Sorry, I'm not familiar with Perl either, but thank you for the help. :)
