Sort -u without sorting but with better uniqueness? [duplicate]

Question

I don't want to sort my file, just filter out duplicate lines while keeping the original order. Is there a way to use sort's unique function without its sort function (something like what cat -u would give if it existed)? Just using uniq without sort does nothing worthwhile, because uniq only compares adjacent lines, so the file has to be sorted first.
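To make the adjacency point concrete, here is a minimal sketch on a made-up three-line input. The `sample` function is hypothetical and has nothing to do with the pastebin file used further down.

```shell
# sample: hypothetical input where "a" appears twice, but not on adjacent lines.
sample() { printf 'a\nb\na\n'; }

# uniq compares each line only with the line before it,
# so both "a" lines survive: prints a, b, a.
sample | uniq

# Sorting first makes the duplicates adjacent, so uniq can drop one,
# but the original order is lost: prints a, b.
sample | sort | uniq
```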

Also, incidentally, what in hell is the difference between uniq and uniq --unique? Here are commands on a random file from pastebin:

wget -qO - http://pastebin.com/0cSPs9LR | wc -l
350
wget -qO - http://pastebin.com/0cSPs9LR | sort -u | wc -l
287
wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq | wc -l
287
wget -qO - http://pastebin.com/0cSPs9LR | sort | uniq -u | wc -l
258

In summary:

How do I filter duplicates greedily without sorting?
How is uniq not unique enough that there is also uniq --unique?

p.s. This question looks like a duplicate of the following q's, but it isn't:

Don't use sort or uniq at all. And "How is uniq not unique enough that there is also uniq --unique?" really should be a separate question.
– muru, Commented Jun 18, 2015 at 10:06
The solution on the duplicate page has an in-bash solution using awk. Suits me. As for the separate question, I just posted it here: unix.stackexchange.com/questions/210528/…
– enfascination, Commented Jun 18, 2015 at 10:21

Sobrique · Accepted Answer · 2015-06-18 10:12:43Z

I'd use perl and a hash.

Something like:

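#!/usr/bin/perl

use strict;
use warnings;

# Lines already printed, keyed by the full line text.
my %seen;

# Read stdin, or the files named on the command line, one line at a time.
while ( <> ) {
    # Post-increment returns the old count: 0 (false) the first time a line appears.
    print unless $seen{$_}++;
}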

I think this'd one-liner-ify as:

perl -ne 'print unless $seen{$_}++' data.txt

(Or cat data into it).
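Since awk came up in the comments, the same seen-hash idea can also be sketched in awk. This is an illustration rather than part of the Perl answer above: `dedup_lines` is a hypothetical helper name and the input is made up.

```shell
# dedup_lines: hypothetical helper; prints each distinct line the first time it appears.
# seen[$0]++ evaluates to 0 (false) on first sight, so !seen[$0]++ is true only then.
dedup_lines() {
    awk '!seen[$0]++' "$@"
}

# Made-up input: the second "b" and the second "a" are dropped,
# and the order of first appearance is kept: prints b, a, c.
printf 'b\na\nb\nc\na\n' | dedup_lines
```

Like the Perl version, this holds every distinct line in memory, so it suits files whose set of unique lines fits in RAM.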

This compares whole lines. You can also use split or a regular expression to compare only part of each line.

E.g.

my %seen;

while ( <> ) {
    my @fields = split ( ";" );          # split the current line ($_) on semicolons
    print unless $seen{$fields[4]}++;    # key on the fifth field only (index 4)
}

This splits each line into fields on ; and compares only the fifth field (index 4, because array indices start at zero).
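A rough awk counterpart of the field-based filter, run on made-up records, might look like the sketch below. `dedup_by_field5` is a hypothetical name, and the semicolon-separated layout is assumed only for illustration.

```shell
# dedup_by_field5: hypothetical helper; keeps the first line seen for each value of field 5.
# awk numbers fields from 1, so $5 here corresponds to $fields[4] in the Perl snippet.
dedup_by_field5() {
    awk -F';' '!seen[$5]++' "$@"
}

# Made-up records: records 1 and 3 share "x" in the fifth field,
# so record 3 is dropped and records 1 and 2 are printed.
printf '1;a;b;c;x;z\n2;a;b;c;y;z\n3;q;r;s;x;z\n' | dedup_by_field5
```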
