remove duplicate lines based on specific strings

Question

How can I remove duplicate lines based on specific strings or characters?

For example, I have a file which contains the following:

https://example.com/?first=one&second=two&third=three
https://example.com/?first=only&second=cureabout&third=theparam
https://example.com/?fourth=four&fifth=five
https://stack.com/?sixth=six&seventh=seven&eighth=eight
https://stack.com/?sixth=itdoesnt&seventh=matter&eighth=something

I want to make the lines unique based on their parameter names, printing only one URL for each set of parameters, while still distinguishing between domains. The values are not important.

The desired result:

https://example.com/?first=one&second=two&third=three
https://stack.com/?sixth=six&seventh=seven&eighth=eight

UPDATE

In the following code I'm trying to grep the 3 characters before each "=", and if a line contains those characters, make the lines unique and print the result. The actual goal is to make the file unique when URLs share a certain number of similar parameters.

for url in $(cat $1); do

    # COUNT NUMBER OF EQUAL CHARACTER "="
    count_eq=$(echo $url | sed "s/=/=\n/g" | grep -a -c '=')
    if [[ $count_eq == "3" ]]; then

        # GREP 3 CHARACTERS BEFORE "="
        same_param=$(printf $url | grep -o -P '.{0,3}=.{0,0}' | sort -u)
    
        if [[ $url == *"$same_param"* ]];then
            sort -u "$url" | printf "$url\n"
        fi
    fi

done

Thanks.

Please add your efforts (in the form of code) to your question. This is highly encouraged on SO and helps avoid downvotes and close votes. Thank you.
– RavinderSingh13, Commented Apr 8, 2021 at 6:54

Bajajsahab · Accepted Answer · 2021-04-08 07:07:25Z

1

You can try the code below:

awk '!a[$0]++' file

It simply checks whether a line is already present in the array a, and prints the line only if it is not.
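For readers less familiar with awk, the idiom can be sketched in Python roughly as follows. This is only an illustration; the function name `unique_lines` is made up for the example. Note that it compares whole lines, so two URLs that differ only in their parameter values are both kept.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, like awk '!a[$0]++'."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    # Sample data: the first two lines are exact duplicates,
    # the third differs only in its parameter values.
    sample = [
        "https://example.com/?first=one&second=two&third=three",
        "https://example.com/?first=one&second=two&third=three",
        "https://example.com/?first=only&second=cureabout&third=theparam",
    ]
    for line in unique_lines(sample):
        print(line)
```

Only the exact duplicate is dropped here, which is why whole-line deduplication alone does not give the result asked for in the question.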


2 Comments

sof31 Over a year ago

I just updated my post, could you re-check it?

Bajajsahab Over a year ago

@sof31, it will still work. Please check and let me know.

hek2mgl · Accepted Answer · 2021-04-08 08:49:16Z

0

A two-step approach might be the simplest to understand.

First, print out the values of first and second, along with the URLs:

< a.txt awk -F'[?&]' '{split("",p);for(i=2;i<=NF;i++){split($i,a,"=");p[a[1]]=a[2]};
                      print $1" "p["first"]" "p["second"]}'
https://example.com/ one two
https://example.com/ one two
https://example.com/ one two
https://stack.com/ one two
https://stack.com/ one two

Now change the print statement at the end, and turn it into a filter:

< a.txt awk -F'[?&]' '{split("",p);for(i=2;i<=NF;i++){split($i,a,"=");p[a[1]]=a[2]}}
                      !seen[$1""p["first"]""p["second"]]++'
https://example.com/?first=one&second=two&third=three
https://stack.com/?sixth=six&seventh=seven&eighth=eight
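For comparison, here is a rough Python sketch of the same filter. It is not part of the awk solution; the function name `filter_by_params` and its `names` argument are made up for illustration. It splits each line on `?` and `&`, builds a fresh parameter table for every line, and keeps a line only when its combination of base URL and selected parameter values has not been seen yet.

```python
import re


def filter_by_params(lines, names=("first", "second")):
    """Keep the first line for each (base URL, selected parameter values) key."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        # Same field separator as awk -F'[?&]': fields[0] is the part before "?"
        fields = re.split(r"[?&]", line)
        params = {}
        for field in fields[1:]:
            name, _, value = field.partition("=")
            params[name] = value
        # Missing parameters count as empty strings, as they would in awk
        key = (fields[0],) + tuple(params.get(n, "") for n in names)
        if key not in seen:
            seen.add(key)
            yield line
```

Like the awk version, this keys on the values of the named parameters, so it only fits when you know in advance which parameters matter.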

In comments you asked for a generic solution, which takes into account every parameter, not just first and second.

I would use Python for this:

#!/usr/bin/python3

# test.py

import sys
from urllib.parse import urlparse, parse_qsl

seen = {}
for line in sys.stdin:
    url = urlparse(line.strip())
    # create a search lookup of sorted parameters, scheme and domain
    sorted_params = sorted(parse_qsl(url.query), key=lambda x:x[0])
    check_str = '{}://{}?{}'.format(
        url.scheme,
        url.netloc,
        '&'.join(['='.join(p) for p in sorted_params]),
    )
    # check if this combination of parameters and values has been seen before
    if check_str not in seen:
        seen[check_str] = 1
        print(line.strip())

Run it like this:

< input.file python3 test.py
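The script above treats two URLs as duplicates only when both parameter names and values match. Since the question says the values are not important, a variant that keys on the parameter names alone might look like the sketch below. The helper names `param_signature` and `unique_by_param_names` are made up for this example, and passing `keep_blank_values=True` is an assumption that a parameter with an empty value should still count.

```python
from urllib.parse import urlparse, parse_qsl


def param_signature(url_text):
    """Return (scheme, host, sorted parameter names) for a URL string."""
    url = urlparse(url_text)
    # keep_blank_values=True so that "a=" still counts as parameter "a"
    names = sorted(name for name, _ in parse_qsl(url.query, keep_blank_values=True))
    return (url.scheme, url.netloc, tuple(names))


def unique_by_param_names(lines):
    """Yield the first URL seen for each scheme/host/parameter-name combination."""
    seen = set()
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        signature = param_signature(text)
        if signature not in seen:
            seen.add(signature)
            yield text


if __name__ == "__main__":
    # Sample data taken from the question
    sample = [
        "https://example.com/?first=one&second=two&third=three",
        "https://example.com/?first=only&second=cureabout&third=theparam",
        "https://example.com/?fourth=four&fifth=five",
        "https://stack.com/?sixth=six&seventh=seven&eighth=eight",
        "https://stack.com/?sixth=itdoesnt&seventh=matter&eighth=something",
    ]
    for url in unique_by_param_names(sample):
        print(url)
```

`unique_by_param_names` accepts any iterable of lines, so `sys.stdin` can be passed in place of the sample list. With the sample from the question this keeps the `fourth`/`fifth` URL as well, because its parameter names differ from the others. The update in the question suggests that only URLs with exactly three parameters are wanted; if so, an extra check on the number of names in the signature would be needed.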

edited Apr 8, 2021 at 8:49

answered Apr 8, 2021 at 7:36


6 Comments

sof31 Over a year ago

I have two problems with this code: 1. Assume that there are several lines of URLs with different parameters. In the code above I have to specify the parameters first and second explicitly. 2. It removes the = and the parameter behind it. I want the whole URL to be printed, like: `https://example.com/?first=one&second=two&third=three`

hek2mgl Over a year ago

Make sure to run the code as I've posted it, and use the final, second version at the bottom.

sof31 Over a year ago

I read your update late, so consider the first problem. Thanks.

hek2mgl Over a year ago

In your question you explicitly mentioned first and second. For a generic solution, taking into account all possible url parameters, I would suggest something other than awk, like Python. Are you ok with Python?

sof31 Over a year ago

I don't know much Python; I just want to get the job done, no matter which language is used.
