0

How can I remove duplicate lines based on specific strings or characters?

For example, I have a file which contains the following:

https://example.com/?first=one&second=two&third=three
https://example.com/?first=only&second=cureabout&third=theparam
https://example.com/?fourth=four&fifth=five
https://stack.com/?sixth=six&seventh=seven&eighth=eight
https://stack.com/?sixth=itdoesnt&seventh=matter&eighth=something

I want to make the lines unique based on their parameter names: for each domain, print only one URL per set of parameter names. The values are not important.

The desired result:

https://example.com/?first=one&second=two&third=three
https://stack.com/?sixth=six&seventh=seven&eighth=eight

UPDATE

In the following code I'm trying to grep the 3 characters before each = and, if other lines contain the same characters, keep only one of those lines and print the result. The actual goal is to make the file unique when lines share a certain number of similar parameters.

for url in $(cat $1); do

    # COUNT NUMBER OF EQUAL CHARACTER "="
    count_eq=$(echo $url | sed "s/=/=\n/g" | grep -a -c '=')
    if [[ $count_eq == "3" ]]; then

        # GREP 3 CHARACTERS BEFORE "="
        same_param=$(printf $url | grep -o -P '.{0,3}=.{0,0}' | sort -u)
    
        if [[ $url == *"$same_param"* ]];then
            sort -u "$url" | printf "$url\n"
        fi
    fi

done

Thanks.

1
  • Please add your efforts (in the form of code) to your question; this is highly encouraged on SO and helps avoid downvotes and close votes. Thank you. Commented Apr 8, 2021 at 6:54

2 Answers

1

You can try the code below:

awk '!a[$0]++' file

It simply checks whether the line is already present in the array a; if it is not, the line is printed.
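For readers less familiar with awk, the same first-occurrence filter can be sketched in Python. The function name `unique_lines` and the sample list are made up for this illustration. Because the whole line is the key, URLs that differ only in their parameter values are still treated as different lines.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, like awk '!a[$0]++'."""
    seen = set()
    for line in lines:
        if line not in seen:  # corresponds to a[$0] still being 0
            seen.add(line)    # corresponds to the ++ that marks the line
            yield line


# Hypothetical sample: the second entry is an exact duplicate of the first,
# the third differs only in the parameter value.
sample = [
    "https://example.com/?first=one",
    "https://example.com/?first=one",
    "https://example.com/?first=only",
]
for line in unique_lines(sample):
    print(line)
```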

2 Comments

I just updated my post, could you re-check it?
@sof31, it will still work; please check and let me know.
0

A two-step approach might be the simplest to understand.

First print out the values of first and second, along with the URLs:

< a.txt awk -F'[?&]' '{for(i=2;i<=NF;i++){split($i,a,"=");p[a[1]]=a[2]};
                      print $1" "p["first"]" "p["second"]}'
https://example.com/ one two
https://example.com/ one two
https://example.com/ one two
https://stack.com/ one two
https://stack.com/ one two

Now change the print statement at the end, and turn it into a filter:

< a.txt awk -F'[?&]' '{for(i=2;i<=NF;i++){split($i,a,"=");p[a[1]]=a[2]}}
                      !seen[$1""p["first"]""p["second"]]++'
https://example.com/?first=one&second=two&third=three
https://stack.com/?sixth=six&seventh=seven&eighth=eight

In comments you asked for a generic solution, which takes into account every parameter, not just first and second.

I would use Python for this:

#!/usr/bin/python3

# test.py

import sys
from urllib.parse import urlparse, parse_qsl

seen = {}
for line in sys.stdin:
    url = urlparse(line.strip())
    # create a search lookup of sorted parameters, scheme and domain
    sorted_params = sorted(parse_qsl(url.query), key=lambda x:x[0])
    check_str = '{}://{}?{}'.format(
        url.scheme,
        url.netloc,
        '&'.join(['='.join(p) for p in sorted_params]),
    )
    # check if this combination of parameters and values has been seen before
    if check_str not in seen:
        seen[check_str] = 1
        print(line.strip())

Run it like this:

< input.file python3 test.py
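The question says that values are not important, so a variant that keys only on the scheme, the domain and the sorted parameter names may be closer to what was asked. The following is a minimal sketch under that assumption, not a definitive implementation; the function name `dedupe_by_param_names` and the optional `param_count` filter (modelled on the `count_eq == "3"` check in the question's UPDATE) are hypothetical.

```python
from urllib.parse import urlparse, parse_qsl


def dedupe_by_param_names(lines, param_count=None):
    """Keep the first URL for each (scheme, domain, parameter names) key.

    param_count is a hypothetical optional filter: when it is set, URLs
    whose number of distinct parameter names differs are dropped entirely.
    """
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        url = urlparse(line)
        # keep_blank_values=True so that "a=&b=1" still counts parameter "a"
        pairs = parse_qsl(url.query, keep_blank_values=True)
        # values are ignored; only the distinct parameter names matter
        names = tuple(sorted({name for name, _ in pairs}))
        if param_count is not None and len(names) != param_count:
            continue
        key = (url.scheme, url.netloc, names)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


# The sample input from the question
urls = [
    "https://example.com/?first=one&second=two&third=three",
    "https://example.com/?first=only&second=cureabout&third=theparam",
    "https://example.com/?fourth=four&fifth=five",
    "https://stack.com/?sixth=six&seventh=seven&eighth=eight",
    "https://stack.com/?sixth=itdoesnt&seventh=matter&eighth=something",
]
for url in dedupe_by_param_names(urls, param_count=3):
    print(url)
```

On the question's sample input, passing `param_count=3` yields the two URLs from the desired result. Without it, the `fourth`/`fifth` URL is also kept, because its parameter names are distinct from those of the other example.com lines.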

6 Comments

I have two problems with this code: 1- Assume that there are several lines of URLs with different parameters; in the code above I have to specify the parameters first and second explicitly. 2- It removes = and the value after it. I want the URL to be printed completely, like: `https://example.com/?first=one&second=two&third=three`
Make sure to run the code as I've posted it, and use the final, second version at the bottom.
I read your update late, so consider the first problem. Thanks.
In your question you explicitly mentioned first and second. For a generic solution, taking into account all possible url parameters, I would suggest something other than awk, like Python. Are you ok with Python?
I don't know much Python; I only want to get the job done, no matter which language is used.