How to remove duplicates in my python script?

Question

I used a regex search to filter results from a text file (searching for ".js"), which gave me roughly 16 results, some of which are duplicates. I want to remove the duplicates from that output and either print it to the console or redirect it into a file. I have tried sets and dict.fromkeys with no success. Here is what I have at the moment. Thank you in advance:

#!/usr/bin/python

import re
import sys

pattern = re.compile("[^/]*\.js")

for i, line in enumerate(open('access_log.txt')):
    for match in re.findall(pattern, line):
        x = str(match)
        print x

As an aside, you are using Python 2, which is end of life. Move to Python 3 if you can.
– tdelaney, Commented May 17, 2020 at 0:35
Welcome to SO! Check out the tour and How to Ask if you want advice. SO is not a code-writing service, so please post your best attempt, even if it didn't work. For reference, see "How do I ask and answer homework questions?". By the way, Python 2 hit end of life in January, so unless you need it for a job or something, stop learning it and learn Python 3 instead. Python 3 is much better.
– wjandrea, Commented May 17, 2020 at 0:37
Just put the matches into a list, then see "Removing duplicates in lists".
– wjandrea, Commented May 17, 2020 at 0:40
Restricted to Python 2, unfortunately. I'm working on a VM created by someone else, with an explicit request to use Python 2.
– danjl, Commented May 17, 2020 at 0:42
You were on the right track with sets. A set is the right data structure for adding values to a collection and testing whether a value already exists in that collection. One more point of advice: when you're reading a file, you typically want to use the with open(...) pattern, which ensures the file is closed when you're done with it, even if an error occurs.
– grayshirt, Commented May 17, 2020 at 0:47
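
The list-then-deduplicate idea from the comments can be sketched as below. This is a minimal illustration, not the asker's code: the matches list is hypothetical sample data, and OrderedDict.fromkeys is used in place of the plain dict.fromkeys mentioned in the question, since a plain dict on Python 2.7 does not keep insertion order.

```python
from collections import OrderedDict

# Hypothetical sample data standing in for the regex matches.
matches = ["app.js", "lib.js", "app.js", "main.js", "lib.js"]

# OrderedDict.fromkeys keeps the first occurrence of each key, in order.
unique_in_order = list(OrderedDict.fromkeys(matches))

# A set also drops duplicates but promises no order, so sort it if order matters.
unique_sorted = sorted(set(matches))

print(unique_in_order)
print(unique_sorted)
```

Either form works on Python 2.7 and Python 3. Which one fits depends on whether first-seen order or sorted order is wanted.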

Erwin Zangwill · Accepted Answer · 2020-05-17 00:33:53Z

Why wouldn't a set work? What went wrong? Did you try it like this?

import re

pattern = re.compile(r"[^/]*\.js")
results = set()  # holds each distinct match exactly once

# 'with' closes the file even if an error occurs
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            results.add(match)
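
The snippet above only builds the set. One possible way to get the output the question asks for (console or file) is sketched below. The helper name unique_matches, the sample log lines, and the output path are all assumptions made for illustration.

```python
from __future__ import print_function  # same code runs on Python 2.7 and 3

import os
import re
import tempfile

pattern = re.compile(r"[^/]*\.js")


def unique_matches(lines):
    """Collect every distinct pattern match from an iterable of lines."""
    results = set()
    for line in lines:
        results.update(pattern.findall(line))
    return results


# Hypothetical log lines standing in for access_log.txt.
sample_lines = [
    "GET /static/app.js HTTP/1.1\n",
    "GET /static/lib.js HTTP/1.1\n",
    "GET /static/app.js HTTP/1.1\n",
]
results = unique_matches(sample_lines)

# Option 1: print to the console in sorted order.
for name in sorted(results):
    print(name)

# Option 2: write one name per line to a file (path chosen for illustration).
out_path = os.path.join(tempfile.gettempdir(), "unique_js.txt")
with open(out_path, "w") as out:
    for name in sorted(results):
        out.write(name + "\n")
```

With the real log, an open file object can be passed straight in, for example unique_matches(f) inside a with open('access_log.txt') as f: block. Printing also allows ordinary shell redirection of the script's output into a file.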

Booboo · Accepted Answer · 2020-05-17 12:30:54Z

Using sets to eliminate duplicates:

#!/usr/bin/python

import re

pattern = re.compile(r"[^/]*\.js")

matches = set()
with open('access_log.txt') as f:
    for line in f:
        for match in re.findall(pattern, line):
            #x = str(match) # or just use match
            if match not in matches:
                print match
                matches.add(match)

But I question your regex:

You are doing a findall on each line, which suggests that each line might have multiple "hits", such as:

file1.js file2.js file3.js

But in your regex:

[^/]*\.js

[^/]* is greedy, so it would return only one match: everything from the start of the line through the last ".js".

If you made the match non-greedy, i.e. [^/]*?, then you would get 3 matches:

'file1.js'
' file2.js'
' file3.js'
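
The greedy versus non-greedy behaviour described above can be checked with a short sketch, using the same sample line:

```python
from __future__ import print_function  # same code runs on Python 2.7 and 3

import re

line = "file1.js file2.js file3.js"

# Greedy: [^/]* grabs as much as it can, then backs up to the last ".js".
greedy = re.findall(r"[^/]*\.js", line)

# Non-greedy: [^/]*? stops at the first ".js" it can reach, three times over.
lazy = re.findall(r"[^/]*?\.js", line)

print(greedy)  # ['file1.js file2.js file3.js']
print(lazy)    # ['file1.js', ' file2.js', ' file3.js']
```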

But that highlights another potential problem. Do you really want those spaces in the second and third matches for these particular cases? Perhaps in the case of /abc/ def.js you would keep the leading blank that follows /abc/.

So I would suggest:

#!/usr/bin/python

import re

pattern = re.compile(r"""
    (?x)            # verbose mode
    (?:             # first alternative:
        (?<=/)      # positive lookbehind assertion: preceded by '/'
        [^/]*?      # matches non-greedily 0 or more non-'/'
    |               # second alternative
        (?<!/)      # negative lookbehind assertion: not preceded by '/'
        [^/\s]*?    # matches non-greedily 0 or more characters that are neither '/' nor whitespace
    )
    \.js            # matches '.js'
    """)

matches = set()
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            if match not in matches:
                print match
                matches.add(match)

If the filename cannot have any whitespace, then just use:

pattern = re.compile(r"[^\s/]*?\.js")
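
To compare the two suggested patterns, here is a small sketch run against the sample strings from this answer. The variable names are made up for illustration, and re.VERBOSE is passed as a flag here rather than written inline as (?x).

```python
from __future__ import print_function  # same code runs on Python 2.7 and 3

import re

# Two-alternative pattern: keeps a leading blank only directly after '/'.
with_blank = re.compile(r"""
    (?:
        (?<=/) [^/]*?       # right after '/': anything except '/'
    |
        (?<!/) [^/\s]*?     # elsewhere: neither '/' nor whitespace
    )
    \.js
    """, re.VERBOSE)

# Simpler pattern for filenames that cannot contain whitespace.
no_blank = re.compile(r"[^\s/]*?\.js")

samples = ["file1.js file2.js file3.js", "/abc/ def.js"]
for text in samples:
    print(with_blank.findall(text), no_blank.findall(text))
```

On the first sample both patterns return the three filenames without leading spaces. On the second, the two-alternative pattern returns ' def.js' with its leading blank, while the simpler one returns 'def.js'.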
