I used a regex search to filter results from a text file (searching for ".js"), which gave me about 16 results, some of which are duplicates. I want to remove the duplicates from that output and either print the result to the console or redirect it into a file. I have tried sets and dict.fromkeys with no success! Here is what I have at the moment. Thank you in advance:

#!/usr/bin/python

import re
import sys

pattern = re.compile("[^/]*\.js")

for i, line in enumerate(open('access_log.txt')):
    for match in re.findall(pattern, line):
        x = str(match)
        print x
  • As an aside, you are using Python 2, which is end of life. Move to Python 3 if you can. Commented May 17, 2020 at 0:35
  • Welcome to SO! Check out the tour and How to Ask if you want advice. SO is not a code-writing service, so please post your best attempt, even if it didn't work. For reference see How do I ask and answer homework questions?. By the way, Python 2 hit end of life in January, so unless you need it for a job or something, stop learning it and learn Python 3 instead. Python 3 is much better. Commented May 17, 2020 at 0:37
  • Just put the matches into a list then see Removing duplicates in lists Commented May 17, 2020 at 0:40
  • Restricted to Python 2, unfortunately: I am working on a VM created by someone else, with an explicit request to use Python 2. Commented May 17, 2020 at 0:42
  • You were on the right track with sets. A set is the right data structure for adding values to a collection and testing whether a value already exists in it. One more piece of advice: when you read a file, you typically want to use the with open(...) pattern, which ensures the file is closed when you're done with it, even if an error occurs. Commented May 17, 2020 at 0:47

2 Answers

Why wouldn't a set work? What went wrong when you tried it? Did you try something like the following?

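import re

pattern = re.compile(r"[^/]*\.js")  # raw string keeps the backslash literal
results = set()

# 'with' closes the file even if an error occurs
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            results.add(match)  # a set silently ignores duplicates

for name in results:
    print(name)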
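A set does not remember insertion order, so if the order in which names first appear in the log matters, one option is to keep a list next to the set. The following is a minimal sketch under stated assumptions: `unique_matches`, `write_matches`, and the sample lines are hypothetical names and data made up for illustration, not part of the original question. It is written to run on both Python 2 and Python 3.

```python
import re

# Same pattern as in the question, written as a raw string.
pattern = re.compile(r"[^/]*\.js")

def unique_matches(lines, pattern):
    """Return each distinct match once, in the order it was first seen."""
    seen = set()      # fast membership test
    ordered = []      # preserves first-seen order
    for line in lines:
        for match in pattern.findall(line):
            if match not in seen:
                seen.add(match)
                ordered.append(match)
    return ordered

def write_matches(matches, path):
    """Write one match per line to the file at `path`."""
    with open(path, "w") as out:
        for match in matches:
            out.write(match + "\n")

# Made-up sample lines standing in for access_log.txt.
sample = [
    "GET /static/app.js HTTP/1.1",
    "GET /static/lib/jquery.js HTTP/1.1",
    "GET /static/app.js HTTP/1.1",
]
for name in unique_matches(sample, pattern):
    print(name)
```

To process the real log, pass the open file object instead of `sample`, and call `write_matches` with a path of your choice if you want the result in a file rather than on the console.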

Using sets to eliminate duplicates:

#!/usr/bin/python

import re

pattern = re.compile(r"[^/]*\.js")

matches = set()
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            # match is already a string, so no str() call is needed
            if match not in matches:
                print(match)  # print only the first occurrence
                matches.add(match)

But I question your regex:

You are doing a findall on each line, which suggests that each line might have multiple "hits", such as:

file1.js file2.js file3.js

But in your regex:

[^/]*\.js

[^/]* is greedy, so on that line findall would return only one match: everything up to the last .js, which here is the complete line.

If you made the match non-greedy, i.e. [^/]*?, then you would get 3 matches:

'file1.js'
' file2.js'
' file3.js'
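The difference is easy to check in an interpreter. This short sketch runs both versions of the pattern on the sample line above; the variable names are only for illustration.

```python
import re

line = "file1.js file2.js file3.js"

# Greedy: [^/]* grabs as much as it can, then backs off to the last '.js'.
greedy = re.findall(r"[^/]*\.js", line)

# Non-greedy: [^/]*? stops at the first '.js' it can reach.
lazy = re.findall(r"[^/]*?\.js", line)

print(greedy)  # ['file1.js file2.js file3.js']
print(lazy)    # ['file1.js', ' file2.js', ' file3.js']
```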

But that highlights another potential problem. Do you really want those spaces in the second and third matches for these particular cases? Perhaps in the case of /abc/ def.js you would keep the leading blank that follows /abc/.

So I would suggest:

#!/usr/bin/python

import re

pattern = re.compile(r"""
    (?:             # first alternative:
        (?<=/)      # positive lookbehind assertion: preceded by '/'
        [^/]*?      # matches non-greedily 0 or more non-'/' characters
    |               # second alternative:
        (?<!/)      # negative lookbehind assertion: not preceded by '/'
        [^/\s]*?    # matches non-greedily 0 or more characters that are neither '/' nor whitespace
    )
    \.js            # matches '.js'
    """, re.VERBOSE)  # verbose mode: whitespace and comments in the pattern are ignored

matches = set()
with open('access_log.txt') as f:
    for line in f:
        for match in pattern.findall(line):
            if match not in matches:
                print(match)
                matches.add(match)

If the filename cannot have any whitespace, then just use:

pattern = re.compile(r"[^\s/]*?\.js")
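To see how the two suggestions differ, here is a small sketch that runs both on the two sample inputs discussed above. The names `with_blank` and `no_blank` are made up for this comparison, and the longer pattern is compacted with the verbose flag passed explicitly.

```python
import re

# Pattern with the two lookbehind alternatives.
with_blank = re.compile(r"""
    (?: (?<=/) [^/]*?       # right after '/': allow anything but '/'
      | (?<!/) [^/\s]*?     # elsewhere: no '/' and no whitespace
    )
    \.js
    """, re.VERBOSE)

# Simpler pattern for filenames that never contain whitespace.
no_blank = re.compile(r"[^\s/]*?\.js")

for line in ["file1.js file2.js file3.js", "/abc/ def.js"]:
    print(with_blank.findall(line))
    print(no_blank.findall(line))
```

Both patterns return the three bare filenames for the first line. They differ only on the second: the lookbehind version keeps the blank that follows `/abc/` and returns `' def.js'`, while the simpler one returns `'def.js'`.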
