1

I am trying to populate a list in Python 3 with 3 random items read from a file using a regex, but I keep getting duplicate items in the list. Here is an example.

import re
import random as rn

data = '/root/Desktop/Selenium[FILTERED].log'
with open(data, 'r') as inFile:
    index = inFile.read()
    URLS = re.findall(r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', index)

    list_0 = []
    for i in range(3):
        list_0.append(URLS[rn.randint(1, 30)])
    inFile.close()

for i in range(len(list_0)):
    print(list_0[i])

What would be the cleanest way to prevent duplicate items being appended to the list?

(EDIT) This is the code that I think has done the job quite well.

def random_sample(data):
    # raw strings keep the regex backslashes from being read as string escapes
    r_e = [r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}', '..']
    with open(data, 'r') as inFile:
        urls = re.findall(r_e[0], inFile.read())
        x = list(set(urls))  # drop duplicate URLs
    return x

data = '/root/Desktop/[TEMP].log'
sample = random_sample(data)
for i in range(3):
    print(sample[i])

A set is an unordered collection with no duplicate entries.

2 Answers

3

Use the standard library's random.sample.

random.sample(population, k)
    Return a k length list of unique elements chosen from the population sequence or set.
    Used for random sampling without replacement.
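
As a minimal sketch of the call (the `urls` list below is made-up sample data, not the contents of the asker's log), `random.sample` draws `k` distinct positions from a sequence in a single call:

```python
import random

# Hypothetical stand-in for the URLs matched by re.findall
urls = ['url1', 'url2', 'url3', 'url4', 'url5']

# Draw 3 items without replacement: no position is picked twice
picked = random.sample(urls, 3)
print(picked)
```

Uniqueness here is by position, so a population that itself holds repeated values can still yield repeated values in the sample. Asking for more items than the population holds raises a ValueError.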

Addendum

After seeing your edit, it looks like you've made things much harder than they have to be. I've hardwired a list of URLs in the following example, but the source doesn't matter. Selecting the (guaranteed unique) subset is essentially a one-liner with random.sample:

import random

# the following two lines are easily replaced
URLS = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6', 'url7', 'url8']
SUBSET_SIZE = 3

# the following one-liner yields the randomized subset as a list
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]
print(urlList)    # produces, e.g., => ['url7', 'url3', 'url4']

Note that by using len(URLS) and SUBSET_SIZE, the one-liner that does the work is not hardwired to the size of the list or to the desired subset size.


Addendum 2

If the original list of inputs contains duplicate values, the following slight modification will fix things for you:

URLS = list(set(URLS))  # this converts to a set for uniqueness, then back for indexing
urlList = [URLS[i] for i in random.sample(range(len(URLS)), SUBSET_SIZE)]

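Or more directly, sample the de-duplicated values themselves instead of their indices. Keep the list conversion: passing a set to random.sample was deprecated in Python 3.9 and raises a TypeError from Python 3.11 on.

URLS = list(set(URLS))
urlList = random.sample(URLS, SUBSET_SIZE)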
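
Putting the pieces together with the regex from the question, a rough end-to-end sketch might look like the following. The `log_text` string is invented sample data standing in for the log file, and `dict.fromkeys` is swapped in for `set` here only because it keeps the first-seen order, which makes runs easier to reason about. Neither is from the original post.

```python
import random
import re

# Hypothetical log text standing in for the asker's file; one URL is repeated
log_text = """
GET https://www.example.com/view?i=abc123 200
GET https://www.example.com/view?i=def456 200
GET https://www.example.com/view?i=abc123 200
GET https://www.example.com/view?i=ghi789 200
GET https://www.example.com/view?i=jkl012 200
"""

URL_PATTERN = r'https://www\.\w{1,10}\.com/view\?i=\w{1,20}'
SUBSET_SIZE = 3

urls = re.findall(URL_PATTERN, log_text)   # 5 matches, one of them repeated
unique_urls = list(dict.fromkeys(urls))    # 4 distinct URLs, first-seen order kept

# Guard against asking for more items than exist
k = min(SUBSET_SIZE, len(unique_urls))
url_list = random.sample(unique_urls, k)
print(url_list)
```

De-duplicating before sampling is what guarantees distinct values; sampling first and filtering afterwards could leave fewer than SUBSET_SIZE items.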

8 Comments

Once I figured out how to get a single integer out of random.sample, with randValue = URLS[rn.sample(range(30), k=1)[0]], it seems to work well, thanks.
Okay, after a few runs it seems the list was not being populated with unique URLS values; it was still getting duplicate items. So I've had to revert to the original fix for now.
If you want uniqueness across runs, you need to store state. If you want uniqueness across the entire set, shuffle (or sample) the full list and iterate through it. Obviously it's impossible to get uniqueness if you want to poll more items than are on the list.
It seems that even on a single run the program still occasionally returns duplicate items in the list. I pasted the code as-is into PyCharm, replaced the list with the re.findall result, and changed the subset size to 5. The result was similar to what I had when I first posted the question.
Just to let you know: now that I've paid closer attention, there are duplicate URLs in the source file, so I assume this is why the method you suggested does not work for my needs. The values pulled from the function are unique picks, but it does not check for literal duplicates being appended to the list.
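
The shuffle-and-iterate suggestion from the comments above can be sketched as follows, assuming the pool has already been de-duplicated. The names `pool`, `first_batch` and `second_batch` are illustrative only. Uniqueness across separate program runs would still need state saved somewhere, which this sketch does not attempt.

```python
import random

# Hypothetical, already de-duplicated pool of URLs
pool = ['url1', 'url2', 'url3', 'url4', 'url5', 'url6']

random.shuffle(pool)      # reorder the list in place, once

first_batch = pool[:3]    # take items from the front...
second_batch = pool[3:6]  # ...and continue where the last batch stopped
print(first_batch, second_batch)
```

Because every item appears exactly once in the shuffled list, consecutive slices can never overlap.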
1
seen = set(list_0)
randValue = URLS[rn.randint(1, 30)]

# [...]

if randValue not in seen:
  seen.add(randValue)
  list_0.append(randValue)

Now you just need to stop the loop once the size of list_0 reaches 3.
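
One way the complete loop might look is sketched below. The `URLS` list is invented sample data, `WANTED` and `target` are hypothetical names, and `rn.choice` is used in place of `rn.randint(1, 30)` so that index 0 is reachable and the list length is not hardwired.

```python
import random as rn

# Hypothetical stand-in for the re.findall result, duplicates included
URLS = ['url1', 'url2', 'url2', 'url3', 'url1', 'url4']
WANTED = 3

list_0 = []
seen = set()

# Cap the target so the loop ends even if there are fewer than WANTED distinct values
target = min(WANTED, len(set(URLS)))

while len(list_0) < target:
    randValue = rn.choice(URLS)   # any element, including index 0
    if randValue not in seen:     # set membership test is O(1) on average
        seen.add(randValue)
        list_0.append(randValue)

print(list_0)
```

Without the cap on `target`, the loop would never finish when the source holds fewer distinct values than requested.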

1 Comment

That makes sense. I will have to look more into data structures and keep this in mind. I just hit a wall and could not find an answer with a brief search. Thank you.
