0

I have a dataframe containing 15K+ strings in the format xxxx-yyyyy-zzz, where yyyyy is a randomly generated 5-digit number. Given that xxxx is 1000 and zzz is 200, how can I generate a random yyyyy and add it to the dataframe so that the resulting string is unique?

           number
0  1000-12345-100
1  1000-82045-200
2  1000-93035-200
import pandas as pd

data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)
1
  • I'd generate a list of values between 0 and 99999 and zfill them so they are always at length 5. Then generate the strings (f"1000-{random.choice(list_with_numbers)}-200") and remove that number from the list. Commented Jun 7, 2021 at 21:04
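
The idea in this comment could be sketched roughly as follows. This is only an illustration under some assumptions: the names `existing`, `taken`, `pool` and `new_code` are made up for the example, the pool is shuffled once and popped from (instead of calling `random.choice` and then removing the value), and the middle block is assumed to be unique across all prefixes and suffixes.

```python
import random

# Existing strings; in the question these would come from df["number"].
existing = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]
taken = {s.split("-")[1] for s in existing}

# Every zero-padded 5-character value that is still free.
pool = [str(n).zfill(5) for n in range(100_000) if str(n).zfill(5) not in taken]
random.shuffle(pool)


def new_code(prefix="1000", suffix="200"):
    """Pop one unused middle value from the shuffled pool and format it."""
    return f"{prefix}-{pool.pop()}-{suffix}"


new_codes = [new_code() for _ in range(5)]
print(new_codes)
```

Because each value is popped from the pool, it cannot be handed out twice, and no retry loop is needed.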

5 Answers

1

I'd generate a new column with just the middle values and generate random numbers until you find one that's not in the column.

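from random import randint

# Extract the middle block of every existing string as an integer.
df["excl"] = df.number.apply(lambda x: int(x.split("-")[1]))

# Draw candidates until one is found that is not already in the column.
num = randint(10000, 99999)

while num in df.excl.values:
    num = randint(10000, 99999)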
    
1 Comment

This approach is simple, but traversing the list to check uniqueness will be inefficient.
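
One way to address this, sketched here as an assumption rather than as the answerer's method, is to keep the used middle values in a Python `set`, so each membership check takes constant time on average. The names `existing`, `used` and `next_unused` are hypothetical; with a dataframe, `existing` would be `df["number"]`.

```python
import random

# Existing strings; with a dataframe this would be df["number"].
existing = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]

# Build the set of used middle values once.
used = {int(s.split("-")[1]) for s in existing}


def next_unused(used_values, low=10000, high=99999):
    """Draw random integers in [low, high] until one is not in used_values.

    Note: this loops forever if every value in the range is already used.
    """
    while True:
        candidate = random.randint(low, high)
        if candidate not in used_values:
            used_values.add(candidate)  # remember it for later calls
            return candidate


candidate = next_unused(used)
new_code = f"1000-{candidate}-200"
print(new_code)
```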
1

I tried to come up with a generic approach that works on plain lists:

import random

number_series = ["1000-12345-100", "1000-82045-200", "1000-93035-200"]

def rnd_nums(n_numbers: int, number_series: list, max_length: int=5, prefix: int=1000, suffix: int=100):
    # ignore following numbers (a set keeps the membership test below fast)
    blacklist = {int(x.split('-')[1]) for x in number_series}
    # define space with allowed numbers
    rng = range(0, 10**max_length)
    # get unique sample of length "n_numbers"
    lst = random.sample([i for i in rng if i not in blacklist], n_numbers)
    # return sample as string with pre- and suffix, zero-padded to max_length
    return ['{}-{:0{}d}-{}'.format(prefix, mid, max_length, suffix) for mid in lst]

rnd_nums(5, number_series)

Out[69]: 
['1000-79396-100',
 '1000-30032-100',
 '1000-09188-100',
 '1000-18726-100',
 '1000-12139-100']

Or use it to generate new rows in a DataFrame:

import pandas as pd
data = {"number": ["1000-12345-100", "1000-82045-200", "1000-93035-200"]}
df = pd.DataFrame(data)
print(df)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
pd.concat([df, pd.DataFrame({'number': rnd_nums(5, number_series)})], ignore_index=True)

Out[72]:
           number
0  1000-12345-100
1  1000-82045-200
2  1000-93035-200
3  1000-00439-100
4  1000-36284-100
5  1000-64592-100
6  1000-50471-100
7  1000-02005-100

1

In addition to the other suggestions, you could write a function that takes your df and the number of new values to add, appends them, and returns the updated df. The function could look like this:

import pandas as pd
import random

def add_number(df, num):
    # Collect the middle values already in use; a set gives fast lookups.
    used = {int(n.split("-")[1]) for n in df["number"]}

    new_rows = []
    while len(new_rows) < num:
        new_number = random.randint(10000, 99999)
        # Only accept values that have not been used yet.
        if new_number not in used:
            used.add(new_number)
            new_rows.append("1000-%i-200" % new_number)

    # Append all new strings at once and renumber the index.
    return pd.concat([df, pd.DataFrame({"number": new_rows})], ignore_index=True)

This would have the advantage that you could use the function every time you want to add new numbers.

0

try:

import random
# range() excludes its upper bound, so use 100000 to allow 99999 as well
df['number'] = [f"1000-{x}-200" for x in random.sample(range(10000, 100000), len(df))]

output:

           number
0  1000-24744-200
1  1000-28991-200
2  1000-98322-200
...

0

One option is to use sample from the random module:

import random
num_digits = 5
col_length = 15000
rand_nums = random.sample(range(10**num_digits), col_length)
# str.join takes a single iterable, so the parts go in a list
data["number"] = ['-'.join(
        ['1000', str(num).zfill(num_digits), '200'])
    for num in rand_nums]

It took my computer about 30 ms to generate the numbers. For numbers with more digits, it may become infeasible.

Another option is to just take sequential integers, then encrypt them. This will result in a sequence in which each element is unique. They will be pseudo-random, rather than truly random, but then Python's random module is producing pseudo-random numbers as well.
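
A minimal sketch of this second idea, under stated assumptions: it is not a vetted cipher, only a toy keyed Feistel network with cycle walking, built on `hashlib.sha256` from the standard library. The names `scramble`, `_feistel`, `_round_value` and the key `b"demo-key"` are invented for the illustration. The point it shows is that an invertible mapping over 0..99999 sends distinct sequential integers to distinct outputs.

```python
import hashlib

DIGITS = 5
DOMAIN = 10 ** DIGITS            # 100000 possible middle values
HALF_BITS = 9                    # 2 * 9 = 18 bits, and 2**18 >= DOMAIN
HALF_MASK = (1 << HALF_BITS) - 1


def _round_value(half, round_no, key):
    """Keyed pseudo-random function used in one Feistel round."""
    data = key + bytes([round_no]) + half.to_bytes(2, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:2], "big") & HALF_MASK


def _feistel(value, key, rounds=4):
    """Permute an 18-bit integer; a Feistel network is invertible by construction."""
    left, right = value >> HALF_BITS, value & HALF_MASK
    for round_no in range(rounds):
        left, right = right, left ^ _round_value(right, round_no, key)
    return (left << HALF_BITS) | right


def scramble(index, key=b"demo-key"):
    """Map index in [0, DOMAIN) to a unique value in [0, DOMAIN)."""
    if not 0 <= index < DOMAIN:
        raise ValueError("index out of range")
    value = _feistel(index, key)
    while value >= DOMAIN:       # cycle walking: re-apply until back in range
        value = _feistel(value, key)
    return value


# Sequential integers in, unique scrambled 5-digit blocks out.
codes = [f"1000-{scramble(i):0{DIGITS}d}-200" for i in range(15000)]
print(codes[:3])
```

With this approach only a counter has to be stored to continue generating new unique values later; no list of used numbers is needed.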
