Generates random entries in a particular format

Question

My code below generates entries for my program, but it's very slow. I'm looking to generate about 10 million entries. Is there any way to speed it up?

FirstNames.txt, LastNames.txt, and Objects.txt are all files with one entry per line.

TempList=[]
maxData=10000000 #The maximum amount of entries that can be produced
import random,pickle,time,math,statistics

FirstNames    = './FirstNames.txt'
LastNames     = './LastNames.txt'
Objects     = './Objects.txt'

def rawCount(filename):
    with open(filename, 'rb') as f:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.raw.read

        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)
        return lines

def randomLine(filename):
    num = int(random.uniform(0, rawCount(filename)))
    with open(filename, 'r') as f:
        for i, line in enumerate(f, 1):
            if i == num:
                break
    return line.strip('\n')

def str_time_prop(start, end, format, prop):

    stime = time.mktime(time.strptime(start, format))
    etime = time.mktime(time.strptime(end, format))

    ptime = stime + prop * (etime - stime)

    return time.strftime(format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%m/%d/%Y', prop)

def numCheck(question,low,high):
    global errorState
    errorState = True
    while errorState == True:
        checkString = input(question)
        if len(checkString) == 0:
            print("\nYou have to enter something!\n")
        elif not checkString.isdigit():
            print("\nThat's not a number!\n")
        elif not low <= int(checkString) <= high:
            print("\nThe number must be between "+str(low)+" and "+str(high)+"!\n")
        else:
            errorState = False
    return checkString

def yesNoCheck(question):
    while True:
        sel = input("> ")
        if sel.lower() == "y":
            return True
        elif sel.lower() == "n":
            return False
        else:
            print("\nPlease type either 'y' or 'n'.\n")
            
last_times = []
def get_remaining_time(i, total, time):
    last_times.append(time)
    len_last_t = len(last_times)
    if len_last_t > 500:
        last_times.pop(0)
    mean_t = statistics.median(last_times)
    remain_s_tot = mean_t * (int(total) - i + 1) 
    remain_m = round(remain_s_tot / 60)
    remain_s = round(remain_s_tot % 60)
    #return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s"
    return "Time left: "+str(remain_m)+"m "+str(remain_s)+"s."

#Ordered

MainList=[]
RaffleList=[]
TempList=[]

def addstuff():
    global TempList,MainList
    
    Name = str(randomLine(FirstNames)+" "+randomLine(LastNames))
    Amount = random.choice(range(1,500))
    Datehire = random_date("1/1/2008", "1/1/2030", random.random())
    Datereturn = random_date("2/1/2030", "1/1/2060", random.random())
    RandomObject = str(randomLine(Objects))

    TempList.append(Name) #Customer name
    TempList.append(str(random.choice(range(10000000,99999999)))) #Reciept number
    TempList.append(RandomObject) #Item hired
    TempList.append(str(Amount)) #Item Amount
    TempList.append(Datehire) #Date hired
    TempList.append(Datereturn) #Date returned
    TempList.append(str(math.ceil(int(Amount) / 25))) #Boxes needed
    raffle=str(random.choice(range(1,1000)))
    RaffleList.append(raffle)

    MainList+=[TempList]
    lista=TempList
    TempList=[]
    return lista,raffle

print("Random data generator\nHow many entries do you want?")
copies = numCheck("> ",1,maxData)
last_t = 0
print("Generating entries...\n")
for x in range(1,int(copies)):
    t = time.time()
    lista = addstuff()
    last_t = time.time() - t
    remain = get_remaining_time(x, copies, last_t)
    if x % 250 == 0:
        print(str(x)+")\t"+str(remain))
    


print("\nGeneration done.\n\nDo you want to save? (y/n)")
sel = yesNoCheck("> ")
if sel == True:
    with open('data1.dat', 'wb') as x:
        pickle.dump(MainList, x)
    with open('data2.dat', 'wb') as x:
        pickle.dump(RaffleList, x)
    print("\nSaved.")
    time.sleep(2)


else:
    print("Okay, don't know why you generated but cya!")
    time.sleep(2)

RootTwo · Accepted Answer · 2020-08-02 01:08:16Z

If you really want to know where to look at speeding things up, use a profiler. There is one in the standard library (cProfile). There are also third-party libraries.
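As a minimal sketch of what profiling could look like with the standard library's cProfile and pstats modules: the functions `slow_pick` and `generate` below are hypothetical stand-ins for the question's code, not part of it.

```python
import cProfile
import io
import pstats
import random

def slow_pick(items):
    # Hypothetical stand-in for randomLine(): walks the list instead of indexing it
    target = random.randrange(len(items))
    for i, item in enumerate(items):
        if i == target:
            return item

def generate(n):
    # Hypothetical stand-in for the main generation loop
    items = [str(i) for i in range(1000)]
    return [slow_pick(items) for _ in range(n)]

# Profile only the part of the program we care about
profiler = cProfile.Profile()
profiler.enable()
result = generate(2000)
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats('cumulative')
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

A whole script can also be profiled from the command line with `python -m cProfile -s cumulative your_script.py`, where `your_script.py` is a placeholder name.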

My guess is that randomLine() and rawCount() are the biggest time sinks.

rawCount() reads an entire file to count its lines. randomLine() first calls rawCount() and then reads part of the file again. To randomly select a line, randomLine() reads the file an average of 1.5 times and makes two calls to open(), two to close(), and at least two to read().

3 files × 6 function calls × 10 million random records = 180 million calls. That's a lot of I/O.

Instead, read a file into a list once. Then use random.choice() to pick an item. The functionality can be put into a convenient class:

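import random

class RandomLineChooser:
    """Read a file into memory once, then serve random lines from the list."""

    def __init__(self, filename):
        with open(filename) as f:
            # Strip the trailing newline, as the original randomLine() did
            self.lines = [line.rstrip('\n') for line in f]

    def choose(self):
        return random.choice(self.lines)

# FirstNames, LastNames and Objects are the file paths defined in the question
firstnames = RandomLineChooser(FirstNames)
lastnames = RandomLineChooser(LastNames)
objects = RandomLineChooser(Objects)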
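To get a feel for the difference, here is a rough timing sketch. It builds a throwaway file of made-up names (an assumption for illustration) and compares re-reading the file on every pick against picking from a list loaded once. The helper names `pick_by_scanning` and `pick_from_memory` are hypothetical, and the exact timings will vary by machine.

```python
import os
import random
import tempfile
import timeit

# Build a throwaway file with one entry per line (made-up sample data)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(f'name{i}' for i in range(20000)))
    path = tmp.name

def pick_by_scanning(filename):
    """Re-read the file on every call, roughly what randomLine() ends up doing."""
    with open(filename) as f:
        lines = f.read().splitlines()
    return random.choice(lines)

# Load once; afterwards every pick is a constant-time list lookup
with open(path) as f:
    cached_lines = f.read().splitlines()

def pick_from_memory():
    return random.choice(cached_lines)

scan_time = timeit.timeit(lambda: pick_by_scanning(path), number=200)
memory_time = timeit.timeit(pick_from_memory, number=200)
print(f"re-reading the file: {scan_time:.4f}s for 200 picks")
print(f"in-memory list:      {memory_time:.4f}s for 200 picks")

os.remove(path)
```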

I'll also point out two useful Python libraries:

Faker, which is designed to generate fake data, and
Hypothesis, which is designed for testing, but can be used to generate fake data as well.

Hi, thanks for your answer. Reading the file into a list sounds like it would speed up the process heaps, so I'll try that when I get home. I also appreciate the library suggestions, but I'd like to at least try and make it myself, to improve my skill in Python :).
– Joyte, Commented Aug 2, 2020 at 1:17

Reinderien · Accepted Answer · 2020-07-30 01:42:16Z

Numpy

Use it, or perhaps Pandas, which is built on top of it. Vectorization with these libraries will get you most of the way to a performant solution. This would replace your pickle.dump, and change the internal format of MainList and RaffleList.
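A vectorized version might look something like the sketch below. It assumes NumPy is installed, uses small made-up arrays in place of the three text files, and picks hire dates uniformly by day, which is a simplification of the question's random_date(). The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng()
n = 1000  # number of records to generate

# Made-up stand-ins for the contents of the three text files
first_names = np.array(["Ada", "Grace", "Linus"])
last_names = np.array(["Lovelace", "Hopper", "Torvalds"])
objects = np.array(["Chair", "Table", "Tent"])

# Each call below produces all n values at once instead of looping in Python
first = rng.choice(first_names, size=n)
last = rng.choice(last_names, size=n)
names = np.char.add(np.char.add(first, " "), last)
items = rng.choice(objects, size=n)
amounts = rng.integers(1, 500, size=n)                   # 1..499, like range(1, 500)
receipts = rng.integers(10_000_000, 99_999_999, size=n)  # upper bound is exclusive
boxes = -(-amounts // 25)                                # ceiling division by 25
raffle = rng.integers(1, 1000, size=n)

# Random hire dates: a start date plus a random whole number of days
start = np.datetime64("2008-01-01")
end = np.datetime64("2030-01-01")
span_days = (end - start).astype(int)
hired = start + rng.integers(0, span_days, size=n).astype("timedelta64[D]")

print(names[:3], amounts[:3], boxes[:3], hired[:3])
```

The resulting arrays could then be written with something like np.savez in place of pickling nested lists.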

Divmod

Use divmod rather than a separate division and modulo here:

remain_m = round(remain_s_tot / 60)
remain_s = round(remain_s_tot % 60)
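A possible divmod version is sketched below; `format_remaining` is a hypothetical helper name. Rounding once to whole seconds before splitting also avoids a quirk of the two lines above, where 95.4 seconds comes out as "2m 35s" because the minutes are rounded up independently of the seconds.

```python
def format_remaining(remain_s_tot):
    # Round once to whole seconds, then split into minutes and seconds
    remain_m, remain_s = divmod(round(remain_s_tot), 60)
    return f"Time left: {remain_m}m {remain_s}s."

print(format_remaining(95.4))  # Time left: 1m 35s.
```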

Boolean selection

    if sel.lower() == "y":
        return True
    elif sel.lower() == "n":
        return False
    else:
        print("\nPlease type either 'y' or 'n'.\n")

can be

sel = input('> ').lower()
if sel in {'y', 'n'}:
    return sel == 'y'
print("\nPlease type either 'y' or 'n'.\n")

Randomly-chosen line

randomLine does not need to iterate at all. Instead, assuming that the line lengths are (within reason) on the same order of magnitude, you can simply:

Get the length of the file
Seek to a random position in the file
Read a buffer large enough to probably contain a newline
Consume to that newline
Consume to the next newline, and you have your random line.
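The steps above could be sketched roughly as follows. This version leans on readline() to find the newlines instead of managing a buffer by hand, and wraps around to the first line when the random offset lands in the last one. Both are choices made for this sketch, as is the name `random_line_by_seek`. It assumes a non-empty UTF-8 file, and lines that follow long lines are picked somewhat more often, which is why similar line lengths matter.

```python
import os
import random
import tempfile

def random_line_by_seek(filename):
    """Return an approximately random line without scanning the whole file."""
    size = os.path.getsize(filename)        # length of the file in bytes
    with open(filename, 'rb') as f:
        f.seek(random.randrange(size))      # jump to a random byte offset
        f.readline()                        # discard the rest of the line we landed in
        line = f.readline()                 # the next full line is the pick
        if not line:
            # We landed in the last line: wrap around to the first one
            f.seek(0)
            line = f.readline()
    return line.rstrip(b'\r\n').decode('utf-8')

# Small demonstration on a throwaway file with made-up names
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("Alice\nBob\nCarol\n")
    demo_path = tmp.name

picks = {random_line_by_seek(demo_path) for _ in range(300)}
print(picks)
os.remove(demo_path)
```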

Hi, thanks for answering. I'm not exactly sure how to do the stuff about buffers and all that, but I'm probably just going to use the other guy's solution of putting the whole file into a list, as that seems like it would speed it up a lot.
– Joyte, Commented Aug 2, 2020 at 1:14
