1

I have the following code for producing a big text file:

import random

d = 3
n = 100000
f = open("input.txt",'a')
s = ""
for j in range(0, d-1):
    s += str(round(random.uniform(0,1000), 3))+" "
s += str(round(random.uniform(0,1000), 3))
f.write(s)
for i in range(0, n-1):
    s = ""
    for j in range(0, d-1):
        s += str(round(random.uniform(0,1000), 3))+" "
    s += str(round(random.uniform(0,1000), 3))
    f.write("\n"+s)
f.close()

But it seems to be pretty slow to even generate 5GB of this.

How can I make it faster? I want the output to look like this:

796.802 691.462 803.664
849.483 201.948 452.155
144.174 526.745 826.565
986.685 238.462 49.885
137.617 416.243 515.474
366.199 687.629 423.929
2
  • 3
    One obvious fault is that you first concatenate the data into a string and then write the whole string to the file. Further, as for any performance question, you need to use a profiler to find out where most time is spent. Commented Dec 6, 2015 at 14:36
  • @UlrichEckhardt what would you suggest? I need the elements of each line to be separated by a single space. Commented Dec 6, 2015 at 14:37
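To follow up on the profiler suggestion in the comment above, here is a minimal sketch using the standard library's cProfile and pstats modules. The helper names `write_lines` and `profile_write` and the file name `profile_sample.txt` are made up for illustration; they do not come from the question.

```python
import cProfile
import io
import pstats
import random


def write_lines(path, n, d):
    """Write n lines of d random numbers each, in the question's format."""
    with open(path, 'w') as f:
        for _ in range(n):
            f.write(' '.join(str(round(random.uniform(0, 1000), 3))
                             for _ in range(d)))
            f.write('\n')


def profile_write(path, n=10000, d=3):
    """Run write_lines under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    write_lines(path, n, d)
    profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Show the ten entries with the largest cumulative time.
    stats.sort_stats('cumulative').print_stats(10)
    return report.getvalue()


if __name__ == '__main__':
    print(profile_write('profile_sample.txt'))
```

The report lists how much time is spent in `random.uniform`, `round`, `str.join`, and the file's `write` method, which should show whether number generation, formatting, or output dominates on a given machine.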

4 Answers

2

Well, of course, the whole thing is I/O bound. You can't output the file faster than the storage device can write it. Leaving that aside, there are some optimizations that could be made.

Your method of building up a long string from several shorter strings is suboptimal. You're saying, essentially, s = s1 + s2. When you tell Python to do this, it concatenates two string objects to make a new string object. This is slow, especially when repeated.

A much better way is to collect the individual string objects in a list or other iterable, then use the join method to run them together. For example:

>>> ''.join(['a', 'b', 'c'])
'abc'
>>> ', '.join(['a', 'b', 'c'])
'a, b, c'

Instead of n-1 string concatenations to join n strings, this does the whole thing in one step.

There's also a lot of repeated code that could be combined. Here's a cleaner design, still using the loops.

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(n):
    nums = []
    for j in range(d):
        nums.append(str(round(random.uniform(0, 1000), 3)))
    s = ' '.join(nums)
    f.write(s)
    f.write('\n')

f.close()

A cleaner, briefer, more Pythonic way is to use a list comprehension:

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(n):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')

f.close()

Note that in both cases, I wrote the newline separately. That should be faster than concatenating it to the string, since I/O is buffered anyway. If I were joining a list of strings without separators, I'd just tack on a newline as the last string before joining.
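A possible variation on the same idea, offered as a sketch rather than a measured improvement: build a batch of lines with join and hand each batch to a single write call, using a with block so the file is always closed. The names `random_line`, `write_random_lines`, and `batch_size` are assumptions made for this example.

```python
import random


def random_line(d):
    """Return d random numbers in [0, 1000], rounded to 3 places, space-separated."""
    return ' '.join(str(round(random.uniform(0, 1000), 3)) for _ in range(d))


def write_random_lines(path, n, d, batch_size=10000):
    """Write n lines to path, issuing one write() call per batch of lines."""
    with open(path, 'w') as f:  # closed automatically, even if an error occurs
        remaining = n
        while remaining > 0:
            count = min(batch_size, remaining)
            # Join the batch with newlines and end it with one,
            # so every line in the file is newline-terminated.
            f.write('\n'.join(random_line(d) for _ in range(count)) + '\n')
            remaining -= count


if __name__ == '__main__':
    write_random_lines('input.txt', n=1000, d=3)
```

Only one batch of strings is held in memory at a time, so the approach scales to files much larger than RAM. Whether it beats the per-line version is something to measure, since the file object already buffers writes.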

As Daniel's answer says, numpy is probably faster, but maybe you don't want to get into numpy yet; it sounds like you're kind of a beginner at this point.

2

Using numpy is probably faster:

import numpy
d = 3
n = 100000
data = numpy.random.uniform(0, 1000,size=(n,d))
numpy.savetxt("input.txt", data, fmt='%.3f')

2 Comments

data has to be in memory. Might be a problem for really large files.
Could break it into large chunks. Still, I think the main problem is that it's I/O bound.
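The chunking idea from the comments might look roughly like this. It is a sketch: the function name `write_in_chunks` and the `chunk_rows` default are assumptions, and it relies on numpy.savetxt accepting an already open file handle. As in the answer above, the '%.3f' format keeps trailing zeros.

```python
import numpy as np


def write_in_chunks(path, n, d, chunk_rows=100000):
    """Write n rows of d uniform random numbers, generating chunk_rows rows at a time."""
    with open(path, 'w') as f:
        remaining = n
        while remaining > 0:
            rows = min(chunk_rows, remaining)
            data = np.random.uniform(0, 1000, size=(rows, d))
            # Passing the open handle makes each chunk follow the previous one.
            np.savetxt(f, data, fmt='%.3f')
            remaining -= rows


if __name__ == '__main__':
    write_in_chunks('input.txt', n=1000, d=3, chunk_rows=250)
```

Only one chunk of `chunk_rows * d` floats is in memory at any moment, which addresses the memory concern for multi-gigabyte files.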
1

This could be a bit faster:

import random

nlines = 100000
col = 3
with open('input.txt', 'w') as f:
    for line in range(nlines):
        # The format string assumes col == 3.
        f.write('{} {} {}\n'.format(*(round(random.uniform(0, 1000), 3)
                                      for e in range(col))))

Or let the format specification do the rounding (nlines, col, and the open file f are as above):

for line in range(nlines):
    numbers = [random.uniform(0, 1000) for e in range(col)]
    f.write('{:6.3f} {:6.3f} {:6.3f}\n'.format(*numbers))

1 Comment

Not quite the same output format; using string formatting in this way will leave trailing zeros after the decimal point, where round does not. Some of the sample output had only two decimals, before I shortened it.
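The difference the comment describes can be seen in a small sketch. The three helper names are made up for illustration.

```python
def with_round(x):
    """Format as in the question: round, then convert to str."""
    return str(round(x, 3))


def with_format(x):
    """Fixed three decimals, no minimum width."""
    return '{:.3f}'.format(x)


def with_width(x):
    """Fixed three decimals, padded on the left to at least 6 characters."""
    return '{:6.3f}'.format(x)


if __name__ == '__main__':
    print(repr(with_round(986.5)))   # '986.5'
    print(repr(with_format(986.5)))  # '986.500'
    print(repr(with_width(5.25)))    # ' 5.250'
```

Besides the trailing zeros, the `6` in `{:6.3f}` is a minimum width, so values below 10 get a leading space that the question's sample output does not have.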
0

If you want to produce a very large file without fixing its size in advance, you can run a very long loop and stop it by hand, like this:

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(10**9):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')

f.close()

The code keeps running until you press Ctrl-C (or until all 10**9 iterations have finished).
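If the script is stopped with Ctrl-C, it may help to catch the interrupt so the file is closed cleanly. A minimal sketch, assuming a hypothetical helper `write_until_interrupted` with an optional `max_lines` cap (neither is part of the answer above):

```python
import random


def write_until_interrupted(path, d=3, max_lines=None):
    """Write lines until Ctrl-C is pressed or max_lines is reached; return the count."""
    written = 0
    with open(path, 'w') as f:  # the with block closes the file on exit
        try:
            while max_lines is None or written < max_lines:
                nums = [str(round(random.uniform(0, 1000), 3)) for _ in range(d)]
                f.write(' '.join(nums))
                f.write('\n')
                written += 1
        except KeyboardInterrupt:
            # Ctrl-C raises KeyboardInterrupt; stop quietly and let the file close.
            pass
    return written


# Example: write_until_interrupted('input.txt') runs until Ctrl-C is pressed.
```

If the interrupt arrives between the two write calls, the last line may lack its newline, so a consumer of the file should tolerate an incomplete final line.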
