How to create a big file quickly with Python

Question

I have the following code for producing a big text file:

import random

d = 3
n = 100000
f = open("input.txt",'a')
s = ""
for j in range(0, d-1):
    s += str(round(random.uniform(0,1000), 3))+" "
s += str(round(random.uniform(0,1000), 3))
f.write(s)
for i in range(0, n-1):
    s = ""
    for j in range(0, d-1):
        s += str(round(random.uniform(0,1000), 3))+" "
    s += str(round(random.uniform(0,1000), 3))
    f.write("\n"+s)
f.close()

But it seems to be pretty slow to even generate 5GB of this.

How can I make it faster? I want the output to look like this:

796.802 691.462 803.664
849.483 201.948 452.155
144.174 526.745 826.565
986.685 238.462 49.885
137.617 416.243 515.474
366.199 687.629 423.929

One obvious fault is that you first concatenate the data into a string and then write the whole string to the file. Further, as for any performance question, you need to use a profiler to find out where most time is spent.
– Ulrich Eckhardt, Commented Dec 6, 2015 at 14:36
@UlrichEckhardt what would you suggest? I need the elements of each line to be separated by a blank space.
– member555, Commented Dec 6, 2015 at 14:37

Tom Zych · Accepted Answer · 2015-12-06 15:01:13Z

Well, of course, the whole thing is I/O bound. You can't output the file faster than the storage device can write it. Leaving that aside, there are some optimizations that could be made.
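Whether the script really is I/O bound can be checked rather than assumed. Here is a rough sketch that times the number formatting on its own (into an in-memory buffer) and then again with a real file. The helper names `make_lines`, `time_generation_only` and `time_generation_and_write` are made up for this illustration, and the numbers you get will depend on your machine and storage device.

```python
import io
import os
import random
import tempfile
import time


def make_lines(n, d):
    """Yield n lines of d random numbers, formatted as in the question."""
    for _ in range(n):
        yield ' '.join(str(round(random.uniform(0, 1000), 3)) for _ in range(d)) + '\n'


def time_generation_only(n, d):
    """Time formatting the lines into an in-memory buffer (no disk involved)."""
    start = time.perf_counter()
    buf = io.StringIO()
    for line in make_lines(n, d):
        buf.write(line)
    return time.perf_counter() - start


def time_generation_and_write(n, d, path):
    """Time formatting the lines and writing them to a real file."""
    start = time.perf_counter()
    with open(path, 'w') as f:
        for line in make_lines(n, d):
            f.write(line)
        # Push the data out of Python's and the OS's buffers before stopping the clock.
        f.flush()
        os.fsync(f.fileno())
    return time.perf_counter() - start


if __name__ == '__main__':
    n, d = 200_000, 3
    with tempfile.TemporaryDirectory() as tmp:
        t_mem = time_generation_only(n, d)
        t_disk = time_generation_and_write(n, d, os.path.join(tmp, 'input.txt'))
    print(f'format only:      {t_mem:.2f} s')
    print(f'format and write: {t_disk:.2f} s')
```

If the two times are close, most of the cost is in generating and formatting the numbers, not in the disk, and the optimizations below matter more.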

Your method of building up a long string from several shorter strings is suboptimal. You're saying, essentially, s = s1 + s2. Because strings are immutable, Python has to concatenate the two string objects into a new string object each time. This is slow, especially when repeated in a loop.

A much better way is to collect the individual string objects in a list or other iterable, then use the join method to run them together. For example:

>>> ''.join(['a', 'b', 'c'])
'abc'
>>> ', '.join(['a', 'b', 'c'])
'a, b, c'

Instead of n-1 string concatenations to join n strings, this does the whole thing in one step.
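To see the difference on your own machine, something like the following timing sketch could be used. The function names `concat_plus` and `concat_join` are just for illustration. Keep in mind that CPython can sometimes extend a string in place for `s += p`, so the gap may be smaller than the explanation above suggests; measure before relying on it.

```python
import timeit


def concat_plus(parts):
    """Build the result with repeated += concatenation."""
    s = ''
    for p in parts:
        s += p
    return s


def concat_join(parts):
    """Build the result with a single join call."""
    return ''.join(parts)


if __name__ == '__main__':
    parts = ['123.456 '] * 10_000
    # Both approaches must produce the same string.
    assert concat_plus(parts) == concat_join(parts)
    t_plus = timeit.timeit(lambda: concat_plus(parts), number=100)
    t_join = timeit.timeit(lambda: concat_join(parts), number=100)
    print(f'+=   : {t_plus:.3f} s')
    print(f'join : {t_join:.3f} s')
```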

There's also a lot of repeated code that could be combined. Here's a cleaner design, still using the loops.

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(n):
    nums = []
    for j in range(d):
        nums.append(str(round(random.uniform(0, 1000), 3)))
    s = ' '.join(nums)
    f.write(s)
    f.write('\n')

f.close()

A cleaner, briefer, more Pythonic way is to use a list comprehension:

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(n):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')

f.close()

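Note that in both cases, I wrote the newline separately. That avoids building one more string object per line, and since file I/O is buffered, the extra write call does not cause an extra disk write. Whether it is actually faster than concatenating the newline is worth timing. If I were joining a list of strings without separators, I'd just tack on a newline as the last string before joining.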
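The "tack on one more string before joining" idea can also be applied to whole batches of lines, so that there is one write call per batch instead of two per line. This is only a sketch under my own assumptions: the function name `write_random_lines` and the `batch_size` default are invented here, and whether batching helps at all should be timed.

```python
import random


def write_random_lines(path, n, d, batch_size=10_000):
    """Write n lines of d random numbers, one write() call per batch of lines."""
    with open(path, 'w') as f:
        remaining = n
        while remaining > 0:
            rows = min(batch_size, remaining)
            lines = [
                ' '.join(str(round(random.uniform(0, 1000), 3)) for _ in range(d))
                for _ in range(rows)
            ]
            # An empty last element makes '\n'.join() end the batch with a newline.
            lines.append('')
            f.write('\n'.join(lines))
            remaining -= rows


if __name__ == '__main__':
    write_random_lines('input.txt', n=1000, d=3)
```

Only one batch of lines is held in memory at a time, so the file can be much larger than RAM.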

As Daniel's answer says, numpy is probably faster, but maybe you don't want to get into numpy yet; it sounds like you're kind of a beginner at this point.

Daniel · Answer · 2015-12-06 14:45:35Z

2

Using numpy is probably faster:

import numpy
d = 3
n = 100000
data = numpy.random.uniform(0, 1000,size=(n,d))
numpy.savetxt("input.txt", data, fmt='%.3f')


2 Comments

Mike Müller Over a year ago

data has to be in memory. Might be a problem for really large files.

Tom Zych Over a year ago

Could break it into large chunks. Still, I think the main problem is that it's I/O bound.
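A sketch of the chunked approach, assuming numpy is installed: generate a limited number of rows at a time and pass the same open file handle to numpy.savetxt for each chunk. The function name `write_in_chunks` and the `chunk_rows` default are made up for this example.

```python
import numpy


def write_in_chunks(path, n, d, chunk_rows=100_000):
    """Write n rows of d random numbers, holding at most chunk_rows rows in memory."""
    with open(path, 'w') as f:
        for start in range(0, n, chunk_rows):
            rows = min(chunk_rows, n - start)
            data = numpy.random.uniform(0, 1000, size=(rows, d))
            # savetxt accepts an open file handle, so each chunk is appended in turn.
            numpy.savetxt(f, data, fmt='%.3f')


if __name__ == '__main__':
    write_in_chunks('input.txt', n=1_000_000, d=3)
```

Memory use is then bounded by the chunk size rather than by the file size. Note that fmt='%.3f' always prints three decimals, so the output is not byte-identical to the round-based versions.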

Mike Müller · Answer · 2015-12-06 14:47:03Z

1

This could be a bit faster:

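import random

nlines = 100000
col = 3
with open('input.txt', 'w') as f:
    for line in range(nlines):
        # round() each value, then fill the three {} placeholders
        f.write('{} {} {}\n'.format(*(round(random.uniform(0, 1000), 3)
                                      for e in range(col))))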

or use string formatting:

with open('input.txt', 'w') as f:
    for line in range(nlines):
        numbers = [random.uniform(0, 1000) for e in range(col)]
        # fixed-point format: always three decimals, minimum width 6
        f.write('{:6.3f} {:6.3f} {:6.3f}\n'.format(*numbers))

edited Dec 6, 2015 at 14:47

answered Dec 6, 2015 at 14:38


1 Comment

Tom Zych Over a year ago

Not quite the same output format; using string formatting in this way will leave trailing zeros after the decimal point, where round does not. Some of the sample output had only two decimals, before I shortened it.
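A small illustration of the difference described in this comment, using values that are exactly representable as floats so the results are unambiguous:

```python
x = 12.5

# round() followed by str() drops trailing zeros.
rounded = str(round(x, 3))
print(rounded)  # 12.5

# Fixed-point formatting always prints three decimals.
fixed = '{:.3f}'.format(x)
print(fixed)  # 12.500

# A width of 6 additionally pads short values on the left with spaces.
padded = '{:6.3f}'.format(1.5)
print(repr(padded))  # ' 1.500'
```

So a file written with '{:6.3f}' can differ from the round-based output both in trailing zeros and in leading spaces.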

Ahmed talaat · Answer · 2021-01-23 10:43:11Z

0

If you don't want a fixed size limit and would rather let the file grow until you stop the program yourself, you can use a very long loop like this:

import random

d = 3
n = 1000

f = open('input.txt', 'w')

for i in range(10**9):
    nums = [str(round(random.uniform(0, 1000), 3)) for j in range(d)]
    f.write(' '.join(nums))
    f.write('\n')

f.close()

The code keeps running until you press Ctrl-C (or until the loop finishes after 10**9 lines).
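If the program is going to be stopped with Ctrl-C anyway, it may be worth catching the interrupt so the file is closed cleanly. The following is only a sketch of that idea; the function name `write_until_interrupted` and its `max_lines` parameter are invented for this example and are not part of the code above.

```python
import itertools
import random


def write_until_interrupted(path, d, max_lines=None):
    """Write random lines until Ctrl-C, or until max_lines lines if it is given.

    Returns the number of complete lines written.
    """
    counter = itertools.count() if max_lines is None else range(max_lines)
    written = 0
    # The with block closes (and flushes) the file even after an interrupt.
    with open(path, 'w') as f:
        try:
            for _ in counter:
                nums = [str(round(random.uniform(0, 1000), 3)) for _ in range(d)]
                f.write(' '.join(nums))
                f.write('\n')
                written += 1
        except KeyboardInterrupt:
            # Ctrl-C: stop generating and fall through to the normal cleanup.
            pass
    return written


if __name__ == '__main__':
    count = write_until_interrupted('input.txt', d=3)
    print(f'wrote {count} lines')
```

The last line of the file may be incomplete if the interrupt arrives between the two write calls.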

