Fastest/most optimized way to read/write ASCII file in python

Question

Let me update my question: I have an ASCII file (7 GB) with around 100M lines. I read this file using:

f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0) 

x=f[:,1] 
y=f[:,2] 
z=f[:,3] 
id=f[:,0]

I will need the x, y, z and id arrays later for interpolations. The problem is that reading the file takes around 80 min, while the interpolation only takes 15 min.

I measured the memory increment of each line of the script with the Python memory_profiler module.

The following line, which reads the entire 7.4 GB file, increments the memory usage by 3206.898 MiB (3.36 GB). My first question: why does it not increment the memory usage by 7.4 GB?

f=np.loadtxt(os.path.join(dir,myfile),delimiter=None,skiprows=0)

The following 4 lines do not increment the memory at all.

x=f[:,1] 
y=f[:,2] 
z=f[:,3] 
id=f[:,0]

Finally, I would still appreciate a recommendation: what is the most optimized way to read and write files in Python? Are NumPy's np.loadtxt and np.savetxt the best?

Thanks in advance,

You're reading text files, but the data is converted to numerical binary data, and binary weighs less than text. You should consider storing your files as binary, using scipy or a custom format (maybe pickle would do). You'll save time.
– Jean-François Fabre ♦, Commented Nov 29, 2016 at 14:25
"Why it does not increment the memory usage by 7.4 GB": because the string "1.2345667892323" takes up more space than the 8 bytes needed by a double.
– Eric, Commented Nov 29, 2016 at 14:51
Thanks. Converting my text file to binary pickles reduces the load time next time from 80 min to 2 s.
– Heli, Commented Nov 30, 2016 at 15:07

Eric · Accepted Answer · 2016-11-29 15:01:56Z

4

The most optimal way to write numeric data to a file is to not write it to an ASCII file.

Run this once to store your data in binary with np.save (similar in effect to pickling, but using NumPy's own .npy format):

# One-time conversion: parse the text file, then store it as binary .npy
np_file = os.path.splitext(myfile)[0] + '.npy'
data = np.loadtxt(os.path.join(dir, myfile), delimiter=None, skiprows=0)
np.save(os.path.join(dir, np_file), data)

Then you can load it next time as:

data = np.load(os.path.join(dir, np_file))
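
As a minimal sketch of that round trip, the snippet below writes a tiny whitespace-delimited text file to a temporary directory, converts it once with np.save, and reloads it with np.load. The file name sample.txt and the toy values are assumptions for illustration only; they are not from the original question.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the large ASCII file: id, x, y, z columns (illustrative values).
rows = np.array([[0, 1.0, 2.0, 3.0],
                 [1, 4.0, 5.0, 6.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, "sample.txt")    # hypothetical file name
    npy_path = os.path.splitext(txt_path)[0] + ".npy"

    np.savetxt(txt_path, rows)      # write the text version
    data = np.loadtxt(txt_path)     # one-time, slow text parse
    np.save(npy_path, data)         # store a binary copy

    reloaded = np.load(npy_path)    # what every later run would do instead

# The binary copy holds exactly the same values as the parsed text.
round_trip_ok = np.array_equal(data, reloaded)
print(round_trip_ok, reloaded.shape, reloaded.dtype)
```

If the array is too large to hold comfortably in memory, np.load also accepts mmap_mode='r', which maps the .npy file rather than reading it all at once; whether that helps depends on how the data is accessed afterwards.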

edited Nov 29, 2016 at 15:01

answered Nov 29, 2016 at 14:54


1 Comment

Heli Over a year ago

Thanks. Saving the data as an npy binary reduces the load time next time from 80 min to 2 s. Amazing!

hpaulj · Accepted Answer · 2016-11-29 16:57:46Z

np.savetxt and np.loadtxt just write and read the files line by line. Saving is essentially:

with open(...) as f:
    for row in arr:
        f.write(fmt % tuple(row))

where fmt has a % format specifier for each column of arr.
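
A runnable version of that loop might look like the sketch below. The helper name save_rows, the '%.2f' format, and the in-memory io.StringIO buffer are assumptions chosen to keep the example small; this is not NumPy's actual implementation.

```python
import io

import numpy as np


def save_rows(f, arr, fmt):
    """Write each row of a 2-D array to the open text file f using a % format."""
    for row in arr:
        f.write(fmt % tuple(row))


arr = np.array([[1.0, 2.5], [3.0, 4.0]])
# One '%.2f' per column, separated by spaces and ended with a newline.
fmt = " ".join(["%.2f"] * arr.shape[1]) + "\n"

buf = io.StringIO()     # in-memory stand-in for a real file
save_rows(buf, arr, fmt)
text = buf.getvalue()
print(text)
```

Every number is formatted to a string individually, which is part of why text output is slow compared with dumping the raw bytes.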

Loading is essentially:

alist = []
for row in f:  # i.e. one line at a time, as with f.readline()
    line = row.split(delimiter)
    # <convert types>: turn each string field into a number here
    alist.append(line)
np.array(alist)

It collects all the values of the text file in a list of lists, and converts that to an array once, at the end.
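
The loading loop can be sketched in runnable form as follows. The helper name load_rows and the choice to convert every field with float are assumptions for illustration; the real np.loadtxt handles dtypes, comments and converters that are left out here.

```python
import io

import numpy as np


def load_rows(f, delimiter=None):
    """Parse a text file of numbers into a 2-D float array, one line at a time."""
    alist = []
    for row in f:
        fields = row.split(delimiter)               # None splits on any whitespace
        alist.append([float(v) for v in fields])    # convert types
    return np.array(alist)                          # single conversion at the end


sample = io.StringIO("0 1.0 2.0 3.0\n1 4.0 5.0 6.0\n")
data = load_rows(sample)
print(data.shape, data.dtype)
```

Each value passes through a Python string and a Python float object before it reaches the array, which is likely a large part of why parsing 100M lines this way takes so long.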

An expression like x=f[:,0] doesn't change memory usage, since x is a view of f (see the NumPy docs on views vs. copies).
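
The view behaviour can be checked directly, and the same snippet gives a rough estimate of the array size. The small 5x4 array is a stand-in, and the estimate assumes exactly 100M rows of 4 float64 columns, which only approximates the file described in the question.

```python
import numpy as np

f = np.zeros((5, 4))    # small stand-in for the loaded array
x = f[:, 1]             # basic slicing returns a view, not a copy

shares = np.shares_memory(x, f)    # x reuses f's buffer
x[0] = 42.0                        # writing through the view...
changed = f[0, 1]                  # ...changes f as well

# Rough size of the full array, assuming 100M rows of 4 float64 columns.
expected_mib = 100_000_000 * 4 * 8 / 2**20
print(shares, changed, round(expected_mib))
```

Under those assumptions the estimate comes to about 3052 MiB, in the same range as the 3206.898 MiB reported by memory_profiler. That is consistent with the data being held as 8-byte doubles rather than as text.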

These NumPy functions work fine for modest-sized files, but people increasingly use them for large datasets, for example in text processing or data mining.
