Working with multiple Large Files in Python

Question

I have around 60 files, each containing around 900,000 lines, where each line holds 17 tab-separated floats. For each line I need to do a calculation that uses the corresponding lines from all 60 files. Because the files are large (about 400 MB each) and my computing resources are limited, this takes a very long time. Is there a way to do this faster?

What do you mean by fast? Do you want to process them in parallel, or just keep a small memory footprint? Do you need to keep the results from each line?
– omu_negru, Commented Jun 27, 2014 at 8:41

spinus · Accepted Answer · 2014-06-27 08:42:50Z

1

It depends on how you process them. If you have enough memory, you can read all the files first and convert them to Python data structures, then do the calculations.

If your files don't fit into memory, probably the easiest way is to use a distributed computing framework (Hadoop or a lighter alternative).

Another, smaller improvement is to use the Linux `posix_fadvise` call to declare how you will access the file (sequential reading or random access); it tells the operating system how to optimize file access.
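As a minimal sketch of that hint, assuming Python 3.3+ on a platform that exposes `os.posix_fadvise` (the helper name `open_sequential` is made up for illustration):

```python
import os


def open_sequential(path, mode="r"):
    """Open `path` and, where supported, hint that it will be read sequentially."""
    f = open(path, mode)
    # os.posix_fadvise is only available on some Unix platforms.
    if hasattr(os, "posix_fadvise"):
        try:
            # offset=0 and len=0 together mean "the whole file".
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        except OSError:
            # The call is only advice; ignore it if the filesystem refuses.
            pass
    return f
```

The hint does not change what the program reads, only how the kernel may schedule read-ahead, so any gain depends on the system and should be measured.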

If the calculations map onto common libraries such as NumPy or numexpr, which are heavily optimized, you can use them (this can help if your current code uses unoptimized algorithms).
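To illustrate the kind of gain meant here, the sketch below compares a pure-Python loop with a NumPy reduction. The question does not say what the calculation is, so the per-row mean over all files and the small random array are assumptions used only as stand-ins.

```python
import numpy as np

# Hypothetical stand-in for the real data: 3 files x 4 rows x 17 columns.
rng = np.random.default_rng(0)
stacked = rng.random((3, 4, 17))


def per_row_loop(stacked):
    """Pure-Python reference: mean of all values that share a row index."""
    n_files, n_rows, n_cols = stacked.shape
    out = []
    for r in range(n_rows):
        total = 0.0
        for f in range(n_files):
            for c in range(n_cols):
                total += stacked[f, r, c]
        out.append(total / (n_files * n_cols))
    return out


def per_row_vectorized(stacked):
    """Same result, computed by NumPy over the file and column axes."""
    return stacked.mean(axis=(0, 2))
```

Both functions return one value per row; the vectorized form runs the inner loops in compiled code instead of the interpreter.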

answered Jun 27, 2014 at 8:42

spinus


bruno desthuilliers · Accepted Answer · 2014-06-27 08:54:01Z

1

If "corresponding lines" means "first lines of all files, then second lines of all files, etc.", you can use `itertools.izip` (Python 2; in Python 3 the built-in `zip` is already lazy):

# cat f1.txt
1.1
1.2
1.3

# cat f2.txt
2.1
2.2
2.3

# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
...     print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n') 
('1.3\n', '2.3\n')
>>>
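The session above is Python 2. A possible Python 3 equivalent is sketched below; the function name `iter_corresponding_lines` is invented for this example, and `contextlib.ExitStack` is used only so that every file is closed when iteration ends.

```python
from contextlib import ExitStack


def iter_corresponding_lines(paths):
    """Yield one tuple per line index, holding that line from every file."""
    with ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        # zip is lazy in Python 3 and stops at the shortest file.
        for lines in zip(*files):
            yield lines
```

With the two sample files shown above, iterating over `iter_corresponding_lines(["f1.txt", "f2.txt"])` yields the same tuples as the `izip` session.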

answered Jun 27, 2014 at 8:54

bruno desthuilliers


DrV · Accepted Answer · 2014-06-27 09:04:25Z

A few options:

1. Just use the memory

You have 17 x 900000 = 15.3 M floats per file. Storing these as doubles (NumPy's default) takes roughly 120 MB of memory per file. You can halve this by storing them as float32, so that each file takes roughly 60 MB. With 60 files at 60 MB each, you have 3.6 GB of data.

This amount is not unreasonable if you use 64-bit Python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on how you access the data.
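The arithmetic for option 1, and one way to load a file at reduced precision, could be sketched as follows. The names `estimated_bytes` and `load_file` are illustrative, and `np.loadtxt` is just one possible reader for tab-separated text.

```python
import numpy as np

ROWS, COLS, FILES = 900_000, 17, 60


def estimated_bytes(dtype, n_files=FILES):
    """Raw array size in bytes for n_files files of ROWS x COLS values."""
    return ROWS * COLS * n_files * np.dtype(dtype).itemsize


def load_file(path, dtype=np.float32):
    """Read one tab-separated file into a 2-D array of the given dtype."""
    return np.loadtxt(path, delimiter="\t", dtype=dtype, ndmin=2)
```

`estimated_bytes(np.float32)` gives the roughly 3.6 GB figure quoted above, and `estimated_bytes(np.float64)` gives twice that.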

2. Do it row-by-row

If you can do it row-by-row, just read each file one row at a time. Having 60 files open at once will not cause any problems. This is probably the most efficient method if you process the files sequentially. The memory usage is next to nothing, and the operating system takes care of reading the files.

The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
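A rough sketch of option 2 follows. The actual calculation is not given in the question, so `column_sums` is a hypothetical placeholder, and `parse_row` and `process_rows` are names chosen for this example.

```python
def parse_row(line):
    """Turn one tab-separated text line into a list of floats."""
    return [float(x) for x in line.rstrip("\n").split("\t")]


def column_sums(rows):
    """Hypothetical calculation: sum each column across all files."""
    return [sum(col) for col in zip(*rows)]


def process_rows(paths, combine=column_sums):
    """Yield combine(rows) for each row index, holding one row per file."""
    handles = [open(p) for p in paths]
    try:
        # Only one line per file is in memory at any time.
        for lines in zip(*handles):
            yield combine([parse_row(line) for line in lines])
    finally:
        for h in handles:
            h.close()
```

Because `process_rows` is a generator, results can be written out as they are produced instead of being collected in a list.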

3. Preprocess your files and use mmap

You may also preprocess your files so that they are stored not as tab-separated text but in a binary format. That way each row takes exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use `numpy.memmap` to map the files into arrays of shape [N, 17]. You can handle these like ordinary arrays, and NumPy plus the operating system will take care of memory management.

The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.

This is probably the best solution if your data access is not sequential. In that case mmap is the fastest method, as it only reads the required blocks from the disk when they are needed. The operating system also caches the mapped pages, so memory use adapts to what is available.

Behind the scenes this is a close relative of solution #1, with the exception that nothing is loaded into memory until required. The same limitation applies to 32-bit Python: it cannot do this because it runs out of memory addresses.

The file conversion into binary is relatively fast and easy, almost a one-liner.
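One way the conversion and mapping might look is sketched below, assuming float32 storage and the 17-column layout from the question. The function names are made up for this example, and `np.loadtxt` reads one whole text file into memory during conversion.

```python
import numpy as np

N_COLS = 17


def text_to_binary(txt_path, bin_path, dtype=np.float32):
    """Convert one tab-separated text file to raw fixed-width binary."""
    data = np.loadtxt(txt_path, delimiter="\t", dtype=dtype, ndmin=2)
    data.tofile(bin_path)  # row-major, N_COLS * itemsize bytes per row
    return data.shape[0]


def open_mapped(bin_path, dtype=np.float32):
    """Map a converted file as a read-only [N, 17] array without loading it."""
    flat = np.memmap(bin_path, dtype=dtype, mode="r")
    return flat.reshape(-1, N_COLS)
```

After conversion, `open_mapped(path)[i]` touches only the disk blocks that hold row `i`. The binary file has no header, so the dtype and column count must be the same when writing and when mapping.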
