
I am trying to read a large number (~200,000) of small binary files as NumPy arrays into a dictionary in Python:

import os
import numpy as np

def readfiles(limit):
    filelist = {}
    i = 1
    for filename in os.listdir('folder'):
        # read each raw binary file as a flat float32 array
        filelist[filename] = np.fromfile('folder/' + filename, 'float32')
        i += 1
        if i > limit:
            break

    return filelist

The limit argument is just for testing with a smaller number of files; normally I would read all the files in the folder.
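For example, a quick smoke test might look like this (hypothetical usage, assuming 'folder' contains at least 100 files):

subset = readfiles(100)      # read only the first 100 files
print(len(subset))           # -> 100
name = next(iter(subset))    # one of the filenames in 'folder'
print(subset[name].dtype)    # -> float32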

The first time I run the script with a fairly large limit (90,000), it takes ~68 s. If I immediately re-run the script, it finishes in ~1.2 s. The cProfile results for the two runs:

>>> cProfile.run('readfiles(90000)')

90005 function calls in 68.768 seconds
Ordered by: standard name
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.284    0.284   68.690   68.690 <ipython-input-57-939c6a92cd68>:1(readfiles)
    1    0.079    0.079   68.768   68.768 <string>:1(<module>)
    1    0.000    0.000   68.768   68.768 {built-in method builtins.exec}
90000   68.313    0.001   68.313    0.001 {built-in method numpy.core.multiarray.fromfile}
    1    0.093    0.093    0.093    0.093 {built-in method posix.listdir}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


>>> cProfile.run('readfiles(90000)')

90005 function calls in 1.970 seconds
Ordered by: standard name
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.137    0.137    1.900    1.900 <ipython-input-57-939c6a92cd68>:1(readfiles)
    1    0.070    0.070    1.970    1.970 <string>:1(<module>)
    1    0.000    0.000    1.970    1.970 {built-in method builtins.exec}
90000    1.673    0.000    1.673    0.000 {built-in method numpy.core.multiarray.fromfile}
    1    0.090    0.090    0.090    0.090 {built-in method posix.listdir}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Subsequently, when I rerun the script in a completely different session, I still get ~1.2 s. This seems rather strange to me. It appears that np.fromfile is not truly re-reading the files after it has done so once, but is instead reading some cached data the second time. However, I have not heard of cached data being reused across sessions in a situation like this. Is that right? If yes, how do I change this so that the code actually re-reads the files? If not, why does the first run take so long?

I am using Python 3.5.1 with NumPy 1.11.2.

Edit: After restarting the system, I get the longer runtime back, so this must be OS-level caching, as pointed out in the comments. Is there any way around that without rebooting my system?
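A related workaround that needs no root privileges would be to evict individual files from the page cache before timing them. A minimal sketch, assuming Linux and Python 3.3+ (os.posix_fadvise is POSIX-only; the path below is a placeholder):

import os
import time
import numpy as np

path = 'folder/somefile'  # placeholder: any one of the binary files

def timed_read(path):
    start = time.perf_counter()
    np.fromfile(path, 'float32')
    return time.perf_counter() - start

print('warm read: %.6f s' % timed_read(path))  # likely served from the page cache

# ask the kernel to drop this file's pages from the page cache
fd = os.open(path, os.O_RDONLY)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

print('cold read: %.6f s' % timed_read(path))  # should hit the disk again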

  • How completely different is your new session? I think there is some caching going on at the OS/filesystem level, which would survive, say, just starting a new Python interpreter. No expert, though. Commented Feb 7, 2017 at 12:35
  • I only closed all open terminals and terminated all interactive python sessions. I didn't reboot the system. The code is supposed to run on a cluster so rebooting is not an option, although I can try and see if that helps on my machine. Commented Feb 7, 2017 at 12:46
  • Yup! Restarting seems to clear whatever the cache was. I will edit the question with this information. Commented Feb 7, 2017 at 12:55
  • Which operating system are you running this on? Commented Feb 7, 2017 at 15:05
  • I am using Ubuntu 14.04 Commented Feb 7, 2017 at 15:55

1 Answer


As mentioned in the comments, the following command (run as root) flushes dirty pages to disk and then drops the page cache, dentries, and inodes:

sync; echo 3 > /proc/sys/vm/drop_caches
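Note that writing to /proc/sys/vm/drop_caches requires root, and a plain sudo echo 3 > /proc/sys/vm/drop_caches fails because the redirection is performed by the unprivileged shell before sudo runs. A common way to invoke it:

sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'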