
I am trying to build a dict(dict(dict())) out of multiple files, which are stored in different numbered directories, i.e.

/data/server01/datafile01.dat
/data/server01/datafile02.dat
...
/data/server02/datafile01.dat
/data/server02/datafile02.dat
...
/data/server86/datafile01.dat
...
/data/server86/datafile99.dat

I have a couple problems at the moment:

  1. Switching between directories

I know that I have 86 servers, but the number of files per server may vary. I am using:

for i in range(1,86):
    basedir='/data/server%02d' % i
    for file in glob.glob(basedir+'*.dat'):
        Do reading and sorting here

but I can't seem to switch between the directories properly. It just sits in the first one and seems to get stuck when there are no files in the directory.

  2. Checking if key already exists

I would like to have a function that checks whether a key is already present and, if it isn't, creates that key and certain subkeys, since one can't assign dict[Key1][Subkey1][Subsubkey1]=value while the intermediate keys don't exist yet.

BTW, I am using Python 2.6.6.

  • You are missing a / between basedir and your glob. Commented Jan 21, 2012 at 19:01
  • You should use os.path.join to build paths. Commented Jan 21, 2012 at 19:02
  • Also don't shadow the builtin file. Commented Jan 21, 2012 at 19:02
  • Are your two questions related? Commented Jan 21, 2012 at 19:02
  • They came up in the same code. I am just trying not to open too many threads. Commented Jan 22, 2012 at 5:31

3 Answers

Björn helped with the defaultdict half of your question. His suggestion should get you very close to where you want to be in terms of the default value for keys that do not yet exist.

The best tool for walking a directory and looking at files is os.walk. You can combine the directory and file names that you get from it with os.path.join to find the files you are interested in. Something like this:

import os

data_path = '/data'

# option 1 using a nested generator expression**
data_files = (os.path.join(root,f) for (root, dirs, files) in os.walk(data_path)
                                   for f in files)   # can use [] instead of ()

# option 2 using nested for loops
data_files = []
for root, dirs, files in os.walk(data_path):
    for f in files:
        data_files.append(os.path.join(root, f))

for data_file in data_files:
    pass  # ... process data_file ...

**Docs for list comprehensions.
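When the directory names are known up front, as in the question, the original loop can also be repaired directly rather than replaced with os.walk. The following is only a sketch: the function name `list_dat_files` and its parameters are made up for illustration, and it assumes the `serverNN` naming shown in the question.

```python
import glob
import os


def list_dat_files(data_root, n_servers):
    """Return the .dat files under data_root/server01 .. server<n_servers>.

    Hypothetical helper; the name and signature are not from the question.
    """
    found = []
    # range() excludes its upper bound, so add 1 to include the last server.
    for i in range(1, n_servers + 1):
        server_dir = os.path.join(data_root, 'server%02d' % i)
        # os.path.join supplies the separator that basedir + '*.dat' lacked.
        pattern = os.path.join(server_dir, '*.dat')
        # glob.glob returns an empty list for a missing or empty directory,
        # so the loop simply moves on to the next server.
        for path in sorted(glob.glob(pattern)):
            found.append(path)
    return found
```

For the layout in the question this would be called as `list_dat_files('/data', 86)`.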


I can't help you with your first problem, but the second one can be solved by using a defaultdict. This is a dictionary that calls a function to generate a value whenever a requested key does not exist. Using lambda you can nest them:

>>> from collections import defaultdict
>>> your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
>>> your_dict[1][2][3]
0

1 Comment

For infinite nesting you can use def default(): return defaultdict(default). Recursion!
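A minimal sketch of that recursive factory follows. The keys `'server01'`, `'datafile01'` and `'count'` are placeholders chosen for illustration, not names from the question.

```python
from collections import defaultdict


def default():
    # Each missing key gets another defaultdict built by this same function,
    # so the nesting can go as deep as needed.
    return defaultdict(default)


tree = default()
# Intermediate levels are created on the fly (keys here are illustrative).
tree['server01']['datafile01']['count'] = 3

# Merely reading a missing key would create it, so test membership with `in`.
has_server02 = 'server02' in tree
```

One thing to watch for: because a lookup of a missing key inserts it, a typo in a key silently adds an empty branch instead of raising KeyError.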

I'm assuming these 'directories' are remotely mounted shares?

Couple of things:

  1. I'd use os.path.join instead of 'basedir' + '*.dat'
  2. For FS related stuff I've had very good results parallelizing the computation using multiprocessing.Pool to get around those times where a remote fs might be extremely slow and hold up the whole process.

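    import os
    import glob
    import multiprocessing as mp

    def processDir(path):
        results = {}
        # 'filename' avoids shadowing the builtin 'file'
        for filename in glob.iglob(os.path.join(path, '*.dat')):
            # read filename and add its contents to results here
            pass
        return results

    if __name__ == '__main__':
        # range(1, 87) covers server01 through server86
        dirpaths = ['/data/server%02d' % i for i in range(1, 87)]
        pool = mp.Pool(8)
        _results = pool.map(processDir, dirpaths)
        pool.close()
        pool.join()
        # combine the per-directory dicts in _results into one result here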
    

For your dict-related problems, use defaultdict, as mentioned in the other answers, or write your own dict subclass or a helper function:

def addresult(results,key,subkey,subsubkey,value):
    if key not in results:
        results[key] = {}
    if subkey not in results[key]:
        results[key][subkey] = {}
    if subsubkey not in results[key][subkey]:
        results[key][subkey][subsubkey] = value

There are almost certainly more efficient ways to accomplish this, but that's a start.
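One more compact variant, offered as a sketch rather than a definitive answer, uses dict.setdefault, which is available in Python 2.6. The name `add_result` is made up here; like the function above, it leaves an existing value untouched.

```python
def add_result(results, key, subkey, subsubkey, value):
    # setdefault returns the existing value for a key, or stores the given
    # default and returns it, so each level is created only when missing.
    inner = results.setdefault(key, {}).setdefault(subkey, {})
    # An existing leaf value is kept; value is stored only for a new key.
    inner.setdefault(subsubkey, value)
    return results
```

If later values should overwrite earlier ones, the last line of the body would become a plain assignment, `inner[subsubkey] = value`.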
