Python - Efficiently building a dictionary

Question

I am trying to build a dict(dict(dict())) out of multiple files, which are stored in different numbered directories, i.e.

/data/server01/datafile01.dat
/data/server01/datafile02.dat
...
/data/server02/datafile01.dat
/data/server02/datafile02.dat
...
/data/server86/datafile01.dat
...
/data/server86/datafile99.dat

I have a couple of problems at the moment:

Switching between directories

I know that I have 86 servers, but the number of files per server may vary. I am using:

for i in range(1,86):
    basedir='/data/server%02d' % i
    for file in glob.glob(basedir+'*.dat'):
        Do reading and sorting here

but I can't seem to switch between the directories properly. It seems to stay in the first one and get stuck when there are no files in the directory.

Checking if key already exists

I would like to have a function that checks whether a key is already present and, if it isn't, creates that key and certain subkeys, since one can't assign dict[Key1][Subkey1][Subsubkey1]=value while the intermediate keys are missing.

BTW, I am using Python 2.6.6.

They came up in the same code. I am just trying not to open too many threads. – madtowneast, Commented Jan 22, 2012 at 5:31

istruble · Accepted Answer · 2012-01-23 17:04:58Z

2

Björn helped with the defaultdict half of your question. His suggestion should get you very close to where you want to be in terms of the default value for keys that do not yet exist.

The best tool for walking a directory tree and looking at files is os.walk. You can combine the directory and file names that you get from it with os.path.join to build the paths of the files you are interested in. Something like this:

import os

data_path = '/data'

# option 1 using a nested comprehension** (written as a generator expression)
data_files = (os.path.join(root, f) for (root, dirs, files) in os.walk(data_path)
                                    for f in files)   # use [] instead of () to get a list

# option 2 using nested for loops
data_files = []
for root, dirs, files in os.walk(data_path):
    for f in files:
        data_files.append(os.path.join(root, f))

for data_file in data_files:
    pass  # ... process data_file ...

**Docs for list comprehensions.
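If only the .dat files matter and they need to be grouped by server directory, the walk can be filtered as it goes. The following is a minimal sketch rather than part of the original answer: collect_dat_files is a hypothetical helper name, and it assumes the server name is simply the last component of each directory path.

```python
import fnmatch
import os


def collect_dat_files(data_path):
    """Map each directory name under data_path to its sorted .dat file paths."""
    files_by_server = {}
    for root, dirs, files in os.walk(data_path):
        matches = fnmatch.filter(files, '*.dat')
        if not matches:
            continue  # a directory without .dat files is skipped, not an error
        server = os.path.basename(root)  # assumption: e.g. 'server01'
        files_by_server[server] = sorted(os.path.join(root, f) for f in matches)
    return files_by_server
```

Because os.walk visits whatever directories exist, there is no need to know the number of servers or files in advance, and empty directories simply contribute nothing.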

edited Jan 23, 2012 at 17:04

answered Jan 21, 2012 at 19:32


Björn Pollex · Accepted Answer · 2012-01-21 18:58:41Z

2

I can't help you with your first problem, but the second one can be solved by using a defaultdict. This is a dictionary that calls a function to generate a value whenever a requested key does not exist. Using lambda you can nest them:

>>> from collections import defaultdict
>>> your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
>>> your_dict[1][2][3]
0
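As a sketch of how this removes the existence checks from the question, the loop below fills the nested structure from a few made-up (server, file, field, value) records. The record contents are purely illustrative assumptions.

```python
from collections import defaultdict

# Three levels of dictionaries; the innermost default value is int() == 0.
your_dict = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

# Hypothetical records: (server, file, field, value)
records = [
    ('server01', 'datafile01.dat', 'count', 3),
    ('server01', 'datafile02.dat', 'count', 5),
    ('server02', 'datafile01.dat', 'count', 7),
]

for server, filename, field, value in records:
    # No "if key in dict" checks needed: missing levels are created on access.
    your_dict[server][filename][field] += value
```

Keep in mind that merely reading a missing key with square brackets also inserts it. To test for a key without creating it, use the in operator or .get().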

answered Jan 21, 2012 at 18:58


1 Comment

Jasmijn Over a year ago

For infinite nesting you can use def default(): return defaultdict(default). Recursion!
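Spelled out, the recursive version from this comment might look like the sketch below. The name tree and the keys are illustrative assumptions.

```python
from collections import defaultdict


def default():
    # Each missing key gets another defaultdict built by this same function.
    return defaultdict(default)


tree = default()
tree['server01']['datafile01.dat']['field'] = 42  # any depth works
```

The trade-off is that there is no numeric default at the bottom: a missing key at any level yields another dictionary, so leaf values have to be assigned explicitly.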

Cyclone · Accepted Answer · 2012-01-21 19:34:05Z

I'm assuming these 'directories' are remotely mounted shares?

Couple of things:

I'd use os.path.join instead of 'basedir' + '*.dat'

For filesystem-related work, I've had very good results parallelizing the computation with multiprocessing.Pool, which gets around those times when a remote filesystem is extremely slow and holds up the whole process.

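import os
import glob
import multiprocessing as mp

def processDir(path):
    # Collect the results for every .dat file in one server directory.
    results = {}
    for filename in glob.iglob(os.path.join(path, '*.dat')):
        # placeholder: store whatever you parse from each file here
        results[filename] = None
    return results

if __name__ == '__main__':
    dirpaths = ['/data/server%02d' % i for i in range(1, 87)]
    pool = mp.Pool(8)
    _results = pool.map(processDir, dirpaths)
    pool.close()
    pool.join()
    # placeholder: combine the per-directory dicts however you need
    results = {}
    for r in _results:
        results.update(r)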
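The os.path.join call and range(1, 87) in this code differ from the loop in the question in two ways that likely explain the trouble there. The short check below is only an illustration of what each spelling produces; the variable names are made up for the comparison.

```python
import os

basedir = '/data/server%02d' % 1

# String concatenation drops the path separator, so the pattern matches
# siblings of the directory rather than files inside it.
concatenated = basedir + '*.dat'

# os.path.join inserts the separator between directory and pattern.
joined = os.path.join(basedir, '*.dat')

# range() excludes its end value: range(1, 86) stops at 85, range(1, 87) at 86.
last_with_86 = list(range(1, 86))[-1]
last_with_87 = list(range(1, 87))[-1]
```

With the concatenated pattern, glob looks for names like /data/server01*.dat, which would match nothing inside the server directories.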

For your dict-related problems, use defaultdict, as mentioned in the other answers, or even your own dict subclass or a helper function:

def addresult(results,key,subkey,subsubkey,value):
    if key not in results:
        results[key] = {}
    if subkey not in results[key]:
        results[key][subkey] = {}
    if subsubkey not in results[key][subkey]:
        results[key][subkey][subsubkey] = value

There are almost certainly more efficient ways to accomplish this, but that's a start.
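One possible tightening, offered as a sketch rather than a definitive replacement, uses dict.setdefault, which is available in Python 2.6. The name addresult_setdefault is hypothetical, chosen only to keep it apart from the function above.

```python
def addresult_setdefault(results, key, subkey, subsubkey, value):
    # setdefault returns the existing value for a key, or stores and returns
    # the supplied default when the key is missing.
    leaf = results.setdefault(key, {}).setdefault(subkey, {})
    # Like the function above, keep the first value seen for a subsubkey.
    leaf.setdefault(subsubkey, value)
    return results
```

It works on plain dicts, so nothing is created by accident when a missing key is merely read.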
