Getting and setting dict items via paths

Question

I wanted to have something more convenient to work with nested dictionaries that come from yaml files so I wrote this DictIndexter. Its __getitem__ and __setitem__ accept a path instead of a single key. Additionally the getter can also take a tuple with path and default value. The setter on the other hand is able to create new branches, but only when it's a clean branch. This is a branch assigned to a non-existing key. It wont overwrite another value.

Implementation


class DictIndexer:
    def __init__(self, source: Dict[str, Any], set_creates_branches: bool = True):
        self.source = source
        self.set_creates_branches = set_creates_branches

    def __getitem__(self, path: str | Tuple[str, Any]) -> Any | None:
        path, default = path if isinstance(path, tuple) else (path, None)
        item = self.source
        for key in DictIndexer._split(path):
            item = item.get(key, None)
            if item is None:
                return default
        else:
            return item

    def __setitem__(self, path: str, value: Any):
        class Missing:
            pass

        names = DictIndexer._split(path)
        prev = self.source
        item = self.source
        for key in names[:-1]:
            item = item.get(key, Missing())
            match item:
                case dict():
                    prev = item
                case Missing():
                    if self.set_creates_branches:
                        prev[key] = {}
                    else:
                        raise KeyError(f"Cannot create branch '{key}. Disabled")
                case _:
                    raise KeyError(f"Cannot overwrite '{key}.")
        else:
            item[names[-1]] = value

    @staticmethod
    def _split(path: str) -> List[str]:
        return re.split(r"[/\\]", path)

Tests

With pytest I made sure that the following features work:


def test_can_get_item_by_path():
    data = {
        "foo": {
            "bar": "baz"
        }
    }

    assert DictIndexer(data)["foo/bar"] == "baz"


def test_gets_null_when_last_name_does_not_exist():
    data = {
        "foo": {
            "bar": "baz"
        }
    }

    assert DictIndexer(data)["foo/baz"] is None


def test_gets_null_or_default_when_middle_name_does_not_exist():
    data = {
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    }

    assert DictIndexer(data)["foo/baz/baz"] is None
    assert DictIndexer(data)[("foo/baz/baz", "qux")] == "qux"


def test_can_set_item():
    data = {
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    }

    DictIndexer(data)["foo/bar/baz"] = "qux"
    assert DictIndexer(data)["foo/bar/baz"] == "qux"


def test_can_set_item_on_new_branches():
    data = {
        "foo": {
            "bar": {
                "baz": {
                    "foo": "bar"
                }
            }
        }
    }

    DictIndexer(data)["foo/bar/baz/qux"] = "qux"
    assert DictIndexer(data)["foo/bar/baz/qux"] == "qux"


def test_does_not_create_new_branches_when_disabled():
    data = {
        "foo": {
            "bar": {

            }
        }
    }

    with pytest.raises(KeyError):
        DictIndexer(data, set_creates_branches=False)["foo/bar/baz/qux"] = "qux"


def test_does_not_overwrite_value_with_dict():
    data = {
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    }

    with pytest.raises(KeyError):
        DictIndexer(data)["foo/bar/baz/qux"] = "qux"

Example

I currently use it to customize the logger cofiguration:

def _initialize_logging():
    import logging.config
    config = DictIndexer(assets.cfg("wiretap"))
    config["formatters/auto/()"] = wiretap.MultiFormatter
    config["formatters/auto/./values"] = [f"v{app.VERSION}-{os.environ['mode'][0]}"]
    config["handlers/file/filename"] = assets.home("log", f"{app.ID}.log")
    config["handlers/sqlserver/insert"] = assets.sql("wiretap", "insert log.sql")
    logging.config.dictConfig(config.source)

Why do you think? What would you improve?

You offer both {getter, setter}, cool! FYI, here is a nice package focused on the getter use case: pypi.org/project/glom — J_H
– J_H, Commented Mar 29, 2023 at 16:30

SylvainD · Accepted Answer · 2023-03-29 19:13:25Z

Overview

The code looks good, properly typed, tested. This is a great starting point.

Here are various details can could be improved.

Documentation

Your code could definitely be improved with docstrings.

Imports

This is more about the StackExchange question than about the code itself but providing the imports can make testing/reviewing easier for others.

from typing import Dict, Any, Tuple, List
import re

Tests: splitting the setup from the logic being tested

This is mostly a matter of personal preference but I find a line such as assert DictIndexer(data)[("foo/baz/baz", "qux")] == "qux" a bit hard to grasp. The test cases can probably be made easier to read if the DictIndexer was created as an initial step and then used to perform queries.

We'd get something like:

def test_gets_null_or_default_when_middle_name_does_not_exist():
    data = DictIndexer({
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    })

    assert data["foo/baz/baz"] is None
    assert data[("foo/baz/baz", "qux")] == "qux"

This is even more relevant when with pytest.raises(KeyError) is used as you definitely do not want the constructor of DictIndexer to be the part raising the error.

Splitting

Your method to split relies on regexp. It seems slightly overkill in this particular case as you are using a single character for splitting and str.split would do the trick.

Also, you could make your code slightly more generic by providing the separator in the constructor.

    def __init__(self, source: Dict[str, Any], set_creates_branches: bool = True, separator='/'):
        self.source = source
        self.set_creates_branches = set_creates_branches
        self.separator = separator

    (...)

    def _split(self, path: str) -> List[str]:
        return path.split(self.separator)

Getting elements with default value

I find the way the default value can be provided to __getitem__ slightly awkward. My preference would be to rely on how things are done for dict:

d[key]: Return the item of d with key key. Raises a KeyError if key is not in the map.
get(key[, default]): Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.

    def get(self, path, default=None):
        try:
            return self[path]
        except KeyError:
            return default

    def __getitem__(self, path: str) -> Any | None:
        item = self.source
        for key in self._split(path):
            item = item.get(key, None)
            if item is None:
                raise KeyError(f"Key '{key} not found.")
        else:
            return item

else for for loop

The optional else for for loops is a nice idiom which deserves some love and can be used to perform an operation in the case where break did not occur in the loop and the loops stops in a "normal" way.

In our particular case, there is no break involved at all so I'm not so sure that it is relevant.

Tuple unpacking using asterisks

Instead of messing with indices with names in __setitem__, you could directly unpack it into the parts you are interested in: *names, last_name = self._split(path).

Final code

At this stage, the code looks like this:

from typing import Dict, Any, Tuple, List

class DictIndexer:
    def __init__(self, source: Dict[str, Any], set_creates_branches: bool = True, separator='/'):
        self.source = source
        self.set_creates_branches = set_creates_branches
        self.separator = separator

    def get(self, path, default=None):
        try:
            return self[path]
        except KeyError:
            return default

    def __getitem__(self, path: str) -> Any | None:
        item = self.source
        for key in self._split(path):
            item = item.get(key, None)
            if item is None:
                raise KeyError(f"Key '{key} not found.")
        return item

    def __setitem__(self, path: str, value: Any):
        class Missing:
            pass

        *names, last_name = self._split(path)
        prev = self.source
        item = self.source
        for key in names:
            item = item.get(key, Missing())
            match item:
                case dict():
                    prev = item
                case Missing():
                    if self.set_creates_branches:
                        prev[key] = {}
                    else:
                        raise KeyError(f"Cannot create branch '{key}. Disabled")
                case _:
                    raise KeyError(f"Cannot overwrite '{key}.")
        item[last_name] = value

    def _split(self, path: str) -> List[str]:
        return path.split(self.separator)


def test_can_get_item_by_path():
    data = DictIndexer({
        "foo": {
            "bar": "baz"
        }
    })
    assert data["foo/bar"] == "baz"


def test_gets_null_when_last_name_does_not_exist():
    data = DictIndexer({
        "foo": {
            "bar": "baz"
        }
    })
    try:
        data["foo/baz"]
    except KeyError:
        pass
    assert data.get("foo/baz") is None 



def test_gets_null_or_default_when_middle_name_does_not_exist():
    data = DictIndexer({
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    })
    try:
        data["foo/baz/baz"]
    except KeyError:
        pass
    assert data.get("foo/baz/baz", "qux") == "qux"


def test_can_set_item():
    data = DictIndexer({
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    })

    assert data["foo/bar/baz"] != "qux"
    data["foo/bar/baz"] = "qux"
    assert data["foo/bar/baz"] == "qux"


def test_can_set_item_on_new_branches():
    data = DictIndexer({
        "foo": {
            "bar": {
                "baz": {
                    "foo": "bar"
                }
            }
        }
    })

    data["foo/bar/baz/qux"] = "qux"
    assert data["foo/bar/baz/qux"] == "qux"


def test_does_not_create_new_branches_when_disabled():
    data = DictIndexer({
        "foo": {
            "bar": {

            }
        }
    }, set_creates_branches=False)

    try:
        data["foo/bar/baz/qux"] = "qux"
    except KeyError:
        pass



def test_does_not_overwrite_value_with_dict():
    data = DictIndexer({
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    })

    try:
        data["foo/bar/baz/qux"] = "qux"
    except KeyError:
        pass

Edit

Something that I had forgotten: now that __getitem__ throws, one could directly use:

    def __getitem__(self, path: str) -> Any | None:
        item = self.source
        for key in self._split(path):
            item = item[key]
        return item

Similarly, we could imagine getting rid of the Missing class and work with exceptions directly:

    def __setitem__(self, path: str, value: Any):
        *names, last_name = self._split(path)
        prev = self.source
        item = self.source
        for key in names:
            try:
                item = item[key]
            except KeyError:
                if self.set_creates_branches:
                    prev[key] = {}
                    continue
                else:
                    raise KeyError(f"Cannot create branch '{key}. Disabled")
            match item:
                case dict():
                    prev = item
                case _:
                    raise KeyError(f"Cannot overwrite '{key}.")
        item[last_name] = value

These are great advices! I've learnt some new tricks and better practices. Thanks! — t3chb0t
– t3chb0t, Commented Mar 28, 2023 at 8:56
@t3chb0t I'm glad you like it! FYI, I've added an additional suggestion. — SylvainD
– SylvainD, Commented Mar 28, 2023 at 9:58
I like it very much. Code must be like art and now it is ;-] — t3chb0t
– t3chb0t, Commented Mar 28, 2023 at 16:07

AJNeufeld · Accepted Answer · 2023-03-29 16:23:38Z

Bug

I'm not liking some code that I see in __setitem__, so let's run a few more tests:

>>> data = {"foo": {}}
>>> DictIndexer(data)["foo/bar/baz/qux"] = "qux"
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    DictIndexer(data)["foo/bar/baz/qux"] = "qux"
  File "C:/Users/aneufeld/Documents/Experiments/DictIndex.py", line 27, in __setitem__
    item = item.get(key, Missing())
AttributeError: 'Missing' object has no attribute 'get'

Yup. Creating more than one missing path segment is definitely a problem.

>>> data
{'foo': {'bar': {}}}

It has only created the first missing key & dictionary. So let's try a shorter test:

>>> data = {"foo": {}}
>>> DictIndexer(data)["foo/bar/qux"] = "qux"
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    DictIndexer(data)["foo/bar/qux"] = "qux"
  File "C:/Users/aneufeld/Documents/Experiments/DictIndex.py", line 39, in __setitem__
    item[names[-1]] = value
TypeError: 'Missing' object does not support item assignment

Interesting. That fails in a slightly different way.

Let's look at the code:

        prev = self.source
        item = self.source
        for key in names[:-1]:
            item = item.get(key, Missing())
            match item:
                case dict():
                    prev = item
                case Missing():
                    if self.set_creates_branches:
                        prev[key] = {}
                    else:
                        raise KeyError(f"Cannot create branch '{key}. Disabled")
                case _:
                    raise KeyError(f"Cannot overwrite '{key}.")

Yup. We initializes prev, then we have a loop the update the value of prev at each iteration ... except when a missing key is encountered, something can be done and the loop goes on to the next iteration without prev being updated.

Wait ... no ... the error messages are talking about the Missing object! This is not the bug I was expecting; it is an entirely different bug!

item gets left as Missing() and,

if it was the last iteration we can't set the final key value into item,
if it wasn't the last iteration, we can't look up the next path key from item.

When you create a new branch, you need to update prev to the newly created dictionary (the bug I thought I saw), but you also need to set item to that same newly created dictionary ...

            ...
            item = item.get(key, Missing())   # Intended action: update item
            match item:
                case dict():
                    prev = item               # Prep for next loop
                case Missing():
                    if self.set_creates_branches:
                        prev[key] = {}        # Create missing dictionary
                        item = prev.get(key)  # <-- Fix: Replicate intended action
                        prev = item           # <-- Fix: prep for next loop
                    ...

We can re-work this to remove inefficiencies. Note that before and after this code block, prev and item are the same, so we can move prev entirely inside the loop.

        item = self.source
        for key in names[:-1]:
            prev = item
            item = item.get(key, Missing())
            match item:
                case dict():
                    pass
                case Missing():
                    if self.set_creates_branches:
                        item = prev[key] = {}
                    else:
                        raise KeyError(f"Cannot create branch '{key}. Disabled")
                case _:
                    raise KeyError(f"Cannot overwrite '{key}.")
        else:
            item[names[-1]] = value

Testing

Tests are good! I like testing!

But, are you really testing the right things?

def test_can_set_item():
    data = {
        "foo": {
            "bar": {
                "baz": "baz"
            }
        }
    }

    DictIndexer(data)["foo/bar/baz"] = "qux"
    assert DictIndexer(data)["foo/bar/baz"] == "qux"

Have any other keys/subkeys been added to data? Any been accidentally removed? Yes, the key you are setting is tested to verify getting it returns the value you just set, but is that the only effect?

    assert data == {"foo": {"bar": {"baz": "qux"}}}

Now we're testing data is properly modified.

This review contains solid advice that the OP would be wise to follow: when testing, don't assert so narrowly that you effectively assume your code's correctness. I re-learned this lesson recently. A small project I'm working on had accumulated a couple dozen tests that were too repetitive. I wrote an assertion-making helper function that most of the tests were able to use. I thought I was just getting rid of code repetition, but there was an unanticipated side effect: for several tests, use of the helper widened their assertion-making considerably – exposing a couple of edge-case bugs! — FMc
– FMc, Commented Mar 29, 2023 at 16:45

Peilonrayz · Accepted Answer · 2023-03-29 00:24:57Z

Your __getitem__ function indicates two conflicting ideas.

You call out __getitem__ can return None.
```
def __getitem__(...) -> Any | None:
```

Your function cannot return None unless default is None.

item = item.get(key, None)
if item is None:
    return default

>>> DictIndexer({"a": None})["a", "default value"]
default value

Because of the code in the second item, your class doesn't work in a similar way to dict.get. A fairly large Python idiom is EAFP. To easily get the same interface as dict.get is really simple with EAFP.

>>> {"a": None}.get("a", "default value")
None

def __getitem__(self, path: str | Tuple[str, Any]) -> Any | None:
    path, default = path if isinstance(path, tuple) else (path, None)
    try:
        item = self.source
        for key in self._split(path):
            item = item[key]
        return item
    except KeyError:
        return default

I can also see a potential for conflicting ideas on type hints - Any | None is Any. I did things like Any | None when I wanted to make the class generic but didn't fully know how. So I'm going to go through an over-kill entirely generic class.

First we can define a recursive generic alias for a dictionary of dictionaries or values. Basically the dict tree we're interacting with.

TKey = TypeVar("TKey")
TValue = TypeVar("TValue")

ATree: TypeAlias = dict[TKey, TValue] | TValue | dict[TKey, "ATree[TKey, TValue]"]

From here we can define a generic class which takes a Tree as input. And we'll define _split's type hints to remove errors with mypy.

class DictIndexer(Generic[TKey, TValue]):
    def __init__(self, source: Tree[TKey, TValue]) -> None:
        self.source = source

    @classmethod
    def _split(cls, path: TKey) -> Iterable[TKey]:
        raise NotImplementedError(f"{cls.__name__}._split has not been defined")

And then finally use overload to completely type __getitem__. There are two paths:

no default specified so the type is None, or
a default type is provided which we should union with.

TDefault = TypeVar("TDefault")

...

    @overload
    def __getitem__(self, path: TKey | tuple[TKey]) -> TValue | None: ...

    @overload
    def __getitem__(self, path: tuple[TKey, TDefault]) -> TValue | TDefault: ...

    def __getitem__(self, path: Any) -> Any:

From here we can see the generic types for DictIndexer are inferred by the dictionary literal. And we can see __getitem__ correctly sets the correct return types.

tree = DictIndexer({"a": None})
reveal_type(tree)
reveal_type(tree["a"])
reveal_type(tree["a", "default value"])

>>> mypy foo.py
foo.py:36: note: Revealed type is "foo.DictIndexer[builtins.str, None]"
foo.py:37: note: Revealed type is "None"
foo.py:38: note: Revealed type is "Union[None, builtins.str]"

Your __setitem__ is far less EAFP than __getitem__. If we ignore the "able to create new branches, but only when it's a clean branch" part we can make a very simple EAFP function.

Lets look at the edge-cases:

```
>>> {}["a"]
KeyError: 'a'
```

>>> None["a"]
TypeError: 'NoneType' object is not subscriptable

So we can quite easily rewrite __setitem__ in the exact same way we did with __getitem__.

def __setitem__(self, path: TKey, value: TValue) -> None:
    *head, tail = key, *_ = self._split(path)
    try:
        item = self.source
        for key in head:
            item = item[key]
    except KeyError:
        raise KeyError(f"Cannot create branch '{key}. Disabled") from None
    except TypeError:
        raise KeyError(f"Cannot overwrite '{key}.") from None
    else:
        item[tail] = value

Now we can change __setitem__ to support the dict.setdefault like behaviour.

try:
    item = self.source
    if not self.set_creates_branches:
        for key in head:
            item = item[key]
    else:
        for key in head:
            try:
                item = item[key]
            except KeyError:
                item[key] = item = {}
except KeyError:

from typing import Any, Generic, Iterable, TypeAlias, TypeVar, overload

TKey = TypeVar("TKey")
TValue = TypeVar("TValue")
TDefault = TypeVar("TDefault")

ATree: TypeAlias = dict[TKey, TValue] | TValue | dict[TKey, "ATree[TKey, TValue]"]


class BaseDictIndexer(Generic[TKey, TValue]):
    def __init__(
        self,
        source: ATree[TKey, TValue],
        set_creates_branches: bool = True,
    ) -> None:
        self.source = source
        self.set_creates_branches = set_creates_branches

    @classmethod
    def _split(cls, path: TKey) -> Iterable[TKey]:
        raise NotImplementedError(f"{cls.__name__}._split has not been defined")

    @overload
    def __getitem__(self, path: TKey | tuple[TKey]) -> TValue | None: ...

    @overload
    def __getitem__(self, path: tuple[TKey, TDefault]) -> TValue | TDefault: ...

    def __getitem__(self, path: Any) -> Any:
        path, default = path if isinstance(path, tuple) else (path, None)
        try:
            item = self.source
            for key in DictIndexer._split(path):
                item = item[key]  # type: ignore
            return item
        except KeyError:
            return default

    def __setitem__(self, path: TKey, value: TValue) -> None:
        *head, tail = key, *_ = self._split(path)
        try:
            item = self.source
            if not self.set_creates_branches:
                for key in head:
                    item = item[key]  # type: ignore
            else:
                for key in head:
                    try:
                        item = item[key]  # type: ignore
                    except KeyError:
                        item[key] = item = {}  # type: ignore
        except KeyError:
            raise KeyError(f"Cannot create branch '{key}'. Disabled") from None
        except TypeError:
            raise KeyError(f"Cannot overwrite '{key}'.") from None
        else:
            item[tail] = value  # type: ignore


class DictIndexer(BaseDictIndexer[str, TValue]):
    @staticmethod
    def _split(path: str) -> list[str]:
        return path.split("/")


# tree = DictIndexer({"a": None}, set_creates_branches=False)
tree = DictIndexer({"a": None})
print(tree["a"])
tree["b/c/d"] = None
print(tree["b/c/d"])
print(tree.source)

This is rocket-science for me, but I now see how powerful generics can be so I'll dig into that. — t3chb0t
– t3chb0t, Commented Mar 29, 2023 at 11:03
I'm studying your suggestions and I'm currently stuck on Tree[TKey, TValue] that I wasn't able to crack. You defined it as a TypeAlias (with a typo as ATree) from what I understand it can be either one of the tree types separated by the pipe |. What I don't understand is why is Tree taking two generic arguments? Is it because you use the first of the types in the list this is dict[TKey, TValue] as the source or is it becuase the types within the alias use generic types so you need to specify them in the declaration. I tend to think this is the latter. — t3chb0t
– t3chb0t, Commented Apr 1, 2023 at 10:06
Then in the last type Dict[TKey, "Tree[TKey, TValue]"] you use a string. I guess this is because of the recursion or self-reference. I'm typing it step by step in PyCharm and it recognizes this correctly. I think I've found it in the docs. It's called a forward reference, right? — t3chb0t
– t3chb0t, Commented Apr 1, 2023 at 10:13
1. I prefix all TypeAlias' with A. I also prefix Protocols with I (like C#) and suffix ABCs with Base. 2. Lets pretend ATree is ATree = dict[TKey, TValue]. When we're using ATree we're effectively just using dict. So when you pass the types to dict you need both the key and the value types. So because we've defined two generic types in the type alias we need to specify all the generics. -- An interesting side note is if you ATree = TValue | dict ... the arguments to ATree will be flipped to ATree[TValue, TKey]. Yes the string is a forward reference. — Peilonrayz
– Peilonrayz ♦, Commented Apr 1, 2023 at 16:39
@t3chb0t 4. Multiple assignment acts from left to right. item[key] = item = {} assigns item[key] first, then assigns item. If you instead do item = item[key] = {} then you're doing item = {}, item[key] = item, which should obviously become the infinitely recursive dict you see. 3. Your result doesn't seem right, I get head=[1, 2, 3, 4, 5], tail=6, key=1, x=[2, 3, 4, 5, 6]. Notice tail=6 not tail=1. As mentioned before multiple assignments act right to left. Additionally the resulting object in tuple unpacking is the entire tuple. We haven't sliced the tuple. — Peilonrayz
– Peilonrayz ♦, Commented Apr 1, 2023 at 16:48

Stack Exchange Network

Getting and setting dict items via paths

Implementation

Tests

Example

3 Answers 3

Bug

Testing

You must log in to answer this question.

Hot Network Questions

Getting and setting dict items via paths

Implementation

Tests

Example

3 Answers 3

Bug

Testing

You must log in to answer this question.

Related

Hot Network Questions