I'd like to find a container object to hold scalar values (it doesn't really matter whether they are integral, fractional, or floating-point types). The following snippet roughly outlines the use cases and the criteria it needs to fulfill. In the snippet I'm using a NumPy array, but I'd like to avoid that because it needs a lot of memory, as the array will only be sparsely filled. Other than that, it fulfills my requirements.

  • The object should hold scalar values that are indexed by a d-dimensional index with values 0, 1, 2, ..., n-1. (Imagine something like 5 <= d <= 100, 20 <= n <= 200.)
  • The object should be mutable, i.e. the values need to be updatable. (Edit: not actually necessary, see discussion below.)
  • All possible indices should be initialized to zero at the start. That is, all indices that have not been accessed should implicitly be assumed to hold zero.
  • It should be efficient to sum across one or multiple dimensions.

Is there a built-in Python object that satisfies this, or is there a data structure that can be efficiently implemented in Python?

So far I have found scipy's COO arrays, which satisfy some of the criteria, but they only support 1- and 2-d indexing.

For context: the idea is to build a frequency table of certain objects and then use it as a distribution to sample from, marginalize, etc.

import numpy as np

# just some placeholder data: (coordinates, value) pairs
data = [((1, 0, 5), 6), ((2, 6, 5), 100), ((5, 3, 1), 1), ((2, 0, 5), 4), ((2, 6, 5), 100)]

# structure mapping {0, ..., n-1}^d coordinates to scalars (here d = 3, n = 10)
data_structure = np.zeros((10, 10, 10))

# needs to be mutable  # EDIT: doesn't need to be, see discussion below
for coords, value in data:
    data_structure[coords] += value  # EDIT: can be done as a preprocessing step

# needs to be able to efficiently sum across dimensions
for k in range(100):
    x = data_structure.sum(axis=(1, 2))
    y = data_structure.sum(axis=(0,))
    z = data_structure.sum(axis=(0, 2))
  • scipy.sparse.coo_matrix is convenient for data definition, but isn't used for indexing or computation. However, scipy efficiently converts it to CSR format. lil and dok are good for iterative assignment; dok uses a Python dict under the covers. All of these are 2-d. Commented Apr 14, 2024 at 14:10
  • In the sparse-array world, sparsity (% of nonzero values) on the order of 5% or less is best. Commented Apr 14, 2024 at 14:13
  • Does adding/removing new values need to be fast? Does it need to be fast at summing over arbitrary axes, or can it be optimized for a particular axis or set of axes a priori? Commented Apr 14, 2024 at 16:24
  • Part of why scipy.sparse.coo does not implement indexing is that coordinates can be specified in any order, so finding any one index (which may be 0) is inefficient. When converted to CSR, points are sorted by row, and within rows by column; duplicates are summed. A lot of the CSR indexing is actually implemented via matrix multiplication, and so is row or column summation (see the sketch after these comments). Efficient CSR calculation methods were developed years ago by mathematicians working on large linear-equation systems. There is a [sparse-matrix] tag if you want to explore these topics more. Commented Apr 14, 2024 at 17:18
  • Do I understand correctly that the scipy.sparse types are out of the question anyway, as they all only implement matrices (2-d arrays)? Commented Apr 14, 2024 at 18:20
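
A minimal illustration of the summation-via-matrix-multiplication point from the comment above (the shapes and the use of sparse.random here are my own example, not from the thread):

import numpy as np
from scipy import sparse

A = sparse.random(5, 7, density=0.2, format="csr", random_state=0)

# row sums (summing over axis=1) are a matrix-vector product with a ones vector
row_sums = A @ np.ones(A.shape[1])
# column sums (summing over axis=0) are a vector-matrix product
col_sums = np.ones(A.shape[0]) @ A

assert np.allclose(row_sums, np.asarray(A.sum(axis=1)).ravel())
assert np.allclose(col_sums, np.asarray(A.sum(axis=0)).ravel())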

1 Answer

Although I'm a huge fan of NumPy, my first approach to this problem would be to use native types: a custom class that wraps a dict whose keys are the nonzero index tuples and whose values are the corresponding scalars (the custom class would mainly be responsible for a nicer API).

Indexing individual values is trivial and O(1). Marginalization looks like O(N) in the number of nonzero entries if you loop over the items and accumulate values into the resulting object. Memory access of the values used for a given sum will not be efficient, but hopefully with enough sparsity this is not the determining factor.
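
A minimal sketch of that idea (the class and method names are mine, not from the answer):

from collections import defaultdict

class SparseTable:
    """d-dimensional table of scalars; unset indices implicitly hold zero."""

    def __init__(self, ndim):
        self.ndim = ndim
        self._data = defaultdict(float)  # nonzero index tuple -> value

    def __getitem__(self, coords):
        # O(1) lookup; missing keys are implicitly zero
        return self._data.get(tuple(coords), 0.0)

    def add(self, coords, value):
        # mutable accumulation, as in the question's preprocessing loop
        self._data[tuple(coords)] += value

    def marginalize(self, keep_axes):
        # sum out every axis not in keep_axes; O(number of nonzeros)
        out = SparseTable(len(keep_axes))
        for coords, value in self._data.items():
            out._data[tuple(coords[a] for a in keep_axes)] += value
        return out

# usage with the question's placeholder data
table = SparseTable(3)
for coords, value in [((1, 0, 5), 6), ((2, 6, 5), 100), ((5, 3, 1), 1),
                      ((2, 0, 5), 4), ((2, 6, 5), 100)]:
    table.add(coords, value)

x = table.marginalize(keep_axes=(0,))    # analogue of data_structure.sum(axis=(1, 2))
y = table.marginalize(keep_axes=(1, 2))  # analogue of data_structure.sum(axis=(0,))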

What would be problematic is something like indexing into a single axis and gathering all values along the remaining dimensions, as in arr[:, i, ...]. If this is something you need to handle, then I'd try keeping an auxiliary list of sorted keys (index tuples), with which binary search (or something similar) could be used to find the entries that hold a certain value along a certain axis, assuming that's faster than just looping over all keys and picking out the ones that match. A sketch of this follows.
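
One possible sketch of that auxiliary structure, assuming the keys are kept in lexicographic order. Note that a single sorted list only supports binary search directly on the leading axis (the arr[i, ...] case); other axes would need a separate sort order per axis or a full scan. The function name is hypothetical:

import bisect

def slice_leading_axis(data, sorted_keys, i):
    """Analogue of arr[i, ...]: all (coords, value) pairs with coords[0] == i.
    `data` is the dict of the table; `sorted_keys` is the auxiliary
    lexicographically sorted list of its index tuples."""
    lo = bisect.bisect_left(sorted_keys, (i,))      # first key with coords[0] >= i
    hi = bisect.bisect_left(sorted_keys, (i + 1,))  # first key with coords[0] >= i + 1
    return [(k, data[k]) for k in sorted_keys[lo:hi]]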


1 Comment

If searching is needed, then I would opt for a spatial indexing data structure, for example an R*-tree.
