I'd like to find a container object to hold scalar values (it doesn't really matter whether they are integral, fractional, or floating-point types). The following snippet roughly outlines the use cases and the criteria it needs to fulfill. In the snippet I'm using a NumPy array, but I'd like to avoid that because it needs a lot of memory, as the array will only be sparsely filled. Other than that, it fulfills my requirements.

  • The object should hold scalar values that are indexed by a d-dimensional index with values 0, 1, 2, ..., n-1. (Imagine something like 5 <= d <= 100, 20 <= n <= 200.)
  • The object should be mutable, i.e. the values need to be updatable. (Edit: not actually necessary, see discussion below.)
  • All possible indices should be initialized to zero at the start. That is, all indices that have not been accessed should implicitly be assumed to hold zero.
  • It should be efficient to sum across one or multiple dimensions.

Is there a built-in Python object that satisfies this, or is there a data structure that can be efficiently implemented in Python?

So far I have found scipy's COO arrays, which satisfy some of the criteria, but they only support 1- and 2-d indexing.

For context: the idea is to build a frequency table of certain objects and then use it as a distribution to sample from, marginalize, etc.

import numpy as np

# just some placeholder data: (coordinates, value) pairs
data = [((1, 0, 5), 6), ((2, 6, 5), 100), ((5, 3, 1), 1), ((2, 0, 5), 4), ((2, 6, 5), 100)]

# structure mapping {0, ..., n-1}^d coordinates to scalars (here d = 3, n = 10)
data_structure = np.zeros((10, 10, 10))

# needs to be mutable  # EDIT: doesn't need to be, see discussion below
for coords, value in data:
    data_structure[coords] += value  # EDIT: can be done as a preprocessing step

# needs to be able to efficiently sum across dimensions
for k in range(100):
    x = data_structure.sum(axis=(1, 2))
    y = data_structure.sum(axis=(0,))
    z = data_structure.sum(axis=(0, 2))
  • scipy.sparse.coo_matrix is convenient for data definition, but isn't used for indexing or computation. However, scipy efficiently converts it to CSR format. lil and dok are good for iterative assignment; dok uses a Python dict under the covers. All of these are 2-d. Commented Apr 14, 2024 at 14:10
  • In the sparse-array world, sparsity (% of nonzero values) on the order of 5% or less is best. Commented Apr 14, 2024 at 14:13
  • Does adding/removing new values need to be fast? Does it need to be fast at summing over arbitrary axes, or can it be optimized for a particular axis or set of axes a priori? Commented Apr 14, 2024 at 16:24
  • Part of why scipy.sparse.coo does not implement indexing is that coordinates can be specified in any order, so finding any one index (which may be 0) is inefficient. When converted to CSR, points are sorted by row, and within rows by column; duplicates are summed. A lot of the CSR indexing is actually implemented via matrix multiplication, and so is row or column summation (see the sketch after these comments). Efficient CSR calculation methods were developed years ago by mathematicians working on large linear-equation systems. There is a [sparse-matrix] tag if you want to explore these topics more. Commented Apr 14, 2024 at 17:18
  • Do I understand correctly that the scipy.sparse types are out of the question anyway, as they all only implement matrices (2-d arrays)? Commented Apr 14, 2024 at 18:20
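
A minimal illustration of the summation-via-matrix-multiplication point from the comment above (the shapes and the use of sparse.random here are my own example, not from the thread):

import numpy as np
from scipy import sparse

A = sparse.random(5, 7, density=0.2, format="csr", random_state=0)

# row sums (summing over axis=1) are a matrix-vector product with a ones vector
row_sums = A @ np.ones(A.shape[1])
# column sums (summing over axis=0) are a vector-matrix product
col_sums = np.ones(A.shape[0]) @ A

assert np.allclose(row_sums, np.asarray(A.sum(axis=1)).ravel())
assert np.allclose(col_sums, np.asarray(A.sum(axis=0)).ravel())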

1 Answer

Although I'm a huge fan of NumPy, my first approach to this problem would be to use native types: a custom class that wraps a dict whose keys are the nonzero index tuples and whose values are the corresponding scalars (the custom class would mainly be responsible for a nicer API).

Indexing individual values is trivial and O(1). Marginalization looks like O(N) in the number of nonzero entries if you loop over the items and accumulate values into the resulting object. Memory access of the values used for a given sum will not be efficient, but hopefully with enough sparsity this is not the determining factor.
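
A minimal sketch of that idea (the class and method names are mine, not from the answer):

from collections import defaultdict

class SparseTable:
    """d-dimensional table of scalars; unset indices implicitly hold zero."""

    def __init__(self, ndim):
        self.ndim = ndim
        self._data = defaultdict(float)  # nonzero index tuple -> value

    def __getitem__(self, coords):
        # O(1) lookup; missing keys are implicitly zero
        return self._data.get(tuple(coords), 0.0)

    def add(self, coords, value):
        # mutable accumulation, as in the question's preprocessing loop
        self._data[tuple(coords)] += value

    def marginalize(self, keep_axes):
        # sum out every axis not in keep_axes; O(number of nonzeros)
        out = SparseTable(len(keep_axes))
        for coords, value in self._data.items():
            out._data[tuple(coords[a] for a in keep_axes)] += value
        return out

# usage with the question's placeholder data
table = SparseTable(3)
for coords, value in [((1, 0, 5), 6), ((2, 6, 5), 100), ((5, 3, 1), 1),
                      ((2, 0, 5), 4), ((2, 6, 5), 100)]:
    table.add(coords, value)

x = table.marginalize(keep_axes=(0,))    # analogue of data_structure.sum(axis=(1, 2))
y = table.marginalize(keep_axes=(1, 2))  # analogue of data_structure.sum(axis=(0,))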

What would be problematic is something like indexing into a single axis and gathering all values along the remaining dimensions, as in arr[:, i, ...]. If this is something you need to handle, then I'd try keeping an auxiliary list of sorted keys (index tuples), with which binary search (or something similar) could be used to find the entries that hold a certain value along a certain axis, assuming that's faster than just looping over all keys and picking out the ones that match. A sketch of this follows.
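
One possible sketch of that auxiliary structure, assuming the keys are kept in lexicographic order. Note that a single sorted list only supports binary search directly on the leading axis (the arr[i, ...] case); other axes would need a separate sort order per axis or a full scan. The function name is hypothetical:

import bisect

def slice_leading_axis(data, sorted_keys, i):
    """Analogue of arr[i, ...]: all (coords, value) pairs with coords[0] == i.
    `data` is the dict of the table; `sorted_keys` is the auxiliary
    lexicographically sorted list of its index tuples."""
    lo = bisect.bisect_left(sorted_keys, (i,))      # first key with coords[0] >= i
    hi = bisect.bisect_left(sorted_keys, (i + 1,))  # first key with coords[0] >= i + 1
    return [(k, data[k]) for k in sorted_keys[lo:hi]]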


1 Comment

If searching is needed, then I would opt for a spatial indexing data structure, for example an R*-tree.
